Pascal Hitzler

A (very personal) bit of ISWC08 trendspotting

As ISWC08 is drawing to a close, it dawns on me that something which Frank van Harmelen has been forecasting for years is now happening, seemingly without conscious effort. He calls it Approximate Reasoning – have a look at his ESWC06 keynote. The basic idea behind it is to do reasoning over ontologies with a different focus, namely by giving up some reasoning correctness in order to gain better scalability.

And indeed, at ISWC08 I have seen a number of things which fit exactly into this corner (while at the same time the authors/programmers might not even be aware of it).

  • As part of the Billion Triple Challenge, Axel Polleres presented the SAOR system, which does approximate OWL reasoning by means of forward chaining rules. Now you can’t do OWL reasoning (in a sound and complete way) with forward chaining rules (and Axel knows this), so in the end you’re losing some consequences. But at the same time you do get some consequences when having to deal with large amounts of data.
  • Eyal Oren, also at the Billion Triple Challenge, presented the MARVIN system which performs approximate RDF reasoning by means of massive parallelisation. MARVIN comes out of the EU project LarKC, which is actually pursuing approximate reasoning on a large scale (pun intended). Edit: This one actually won the 3rd prize at the challenge.
  • Among the results presented at ISWC08, I found those by Claudia D’Amato on Statistical Learning for Inductive Query Answering on OWL Ontologies really amazing. She and her collaborators managed to do OWL instance retrieval without any deduction algorithm. Instead they used Support Vector Machines and learned which (named) OWL classes individuals belong to. The learning was done from a small sample set (generated by a reasoner), but the learned model was able to generalise from the data to achieve about 90% coverage. In my opinion, this is something conceptually new and it is really remarkable that it works.
  • In a regular paper Eyal Oren also reported on using Evolutionary Algorithms for RDF query answering.
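
To make the forward chaining idea from the first item a bit more concrete, here is a minimal sketch. It is not SAOR's actual rule set: the two rules (transitivity of subClassOf, and propagation of type along subClassOf) and the function name `forward_chain` are assumptions chosen purely for illustration. Rules are applied until nothing new is derived, so only consequences covered by the rules are found, which is exactly where completeness is given up.

```python
def forward_chain(triples):
    """Apply two RDFS-style rules to a set of (s, p, o) triples
    until a fixpoint is reached. Illustrative only."""
    facts = set(triples)
    while True:
        new = set()
        # Collect all subclass edges once per round.
        sub = [(s, o) for (s, p, o) in facts if p == "subClassOf"]
        for (s, p, o) in facts:
            for (c, d) in sub:
                if c != o:
                    continue
                if p == "subClassOf":
                    # Rule 1: subClassOf is transitive.
                    new.add((s, "subClassOf", d))
                elif p == "type":
                    # Rule 2: instances of a class belong to its superclasses.
                    new.add((s, "type", d))
        if new <= facts:
            # Nothing new was derived: fixpoint reached.
            return facts
        facts |= new


data = {
    ("Cat", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
    ("tom", "type", "Cat"),
}
closure = forward_chain(data)
print(("tom", "type", "Animal") in closure)
```

Anything that would need a rule outside this small set is simply never derived, no matter how much data is fed in.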

The above is only a selection of approximate reasoning related things at ISWC08. There was also the Workshop on Nature inspired Reasoning for the Semantic Web where related ideas were discussed. At the co-located Web Reasoning and Rule Systems conference, RR2008, there will be two papers on approximate reasoning (incidentally, with me as coauthor).

I foresee the importance of such approaches rising substantially in the future (and I think it’s a safe guess since Frank also seems to think so). The Billion Triple Challenge series could become one of the driving forums for this. There are exciting times ahead!

Author: Pascal Hitzler, AIFB, University of Karlsruhe (TH), Germany

Jana Herwig

The Day after Freebase went RDF

So what’s been happening on the blogosphere after John Giannandrea’s keynote at ISWC and the revelation that Freebase now produces Linked Data from an RDF service?

Tetherless World sums up the Freebase facts (e.g. 156,000,000 assertions made; 1370 published types; 75 domains; graph model, identity, web based) and further points out that ontology creation “is a social process, and both freebase and semantic wiki are tools that enable users to create ontological vocabulary without worrying too much on building a comprehensive ontology.”

Inkdroid notes that the RDF service release “is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki style interface, fancy data loaders, and useful machine APIs.” This is followed up by a quick and handy tutorial on how to get machine-readable data back from Freebase using a URI. Conclusion:

So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed for agents to be able to crawl the web of documents and build indexes of the data to allow people to find what they want (hopefully). Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It’s early days still, but seeing an organization like Freebase get it is pretty exciting.

Yves Raimond was the first to wonder on the public W3C LOD mailing list: “now, to see whether it links to other datasets :-) ” – the idea of having linked data without the linkage would indeed seem like love’s labour lost. Semantic Focus / James Simmons seconds: “One downside is the data doesn’t appear to link to external resources, in a sense walling itself in. It should be trivial to link the topics that came from Wikipedia back to Wikipedia as well as DBpedia (which would be killer, by the way).” This is followed up by a later post, where James expresses concerns regarding the relationship DBpedia / Freebase: “Freebase may see a drop in userbase growth and participation if it becomes a mirror of DBpedia (or vice-versa) and the popularity once garnered by one project may shift towards the other, or away entirely.”

More News / Andrew Newman puts the Freebase RDF service release in context with Cathrin Weiss’ “250 million triples on your iphone” submission, iMoCo, to the Billion Triple Challenge, and also with DBpedia and SemaPlorer, developed at the University of Koblenz:

DBPedia stood out because it was the only one that allowed you to write data to the Semantic Web rather than just read the carefully prepared triples. For a similar reason I though SemaPlorer was good because they tried to do more than just the standard triples but went that extra bit further by making it more generic like integrating flickr. But they were all excellent, all of them showing what you get with a billion or more triples and inferencing.

That combined with the guys at Freebase making all of their data available as RDF and it was a big day for the Semantic Web.

ARQtick / AndyS plays a bit with the Blade Runner example cited by Freebase, e.g. takes a look at the graph, looks for interesting properties and extracts author names.

N.B. If you want to follow ARQtick’s example: use the Linked Data browser plugin Tabulator or go to the Marbles site to view the RDF – without a data browser you’ll be redirected to the HTML page. You will also need it to make sense of rdf.freebase.com.

Jana Herwig

The Future, Quantum Encryption, Privacy on the Social Semantic Web

Just two memos: There is a talk tonight with Thomas Länger from the Viennese quantum encryption project (BBC article about the project), co-organized by quintessenz (an organisation devoted to civil rights in the information age) and Transforming Freedom (who are dedicated to documenting the discourse of the battle zones of digital culture; I volunteer for them). ORF wrote a German article about it, with information about the venue and start time. The key issue quintessenz want to raise with this talk is: Who is going to benefit? Will “unbreakable” quantum encryption become available to citizens, too? Quantum encryption cartridges for your PC, anyone?

Secondly: I published an “inaugural interview” Marion Fugléwicz-Bren did with two of my colleagues, Matthias Samwald and Thomas Schandl (not so inaugural for the former, as he already joined SWC in January). I’d like to extract this quote by W3C member Samwald regarding privacy on the (corporation owned) social web and the future (user-managed) social semantic web:

I also think that Semantic Web technologies will receive a lot of media attention when the first big, public breach in security / privacy happens in one of the websites that currently dominate the whole world wide web. At the moment, we all are uploading most of our private and business lives to web sites such as Google, Facebook, Flickr and others. It is just a matter of time until a big scandal happens, be it the companies themselves that misuse the vast amounts of data they have, or be it a government agency in an overzealous effort of crime prevention.

When this will happen, people will re-evaluate the trend towards massive centralisation on the web, and will search for opportunities to make the same feeling of being ‘in the network’ happen in a distributed environment, without selling ones soul to a multinational corporation. Then we will find that such an opportunity already exists — the Semantic Web.

Read the whole interview here.

Jana Herwig

Multimedia in the Web of Data – Annotating and Interlinking Photos, Music, Multimedia [WOD-PD]

The Web of Data Practitioners Days concluded with the session on Multimedia in the Web of Data, the first part of which was led by Ansgar Scherp (University of Koblenz-Landau, Germany).

Multimedia content, as Ansgar pointed out, is hardly annotated, badly organized, and hardly ever looked at again – just think of the 300-something pics you might take on an average weekend getaway, and which you never touch again. Annotating multimedia content requires a lot of work and dedication – but most of the time, these pictures eventually disappear in the “digital shoe box” that is your photo management software.

The most obvious remedy is to annotate content as early as possible, ideally when creating the content, ideally already on your portable camera (formerly known as: mobile phone :). Ansgar suggested providing incentives to encourage picture annotation – professionals could for instance receive a higher financial reward if they deliver already annotated pictures. And of course there are ‘Games with a purpose’ such as Google Image Labeler, where players tag images in pairs, with and against each other, and are rewarded with the entertainment factor of the game.

The slide below shows what has happened (or will happen) to the process of creating photo books in the digital age and the age of mashups:

Ansgar Scherp's slides

After all, this is the age of the social semantic web, so why not try and (re-)use the content, structure and contexts that other users have already created on the web? Content augmentation, for the scope that Ansgar is concerned with, consists in the reuse of content and structures (e.g. from sources such as Flickr and Wikipedia, Geonames) made possible through the definition of rules, e.g.:

  • If there are two or fewer pictures on a page*
  • then automatically augment the page with additional photos using location information.

* Page here means a page in the album you are currently working on – you probably took a picture of yourself and your friend in Paris, and even though you went to the Centre Pompidou, you forgot to actually take a pic of the building itself – well, let the web be your library!
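
As a rough illustration, the augmentation rule above could be sketched like this. Everything named here is hypothetical: `augment_page`, the `find_photos_near` lookup (standing in for a query against a source such as Flickr or Geonames) and the `target` of three photos per page are assumptions for the sake of the example, not part of Ansgar's system.

```python
from typing import Callable, List


def augment_page(photos: List[str],
                 location: str,
                 find_photos_near: Callable[[str], List[str]],
                 threshold: int = 2,
                 target: int = 3) -> List[str]:
    """If a page holds `threshold` or fewer photos, top it up to
    `target` photos using a location-based lookup (hypothetical)."""
    if len(photos) > threshold:
        # Rule does not fire: the page already has enough pictures.
        return list(photos)
    # Ask the external source for photos taken at the same place,
    # skipping anything already on the page.
    extras = [p for p in find_photos_near(location) if p not in photos]
    missing = target - len(photos)
    return list(photos) + extras[:missing]


# A stub standing in for a real web lookup.
def fake_lookup(location: str) -> List[str]:
    return ["pompidou_front.jpg", "pompidou_roof.jpg", "pompidou_night.jpg"]


page = augment_page(["us_in_paris.jpg"], "Centre Pompidou", fake_lookup)
print(page)
```

Passing the lookup in as a function keeps the rule itself independent of whichever web source actually supplies the extra photos.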

So the goal is clear: develop a procedure for applying automatic content augmentation in the creation of good photo books.

But what makes a ‘good’ photo book anyway? Here are some of the results of a structural analysis of real, human-created photobooks conducted at CeWe Color:

  • % of photos with faces: 36%
  • Number of album pages: 16.96
  • Photos per page: 6.69
  • Text fields per page: 1.45
  • % of pages with text: 87%

There are many rules that can be established from the structural analysis, which can be applied in turn in the creation of photobooks, e.g. a rule like this one:

  • If the text is located in the upper third of a page
  • if the font size is equal to or larger than 16 points
  • if the number of words is less than 10
  • if there is no caption on the page that has a bigger font size
  • then this page is the title
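
The title rule above translates fairly directly into a predicate. The sketch below is only one possible reading: the `TextField` structure, its `y` coordinate (0.0 at the top of the page, 1.0 at the bottom) and the `is_caption` flag are assumptions made for illustration, since the post does not say how pages are represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextField:
    text: str
    font_size: float          # in points
    y: float                  # vertical position: 0.0 = top, 1.0 = bottom
    is_caption: bool = False  # hypothetical flag for photo captions


def is_title_page(candidate: TextField, page_fields: List[TextField]) -> bool:
    """Check the four conditions of the title rule for one text field."""
    in_upper_third = candidate.y < 1 / 3
    large_font = candidate.font_size >= 16
    few_words = len(candidate.text.split()) < 10
    # No caption on the page may use a bigger font than the candidate.
    no_bigger_caption = all(
        f.font_size <= candidate.font_size
        for f in page_fields
        if f.is_caption
    )
    return in_upper_third and large_font and few_words and no_bigger_caption


heading = TextField("Our Weekend in Paris", font_size=24, y=0.1)
caption = TextField("Centre Pompidou", font_size=10, y=0.8, is_caption=True)
print(is_title_page(heading, [heading, caption]))
```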

Ansgar recommended xSmart, which he described as a “context-driven authoring tool for page-based multimedia presentations.”

Ansgar’s presentation was followed by two more: one by Yves Raimond on Interlinking Music on the Web of Data, and one on Interlinking Multimedia – in spite of better intentions, I did not manage to cover these two in detail, but at least I gathered the links to relevant resources from all three sessions…