Tassilo Pellegrini

NYT, Wolters Kluwer Germany and Semantic Universe sponsor Triplification Challenge 2010

3.000 € prize money for the  most promising Linked Data applications


The New York Times, Wolters Kluwer Germany and Semantic Universe sponsor the Triplification Challenge 2010 taking place at the I-SEMANTICS Conference from 1 – 3 September 2010 in Graz / Austria. Together they have provided 3.000 Euro in prize money which will be given to the  most promising application demonstrations and approaches built upon Linked Data.

The Challenge is organized by an international consortium consisting of Juan Sequeda (University of Texas at Austin), Bernhard Schandl (University of Vienna), Sören Auer (University of Leipzig), Ivan Herman (World Wide Web Consortium), Tassilo Pellegrini (Semantic Web Company Vienna) and patroned by Sir Tim Berners-Lee.

Participants can choose between an open track and a special NYT track.

The open track sponsored by Wolters Kluwer Germany and Semantic Universe encourages submissions that fit into one or more of the following categories:

  • novel data sets that are published as part of the Web of Data, according to Linked Data principles, and demonstrating potential benefit of use within applications;
  • novel generic mechanisms, approaches, and technologies that convert certain types and formats of information into triples, interlink them to other data sets, and expose them as Linked Data;
  • applications showcasing the benefits of Linked Data to end-users such as for information syndication, specialized search, browsing, or augmentation of content.

The NYT track invites submissions that make use of the Linked Data published at data.nytimes.com and one or more government datasets that relate to politics. Any dataset qualifies that is produced by any government in the world that would be of interest to a constituent of that government. These can range from the demographics of election districts to campaign finance to corporate spending on political messaging.

The submission deadline for both tracks is 18 May 2010.

All submissions will be reviewed by an international program committee from industry and academia electing two winners in the open track and one winner in the NYT track.

For detailed information on the Triplification Challenge visit

http://i-semantics.tugraz.at/triplification-challenge or

http://triplify.org/Challenge/2010

Cordial thanks to our sponsors:

Tassilo Pellegrini

Interview with Georgi Kobilarov: “I believe that data publishing must happen in a distributed style.”

Uberblic.org connects structured data from the web. The Berlin-based inventor Georgi Kobilarov gives a brief insight into the mashup service and talks about the challenges when it comes to build applications upon linked data.

You have recently published the service uberblic.org, a Linked Data mashup editor. What was your motivation to develop this tool?

Uberblic.org provides an integrated view of web data. Our goal is to integrate all the structured data on the web, and give web-developers a single point to access to that reconciled data. More than that, we will open up the tools we use to manage the data sources to the community, so that the people can help us curating that repository of free data. We re-publish all the data we import as Linked Data, under the licenses of the original data publishers.

Some of the data sources we import are available in the Linked Open Data cloud as well, but many are not. Linked Data is an elegant way to publish data in a distributed way on the web, but consuming it from that distributed cloud is – at least – impractical. In every real-world application using linked data from the web I’ve seen, organizations built up internal copies of the cloud, and often even reconcile linked data sources. They build their own Linked Data proxies. Uberblic.org helps those users by providing one public proxy for data from the web. Many of our sources get monitored for data changes, and the according data in uberblic is updated in real-time.

uberblic

Can you give us a brief insight how the tool works? What technology is is built on?

My company, Uberblic Labs, has developed a data integration platform that we use to power uberblic.org. We call it the Uberblic Platform (the name uberblic is derived from the German “Überblick” – English “overview”). This platform enables us to do the full process of “data fusion”: Importing and converting external data sources, mapping the data schemas to a central ontology, filtering out data errors, automatically suggesting duplicates to the user, and merging data from different sources into a single, reconciled representation.

Structured and semi-structured data from the web is an excellent use case for our software platform, since there we come across all the interesting cases of real-world data heterogeneity. But what I think is especially powerful and yet missing in other Linked Data projects I know, is the ability to subscribe to update-feeds. We do that extensively, fetching updates in real-time from Wikipedia and the like.

Our platform is built in Scala and runs a on cluster of machines, with workers communicating through a messaging system. We developed an RDF storage layer on top of a distributed key-values store for storing all provenance information used in the extraction process, currently around 100 million named graphs for uberblic.org. That storage layer does not directly provide SPARQL access, so we push all the output data into a SPARQL endpoint hosted by Talis as well.

What have been the biggest challenges in tackling the integration issues of dispersed data?

It was quite a steep learning curve to do Linked Data not only in an academic environment, but in a reliable, industry-strength set-up. In academia, there was always the excuse that things are just research prototypes. Now that excuse is gone. That’s also where it becomes necessary to manually clean up data. And there are two ways to do that: Either you enable the users to change facts directly in your repository after you have imported the external data (that is what Freebase does), or you facilitate clean-up cycles in the original data source and fetch these updates in real-time. That is what we do.

I believe that data publishing must happen in a distributed style, because then each data source gets taken care of by a specialized group of people using specialized tools. And it’s what you see not only on the web, but also inside organizations and enterprises. But consuming data trough centralized APIs is more than just convenient. We all use Google
or another search engine as a central access point to web pages which are published in a distributed way all over the web, don’t we? Can you imagine today researching a topic on the web without the centralization power of search engines, just by following links across web sites, like in the old days?

When we built the Uberblic Platform, some of the things I imagined to be large headaches, like schema mapping, turned out to work really well. Those pathologic cases you often see in academic “challenges” are – well – pathologic. It’s not necessary to solve them fully automatically through super-intelligent algorithms. Much more important than the sophistication of your algorithms are well designed workflows so that the user becomes a part of the solution. And that’s not about crowd-sourcing or swarm intelligence, the editorial curating of schema mappings and object reconciliation can be done just by a small team of people. If they have the right set of tools.

What are the next plans with uberblic.org? Where will the journey go?

Uberblic.org will continue to integrate more interesting and useful data sources from the web, and we will start making more APIs available to web developers to build their applications on top. We are also looking for partners who are interested in developing applications and have been struggling in the past to get the cross-source data from the web they need.

The work on improving uberblic.org will also benefit our Uberblic Platform, and hence our clients who use that same software for integrating organizational data sources with each other and with the web of data.

About Georgi Kobilarov

Georgi is founder and managing director of Uberblic Labs, a company based in Berlin specialized in Linked Data integration. He worked as a research associate in the Web-based Systems Group at Freie Universität Berlin and as a visiting researcher at Hewlett Packard Labs Bristol. As co-founder and lead developer of DBpedia, he was also a day-one contributor to the Linking Open Data project. Georgi is consulting with the BBC on several Linked Data related projects. He organizes the Web of Data Meetup London, a bi-yearly gathering of the UK Linked Data community. Georgi graduated with a Diplom in business administration from Freie Universität Berlin and has many years of work experience as a software developer. Visit his blog: http://blog.georgikobilarov.com

Tassilo Pellegrini

Interview with Marco Neumann: “It’s definitely an exciting time to be on the Semantic Web!”

Marco Neumann is an Information Scientist and CEO of KONA a consulting and technology service company based in New York City. The Semantic Web activist is an invited expert to the W3C HTML 5 working group. He recently started a discussion on the challenges and difficulties in bringing the Semantic Web into business. SWC asked him for some additional comments.

Marco, you recently initiated a discussion in a Google Group on the difficulty to change Semantic Web standards. What was the background of the discussion? Where do you perceive a need for action?

It’s not so much about changing this existing standards but the challenge to bring them into the world of practitioners and standards developers. The language used in W3C recommendations quite frequently requires advanced topic knowledge and familiarity with the jargon of the discussion about the respective technologies. I recently discussed this with a senior standards maven at the W3C and got the answer that the recommendations can’t be changed retrospectively and that they are intended to be used primarily by vendors for implementation purposes.

Well this might be the case but I also got the impression that Tim Berners-Lee objective for the W3C is primarily to meet the needs of a larger community. And the W3C took this into account for most of the Semantic Web recommendations in the past. Something I still find amazing is the fact that the work process at the W3C is partially and the recommendations are entirely publicly accessible. Though we definitely still need more and better tools to work with semantic web data, higher quality documentation and last but not least more user adoption on the web.

Critics of the Semantic Web often refer to the slow uptake of Semantic Web standards by industry. Is standards adoption actually a valid and sufficient metric to evaluate the maturity of a standard? What would be needed to accelerate the uptake?

I think we might see a similar scenario to the uptake of HTML in the early 90s, a relatively small number of technology mavens will pave the way towards making the Semantic Web more attractive as a technology solution for a wide range of applications and will successfully publish open data before we see business application developers make use of Semantic Web standards.

The availability of trustable and quality approved RDF data is crucial for the success of the Semantic Web. Given the fact that the aggregation business on the WWW is highly concentrated the corresponding formula is simple: If Google just consumes but does not give back RDF the Semantic Web won’t scale. Do you agree?

Yes and no. Yes we need better and more semantic data on the Web, but we will also need better ways to deal with trust in a lightweight and web friendly fashion. I currently see a number of semi automated approaches emerging  that could scale on the web. An example are distributed user based recommendation systems to validate authenticity, open Wikipedia style community evaluation and content curation a la freebase. Increased public accountability for data producers might be an interesting venue as well. In regards to Google I’d say web search engines will go where the web goes. A problem I might see arising is that web search engines will initially develop their own standards to deal with the emerging Semantic Web and confuse users on the web or might pursue a time consuming power play with the W3C. I see a little bit of that in the current discussion in the HTML 5 working group.

As we know from social sciences technological standards are necessary but always incomplete and unsatisfactory. From a standards design and outreach perspective: What would it need to make the Semantic Web flourish?

I’m not sure if we really know all that much about the laws of innovation and the evolution of technology standards at this point. If we draw from the short experience with the World Wide Web I would come to the conclusion that innovation takes place in small to medium size teams that pursue an independent vision of how services should be delivered and how the technology should be designed. In addition Tim Berners-Lee’s encourages the production of lots and lots of data to bootstrap the Semantic Web and create a pull for services in the industry. And indeed we really see some traction for example with the Linked Open Data and Open Government initiatives. It’s definitely an exciting time to be on the Semantic Web!

About Marco Neumann

Marco Neumann is an Information Scientist and CEO of KONA a consulting and technology service company based in New York City. KONA provides semantic technologies to businesses solutions and adds value to products and services in a highly networked economy. In addition Marco currently acts as an Invited Expert to the W3C on the HTML 5 working group and is the director of the global semantic social network lotico.com.

Andreas Blumauer

TuQS QuadStore combines the best of two worlds

A new QuadStore which combines the best of two worlds (Lucene/Fulltext search engines & TripleStores/RDF/SPARQL) is out and can be evaluated online.

TuQs offers the following feature:

  • SAIL accessible
  • True QuadStore with GraphSupport
  • HighSpeed regex SPARQL filters
  • Userrights on TripleBasis
  • Extendable to a QuintStore (or more generally to an n-Store)
  • Cachable SPARQL Queries for further speed improvement
  • Clusterable
  • Federationable
  • FullTextSearchable

Some queries are really complex and high-speed, e.g.:

SELECT ?s ?o
WHERE {
?s <http://www.w3.org/2004/02/skos/core#definition> ?o .
?o <http://www.turnguard.com/tuqs/function#BooleanTerm> ‘Computer AND (java* OR HTML)’
}

The best starting point to find out, what´s the speciality of TuQS is here: Just click the sample queries on the right side and see how fast they perform even on very simple hardware.

Next steps: The developer of TuQS, Jürgen Jakobitsch (aka Turnguard), is currently working on SAIL inferencing.