The Semantic Puzzle

Tassilo Pellegrini

Interview with Georgi Kobilarov: “I believe that data publishing must happen in a distributed style.”

Uberblic.org connects structured data from the web. The Berlin-based inventor Georgi Kobilarov gives a brief insight into the mashup service and talks about the challenges of building applications upon linked data.

You have recently published the service uberblic.org, a Linked Data mashup editor. What was your motivation to develop this tool?

Uberblic.org provides an integrated view of web data. Our goal is to integrate all the structured data on the web, and give web developers a single point of access to that reconciled data. More than that, we will open up the tools we use to manage the data sources to the community, so that people can help us curate that repository of free data. We re-publish all the data we import as Linked Data, under the licenses of the original data publishers.

Some of the data sources we import are available in the Linked Open Data cloud as well, but many are not. Linked Data is an elegant way to publish data in a distributed way on the web, but consuming it from that distributed cloud is – at least – impractical. In every real-world application using linked data from the web I’ve seen, organizations built up internal copies of the cloud, and often even reconciled linked data sources. They build their own Linked Data proxies. Uberblic.org helps those users by providing one public proxy for data from the web. Many of our sources get monitored for data changes, and the corresponding data in uberblic is updated in real time.

Can you give us a brief insight into how the tool works? What technology is it built on?

My company, Uberblic Labs (http://uberblic.com/), has developed a data integration platform that we use to power uberblic.org. We call it the Uberblic Platform (the name uberblic is derived from the German “Überblick” – English “overview”). This platform enables us to do the full process of “data fusion”: importing and converting external data sources, mapping the data schemas to a central ontology, filtering out data errors, automatically suggesting duplicates to the user, and merging data from different sources into a single, reconciled representation.
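To make the “data fusion” steps concrete, here is a minimal Python sketch of the sequence described above: schema mapping, error filtering, duplicate suggestion and merging. It is only an illustration under stated assumptions, not the Uberblic Platform's code: the source names, the property names in `SCHEMA_MAP`, the validity rule and the similarity threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from source-specific property names to a central schema.
SCHEMA_MAP = {
    "wikipedia": {"name": "label", "pop": "population"},
    "geodata": {"title": "label", "inhabitants": "population"},
}


def map_record(source, record):
    """Rename source properties to central-schema properties; drop unmapped ones."""
    mapping = SCHEMA_MAP[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def is_valid(record):
    """Assumed error filter: a label is required, population must be a non-negative int."""
    population = record.get("population")
    has_label = bool(record.get("label"))
    population_ok = population is None or (isinstance(population, int) and population >= 0)
    return has_label and population_ok


def suggest_duplicates(records, threshold=0.85):
    """Return index pairs whose labels are similar enough to show to an editor."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a = records[i]["label"].lower()
            b = records[j]["label"].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return pairs


def merge(primary, secondary):
    """Merge two records an editor confirmed as duplicates; `primary` wins on conflict."""
    return {**secondary, **primary}
```

The duplicate step only suggests pairs and leaves the decision to a person, which mirrors the point made later in the interview that well-designed workflows matter more than fully automatic algorithms.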

Structured and semi-structured data from the web is an excellent use case for our software platform, since there we come across all the interesting cases of real-world data heterogeneity. But what I think is especially powerful, and yet missing in other Linked Data projects I know, is the ability to subscribe to update feeds. We do that extensively, fetching updates in real time from Wikipedia and the like.

Our platform is built in Scala and runs on a cluster of machines, with workers communicating through a messaging system. We developed an RDF storage layer on top of a distributed key-value store for storing all provenance information used in the extraction process, currently around 100 million named graphs for uberblic.org. That storage layer does not directly provide SPARQL access, so we also push all the output data into a SPARQL endpoint hosted by Talis.
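The idea of keeping provenance in named graphs on top of a key-value store can be sketched roughly as follows. This is a toy model under loud assumptions: plain Python dicts stand in for the distributed store, the class and method names are invented for illustration, and nothing here reflects the actual Uberblic storage layer. Like the layer described above, it offers simple lookups rather than SPARQL.

```python
from collections import defaultdict


class NamedGraphStore:
    """Toy quad store; plain dicts stand in for a distributed key-value store."""

    def __init__(self):
        # Key: graph IRI -> value: set of (subject, predicate, object) triples.
        self.graphs = defaultdict(set)
        # Secondary index: subject IRI -> IRIs of the graphs that mention it.
        self.by_subject = defaultdict(set)

    def add(self, graph, subject, predicate, obj):
        """Record one statement inside the named graph it was extracted into."""
        self.graphs[graph].add((subject, predicate, obj))
        self.by_subject[subject].add(graph)

    def describe(self, subject):
        """Return sorted (predicate, object, graph) tuples, so provenance stays attached."""
        return sorted(
            (predicate, obj, graph)
            for graph in self.by_subject.get(subject, ())
            for (s, predicate, obj) in self.graphs[graph]
            if s == subject
        )
```

Because every statement is stored under the graph it came from, a consumer can always tell which source contributed a given fact, for example by calling `describe` on a subject and reading the third element of each tuple.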

What have been the biggest challenges in tackling the integration issues of dispersed data?

It was quite a steep learning curve to do Linked Data not only in an academic environment, but in a reliable, industry-strength set-up. In academia, there was always the excuse that things are just research prototypes. Now that excuse is gone. That’s also where it becomes necessary to manually clean up data. And there are two ways to do that: Either you enable the users to change facts directly in your repository after you have imported the external data (that is what Freebase does), or you facilitate clean-up cycles in the original data source and fetch these updates in real-time. That is what we do.

I believe that data publishing must happen in a distributed style, because then each data source gets taken care of by a specialized group of people using specialized tools. And it’s what you see not only on the web, but also inside organizations and enterprises. But consuming data through centralized APIs is more than just convenient. We all use Google or another search engine as a central access point to web pages which are published in a distributed way all over the web, don’t we? Can you imagine today researching a topic on the web without the centralization power of search engines, just by following links across web sites, like in the old days?

When we built the Uberblic Platform, some of the things I imagined to be large headaches, like schema mapping, turned out to work really well. Those pathological cases you often see in academic “challenges” are – well – pathological. It’s not necessary to solve them fully automatically through super-intelligent algorithms. Much more important than the sophistication of your algorithms are well-designed workflows, so that the user becomes a part of the solution. And that’s not about crowd-sourcing or swarm intelligence: the editorial curating of schema mappings and object reconciliation can be done by just a small team of people, if they have the right set of tools.

What are the next plans with uberblic.org? Where will the journey go?

Uberblic.org will continue to integrate more interesting and useful data sources from the web, and we will start making more APIs available to web developers to build their applications on top. We are also looking for partners who are interested in developing applications and have been struggling in the past to get the cross-source data from the web they need.

The work on improving uberblic.org will also benefit our Uberblic Platform, and hence our clients who use that same software for integrating organizational data sources with each other and with the web of data.

About Georgi Kobilarov

Georgi is founder and managing director of Uberblic Labs, a company based in Berlin specialized in Linked Data integration. He worked as a research associate in the Web-based Systems Group at Freie Universität Berlin and as a visiting researcher at Hewlett Packard Labs Bristol. As co-founder and lead developer of DBpedia, he was also a day-one contributor to the Linking Open Data project. Georgi is consulting with the BBC on several Linked Data related projects. He organizes the Web of Data Meetup London, a bi-yearly gathering of the UK Linked Data community. Georgi graduated with a Diplom in business administration from Freie Universität Berlin and has many years of work experience as a software developer. Visit his blog: http://blog.georgikobilarov.com

This entry was posted in Linked Data & Open Data, Mashups & Web services, Semantic Web Applications, Tools & Software by Tassilo Pellegrini.
About Tassilo Pellegrini

From Wikipedia, the free encyclopedia: Prof. (FH) Dr. Tassilo Pellegrini (born 1974) studied International Trade, Communication Science and Political Science at the University of Salzburg and University of Málaga. Since the end of 2007 he has been running the New Media Division at the University of Applied Sciences in St. Pölten. He obtained his master's degree in 1999 from the University of Salzburg on the topic of telecommunications policy in the European Union, which was followed by a PhD in 2010 on the topic of bounded policy-learning in the European Union with a focus on intellectual property policies. His current research encompasses economic effects of internet regulation with respect to market structure and basic civil rights. He is a member of the International Network for Information Ethics (INIE), the African Network of Information Ethics (ANIE) and the Deutsche Gesellschaft für Publizistik und Kommunikationswissenschaft (DGPUK). Besides his specialisation in policy research and media economics, Tassilo Pellegrini has worked on semantic technologies and the Semantic Web. He is co-founder and Head of Division Research and Development of the Semantic Web Company in Vienna, co-editor of the first German textbook on the Semantic Web and Conference Chair of the annual I-SEMANTICS conference series founded in 2005.

One thought on “Interview with Georgi Kobilarov: “I believe that data publishing must happen in a distributed style.”

  1. Seems great software with a real purpose.
    If uberblic.org had been located in the San Francisco Bay Area, then TechCrunch, VentureBeat, ReadWriteWeb and GigaOm would have been blogging about it to let the world know.
