Semantic Web Company

The Semantic Puzzle

Open World Assumptions


Archive for the ‘Linked Data & Open Data’

I-Semantics 2010: Relevance of semantic technologies for industry increases fast

July 01, 2010 By: Andreas Blumauer Category: Calls & Competitions, Conferences & Events, Corporate Semantic Web, Linked Data & Open Data 1 Comment →

I-Semantics 2010

I-Semantics will take place for the sixth time this September, again co-located with I-Know in Graz, Austria. This year's programme shows that the Semantic Web and semantic technologies in general are increasingly relevant for all kinds of industries:

  • Biomedicine
  • Public administration & Public transport
  • Information technology
  • Libraries
  • Media & Content Industry
  • E-commerce
  • Education etc.

450 people in 2009

The I-Semantics “Industry Track”, with its three-day programme full of demos, is one of the highlights of the congress. With 28 submissions, this year's Triplification Challenge says a lot about the significance of Linked Data in areas like librarianship, public administration, or GIS and environmental planning. Take a look at the 15 nominees, and if you are considering coming to I-Semantics 2010, follow the link for registration.


Stella Dextre Clarke & Alan Gilchrist about the “Future of Knowledge Organization on the Web”

June 21, 2010 By: Andreas Blumauer Category: Linked Data & Open Data, Tools & Software, Vocabularies & Languages 1 Comment →

Semantic Web Company (SWC) had the pleasure and the opportunity to talk with two internationally recognised experts in the fields of information management and knowledge organization: Alan Gilchrist and Stella Dextre Clarke. SWC asked some questions about the “Future of Knowledge Organization on the Web & Linked Data” on the occasion of an event of the same name organised by ISKO UK which will take place on September 14, 2010 in London.

1. Alan, you are one of the leading experts in the field of thesaurus construction. Organising knowledge in a (worldwide) Semantic Web is a rather young discipline compared to your domain. What do you think can the Semantic Web community learn from “traditional” thesaurus management and vice versa?

You put inverted commas round the word traditional, but it might be more appropriate to put them round the word thesaurus! So long as words are used in information retrieval and in information sharing, different forms of structured vocabularies will be required, and many of the fundamental principles of thesaurus construction are still valid for their construction. Of course, the “traditional” thesaurus has mutated since the days when it was used only for controlled indexing and retrieval; and now, with the many enrichments possible it can be viewed as an ontology (in one of the definitions of this word). What remains a difficulty is to create a generalisable typology of associative relationships, though this is, of course, possible in relatively closed systems. In short, structured vocabularies with broadly thesaurus formats will be a necessary component in the web stack.

2. Stella, as a consultant you are specialized in the design and implementation of knowledge structures for information retrieval applications. In the last few months we have seen that SKOS can serve as a significant building block to link “traditional” thesaurus management to knowledge structures from the semantic web. Can you see that this development is market-driven, is there a significant growth of demand for solutions built around SKOS?

This question sounds surprisingly sceptical about the growth of SKOS. I guess the dizzying speed of phenomena like Facebook and Twitter has fuelled expectations of tools springing up overnight like mushrooms, fully formed and ready to eat. But actually it takes time, not just for the tools to be fashioned, but for the potential market to develop an understanding of what they can do and what will happen next when they are used.

Applications for SKOS are springing up all the time, as fast as people can grow the skills and vision to deploy them. At the moment the market, or shall we say the power-base, seems to be with the academic sector and allied not-for-profit organisations. This will spread progressively through the public to the private sector, as enterprises find ways of adapting their business models. The main hurdles to overcome could be intellectual property rights and the need for compilers of databases to keep earning their living.

3. Alan, constructing thesauri for the semantic web also means that one has to make the “open world assumption”. In which sense does this change the way to manage thesauri, keep them growing and assure quality? Can you see new, upcoming methodologies to do that?

Everything changes with the “open world assumption”! Following on from my answer to the previous question, it seems clear that one manifestation of the thesaurus will be found in those systems that support interoperability, such as federated searching or metadata registries. Even with simple thesaurus management software, it is possible to construct a “master vocabulary” or “word bank” to support different applications within an enterprise; thereby promoting interoperability. More sophisticated software is already available (though not very widely); more will be needed and, doubtless, will be created.

A more formal answer to both questions will be found in a new standard, ISO 25964, currently being prepared on the basis of BS 8723. The two fundamental features of these two standards are (1) the thesaurus as a theoretical and practical basis for the construction of structured vocabularies for information retrieval and (2) the growing and vital need for interoperability between systems and the intelligent mapping of the vocabularies used by those systems.

4. Stella, just recently at ESWC 2010, Sean Bechhofer was asked during his keynote why there are so few SKOS tools on the market. What do you think are the reasons for this? Are there still shortcomings of the SKOS specification compared to other existing thesaurus standards? (see also: http://www.eswc2010.org/program-menu/keynote-speakers/155-sean-bechhofer & http://www.slideshare.net/seanb/skos-past-present-and-future )

Regarding the speed of development, see my reply above. As to shortcomings, did you note in one of Bechhofer’s slides: “Standardisation is necessarily a compromise: Everyone equally unhappy = success!” The SKOS development team took a conscious decision to keep the schema sufficiently simple that it could be applicable to as many different types of KOS as possible.  On the downside, this means SKOS is unsatisfactory for conveying sophisticated features of some thesauri and classification schemes. But by keeping the entry barrier low, more widespread use has been encouraged.

By way of illustration, compare SKOS with the data model and XML schema of BS 8723. This schema is comparatively specialized, with the aim of enabling exchange of any thesaurus carrying any or all of the features recommended in the standard. And incidentally, this data model and schema will have some further capabilities added when published in the forthcoming standard ISO 25964. SKOS does not provide for a number of features in these standards (such as compound equivalence). But the schemas in BS 8723 and ISO 25964 are designed for thesaurus developers to share their work, rather than for easy publication on the Web, and will never have so many users or associated tools as SKOS.

So I believe that SKOS has done well to accept compromises that encourage generalisation although they might not suit some specialists. That said, I do regret one of its weaknesses in the context of mapping. Compound equivalence mappings (that is to say, where Concept A in one vocabulary maps to a combination of Concepts  B and C in another) are very commonly needed when extending a search across multiple databases, and the SKOS mapping properties do not currently allow for them. Perhaps there will be some provision in future?
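To make the compound equivalence case concrete, here is a minimal Python sketch. It is not SKOS itself and not taken from BS 8723 or ISO 25964; the class names, the `translate_query` helper and the `vocabA:`/`vocabB:` concepts are all hypothetical. It only illustrates why a one-to-one mapping is not enough when a search has to be carried over to another vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleMapping:
    """One-to-one mapping between two concepts (hypothetical helper class)."""
    source: str
    target: str


@dataclass(frozen=True)
class CompoundMapping:
    """One source concept mapped to a combination of target concepts."""
    source: str
    targets: tuple  # all targets must apply together (logical AND)


def translate_query(concept, mappings):
    """Return a search expression in the target vocabulary, or None if unmapped."""
    for mapping in mappings:
        if mapping.source != concept:
            continue
        if isinstance(mapping, CompoundMapping):
            return " AND ".join(mapping.targets)
        return mapping.target
    return None


# Illustrative concepts only; they do not come from any real vocabulary.
mappings = [
    SimpleMapping("vocabA:Dogs", "vocabB:Canines"),
    CompoundMapping("vocabA:WomenExecutives", ("vocabB:Women", "vocabB:Executives")),
]

print(translate_query("vocabA:WomenExecutives", mappings))
```

The second mapping is the kind Stella describes: Concept A in one vocabulary corresponds to Concepts B and C taken together in the other, so the translated search has to combine both.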

5. Stella, Alan, in September ISKO UK will organise an event on “The Future of Knowledge Organisation on the Web”. “Linked Data” seems to be a promising approach to organise knowledge in large scale environments. Could you imagine that SKOS as a small subset of semantic web specifications will play a central role in this environment since it is quite intuitively comprehensible by virtually any knowledge worker or do you rather think SKOS is too simple (or too complex)? (see also: http://poolparty.punkt.at/using-skos-as-an-interface-to-the-linked-data-cloud )

Stella: Of course SKOS will have a central role (whether or not every knowledge worker finds it as intuitive as you suppose). “Linked Data” will find even wider applicability. ISKO-UK (the organiser of the meeting in London on 14 September) has a mission not just to spread the word about both these technologies, but to build bridges between the several communities who must share their expertise and data to build more exciting applications. We’re expecting an audience of over 100 at this low-cost event.

Alan: Yes, of course, just as all the tools in the web stack will be necessary if semantic web technologies are to be effective. But it is obvious that we are dealing with complexities of a higher order than ever before. Any structured vocabulary is an “artificial language” which, while acknowledging many aspects of theoretical linguistics is forced to be pragmatic in its construction. Consequently, it would not be surprising if SKOS is seen to be “catching up”, and this became apparent in the work of BS 8723 when thesaurus models using UML were being constructed. There remains much work to be done on all fronts.

Stella Dextre Clarke is an independent consultant specializing in the design and implementation of thesauri and other knowledge organization structures. She currently leads ISO NP 25964, the project to update and revise the international standards for thesauri. Previously she was the Convenor of the Working Group which developed BS 8723. In 2006 she won the Tony Kent Strix Award for outstanding achievement in information retrieval, in recognition for her development work on IPSV (Integrated Public Sector Vocabulary), as well as on the vocabulary standards. She is a Fellow of the Chartered Institute of Library and Information Professionals.

Alan Gilchrist has been a consultant for many years in the fields of information management and information architecture, specialising in the vocabulary aspects of information retrieval. He is co-author, with Jean Aitchison and David Bawden of Thesaurus Construction and Use, now in its fourth edition. In 1979 he founded and edited the Journal of Information Science, and is now Editor Emeritus. He has an Honorary Degree (D. Litt.) from the University of Brighton and is an Honorary Fellow of the Chartered Institute of Librarians and Information Professionals.


Kingsley Idehen: “By declaring its context, Linked Data can be made more easily reusable by others”

June 16, 2010 By: Andreas Blumauer Category: Corporate Semantic Web, Enterprise 2.0, Linked Data & Open Data, Tools & Software No Comments →

Semantic Web Company talked with Kingsley Idehen who is CEO of OpenLink Software and probably one of the most profound experts on data integration issues about “Linked Data”.

The interview covers questions like:

  • How can Linked Data help to make companies more productive?
  • Do you think that the Linked Data Initiative can build upon a stable architecture or will it face more and more problems the bigger the “cloud” will grow?
  • What's the ultimate argument for an Enterprise Architect to use languages like SPARQL at least in addition to SQL?
  • How will a “Real Time Semantic Web” change the whole game?
  • How will the “Semantic Web” be called in 10 years? Will there still be a “Semantic Web”?

Read the full version of the interview here.


Adrian Pohl: “We believe the Semantic Web plays an important role for the future of libraries.”

May 20, 2010 By: Tassilo Pellegrini Category: Companies & Institutions, Linked Data & Open Data No Comments →

A group of Cologne-based libraries has taken a big step towards open data. In a concerted action they have released their catalogue data for reuse on the web. Project manager Adrian Pohl comments on the initiative and what role the Semantic Web will play for libraries in the future.

In March 2010 several Cologne-based libraries opened their catalogue data under a CC0 license, following Tim Berners-Lee’s call for “Raw Data Now!”. What was the motivation behind this step?

The hbz (“Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen”, English: “North Rhine-Westphalian Library Service Centre”) has come to the conclusion that libraries need to participate in the development of the Semantic Web. The opening of catalog data followed as a necessary first step. Our intention with this first legal-political step is to show how important the legal and licensing dimension is when you publish data on the web, be it Linked Data or not. So for us at the hbz, the Open Data initiative is primarily the first step towards eventually publishing Linked Open Data, just as Tim Berners-Lee called for.

Other participants in the Cologne Open Data initiative, like the Cologne University and City Library, focus more on the direct advantages that releasing raw bibliographic data brings: with other libraries and consortia following this example, it will be easy to enrich existing catalog or other bibliographic services with subject headings, classification numbers, tags etc. Also, published raw data is integrated into other web services like Wikipedia, which point back to libraries’ services. Indeed, Open Data is an end in itself which should be pursued by more organizations in the library world and beyond it.

The provided data is currently available in a proprietary but open format. Can you give us a technical description of the published data? Do you plan to provide more structured datasets in the future?

“Opaque but open” would be the better description of the underlying format because it isn’t proprietary at all. Actually, alongside the data from the hbz union catalog there is data stemming from libraries’ local databases (see http://opendata.ub.uni-koeln.de/ and http://opendata.zbsport.de/). We are using different internal formats. Generally, all the formats are based on the MAB format (an acronym for “Maschinelles Austauschformat für Bibliotheken”, which means “Automatic Interchange Format for Libraries”). MAB is only used in the German and Austrian library world for data interchange between libraries, similar to the better known MARC format (Machine-Readable Cataloging) of the Library of Congress. It was developed in the 1970s for storing data on magnetic tape. The format documentation can be viewed on the German National Library’s webpages.

As the format is nearly 40 years old, the processing of MAB data is very cumbersome on modern computers. Therefore, the hbz provides an encapsulation method called “generic format”, where the historic data records of the library catalogs are unwrapped into a more common, user-friendly scheme. Each record is placed into a Unicode UTF-8 encoded file containing all the MAB fields, each of them separated by line feeds. The whole record set of a library forms a “tar” archive, which is compressed afterwards to save space.

These archives can be unpacked with any common unpack tool, which is available on all known Windows/Linux/Unix platforms. Or you can use a simple Perl helper script provided by hbz. More tools and scripts, also in other programming languages, are in preparation for publication.

The opaqueness and the age of the standards used in the library world (the English standard MARC, which is used worldwide, doesn’t differ from MAB in these respects) make it necessary to change to a more open and widely adopted standard. That’s where Linked Data comes into play, which is based on the accepted and widespread standards HTTP and URIs. The construction of RDF out of the library catalog raw data is a very sophisticated design task. We plan to convert the existing data to RDF using suitable vocabularies that let us lose as little information as possible, and to give access to the data through a SPARQL endpoint.
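As a rough illustration of the “generic format” described above, the following Python sketch reads a compressed tar archive in which every member is a UTF-8 file with one field per line. It is not the hbz Perl helper script. The compression type, the file names and the placeholder field values are assumptions made for this example, so the sample archive is built in memory instead of being downloaded.

```python
import io
import tarfile


def read_records(archive_bytes):
    """Yield (file_name, fields) for every record file in a compressed tar archive."""
    # Mode "r:*" lets tarfile detect the compression (gzip, bz2, xz) by itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode("utf-8")
            # One field per line; empty lines are dropped.
            fields = [line for line in text.split("\n") if line]
            yield member.name, fields


def build_sample_archive():
    """Build a tiny gzip-compressed tar in memory (placeholder content, not real MAB data)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, body in [("rec1.txt", "field-a\nfield-b\n"), ("rec2.txt", "field-c\n")]:
            data = body.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


records = list(read_records(build_sample_archive()))
for name, fields in records:
    print(name, fields)
```

Interpreting the individual MAB fields is a separate step that depends on the format documentation and is deliberately left out here.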

Currently the data you provide is open but not yet linked. What are your plans when it comes to contribute to the Linked Data Cloud?

I have to go into greater detail to answer this question properly. Viewed simply, the data of library institutions can be divided into two broad types: authority data and bibliographic data. Authority data splits up into data about people, about corporate entities and about subject headings. In Germany, authority data is maintained centrally by the German National Library in cooperation with the six German library consortia. Bibliographic databases consist of records about books, or rather editions of books. Authority data and bibliographic data are already heavily linked: for instance, a bibliographic record contains the author’s or editor’s authority number, which links to the corresponding authority record.

The German National Library is also working on migrating library data, especially authority data, into the Semantic Web. They recently made their Linked Data prototype for authority data publicly available. We have already taken first steps to cooperate and coordinate our efforts. As they take care of authority data, we focus on bibliographic data. At the moment we are exploring the technology and vocabularies for publishing bibliographic data as Linked Data. That’s a demanding task because the known vocabularies like Dublin Core or the Bibliographic Ontology (Bibo) don’t fully map to the density and structure of the information in the catalogs. Besides these, there has been several years’ work on the new comprehensive cataloging standard RDA (Resource Description and Access), for which an RDF representation has been developed. However, RDA in RDF needs to be modified a lot before it can be applied to our bibliographic data. We are currently working on a vocabulary for the union catalog’s data based on existing vocabularies like Bibo and RDA.

Of course, as soon as we have published bibliographic data as linked data, we will start linking to hubs in the Linked Data Cloud like DBpedia or GeoNames.

Publishing data to the LOD Cloud is one thing. Consuming data is another. Have you plans to integrate data from the LOD Cloud into your systems? Do you have policies for quality assurance?

Of course the possibility to incorporate data from other sources easily is one major reason for us to publish Linked Data besides the goal of making libraries’ data an integral part of the web. Enriching our data with other data and providing new services through and with mashups would be a main reason to link to other data. We are, however, not working on such projects yet, because we first need to convert our legacy data to RDF.

What role will the Semantic Web play for libraries in the future?

We believe the Semantic Web plays an important role for the future of libraries. Discussions about “Next Generation Catalogs” have been a recurring theme in the library world since the 1990s. It is time to finally act and move our data, imprisoned in opaque formats, to a new level by improving its structure and underlying technology and by migrating to formats that can be easily consumed by others who are not part of the library world. Joining the Linked Open Data community seems to us the best way to go.

Also, the production, publication and dissemination of academic literature is subject to ongoing and fundamental changes which have far-reaching implications for the work of academic libraries and their role in research and education. We believe that semantic markup and interlinking will play an important role in the development of knowledge production and thus will indirectly have great impact on libraries. Clearly, the Semantic Web cannot be left out of the future of libraries.

Moreover, turning your question around, libraries could play an important role for the future of the Semantic Web. Libraries are trusted institutions and deeply grounded in our culture. As indicated above, libraries have produced linked data (again: lower case) since the time of card catalogs. We undoubtedly have some practice in producing and curating linked data, which should be worth a lot to the Semantic Web community. We thus think libraries are predestined to help continuously order the messy place the Semantic Web will always be and to ensure its trustworthiness and stability.

About Adrian Pohl

Adrian Pohl is working at the Cologne-based North Rhine-Westphalian Library Service Center on Open Data, Linked Data and its conceptual, theoretical and legal implications. He regularly writes at Übertext: Blog about the internet, libraries and metadata, Linked Open Data, communication, epistemology and the like. He has studied communication science and philosophy in Aachen and is currently studying Library and Information Science at the Cologne University of Applied Science. You can follow him on Twitter: http://twitter.com/acka47.


A Dynamic Web Of Data

April 26, 2010 By: Michael Hausenblas Category: Linked Data & Open Data, Semantic Web Applications 2 Comments →

As a matter of fact things change – the Web of Data is no exception in that respect. While some sources, such as Twitter, are intrinsically dynamic, others change every now and then, potentially in unforeseeable intervals. In the recent Talis Nodalities Magazine, we made a case for Keeping up with a LOD of changes; here I’m going to elaborate a bit more on the current state of Dataset Dynamics and its challenges.

Let us first step back a bit and look at what Dataset Dynamics is and why it is important. In the Web of Linked Data we typically deal with datasets, for example from the biomedical domain or the media industry, on the one hand, and entities, such as a certain protein or a person, on the other. For the entity-level case, established HTTP caching mechanisms can be leveraged (see the Caching Tutorial and Things Caches Do). Further, with Memento, an HTTP-based versioning mechanism has been proposed as well as implemented, adding a “time dimension” to HTTP (see Fig. 1).

Fig. 1 Memento Framework (Source: "An HTTP-Based Versioning Mechanism for Linked Data" Herbert Van de Sompel, Robert Sanderson, Michael Nelson, Lyudmila Balakireva, Harihar Shankar, Scott Ainsworth, LDOW 2010)
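For the entity-level case, the HTTP caching mechanisms mentioned above come down to conditional requests. The following Python sketch shows only the header logic, without any network access. The function names are made up for this illustration, while `If-None-Match`, `If-Modified-Since` and the `304 Not Modified` status are standard HTTP.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to answer only if the entity changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def entity_changed(status_code):
    """A 304 Not Modified response means the cached copy is still valid."""
    return status_code != 304


# Validators remembered from an earlier response (illustrative values).
headers = conditional_headers(etag='"abc123"',
                              last_modified="Mon, 26 Apr 2010 10:00:00 GMT")
print(headers)
print(entity_changed(304), entity_changed(200))
```

A consumer would send these headers with its next GET for the entity and re-fetch the description only when `entity_changed` is true. This works for a single resource, but it means one request per entity, which is where the dataset-level problem discussed next comes from.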

Dataset-level changes

However, tackling dataset-level changes is a rather new field with no agreed-upon, let alone standardised, solution at hand. The main problem is that a dataset typically talks about many thousands to millions of distinct entities, which makes it impractical to apply entity-level solutions for a range of use cases, such as link maintenance or replication (see also Fig. 2).

Fig. 2 Change frequency vs. change volume

I often hear these days: “it seems there is no solution for handling of dataset-level changes”; nevertheless, I think quite the opposite is true. There are plenty of proposed solutions from both academia and practitioners, targeting different challenges in the areas of:

  • Change discovery – how do I find out about dataset changes?
  • Propagating changes – if there is a change, how is the change communicated to a consumer?
  • Change semantics – how do I learn what has changed (has been added, removed, etc.)?

Some proposals on the table are integrated approaches (such as DSNotify, SemanticPingback, Talis Changeset), while others focus on certain aspects (like the dady vocabulary for discovery or the Graph Update Ontology for change semantics) or deal with concrete environments, for example sparqlPuSH for SPARQL endpoints.

A Dataset Dynamics Manifesto

No matter what (set of) solutions the community eventually agrees on to address the handling of dataset-level changes, it should adhere to the following principles:

  • light-weight
  • distributed and scalable
  • standards-based

Obviously, a light-weight (and ideally RESTful) approach lowers the barriers to adoption and enables a quick uptake. When I say light-weight, I mean it both in terms of protocol and code. It should be easy to integrate in RDF stores and libraries and available in all common Web programming languages including but not limited to Java, PHP, .NET family, etc.

Just as the Web of Data is a globally distributed dataspace, handling of changes should be done in a distributed fashion. There will be many different publishers and consumers (such as agents, indexers, consolidator platforms, etc.) of datasets with different requirements and capabilities. A distributed approach can cope with this challenge in a cost- and performance-efficient way. Tightly connected to this: it has to scale. Today, we’re dealing with some hundreds of LOD datasets. In the next couple of years, this will likely explode into the millions, and hence one needs to be able to deal with such growth. The same is true, just sooner, for the number of consumers of the changes.

Last but not least, the Dataset Dynamics solution should be based on standards. It doesn’t necessarily need to be RDF for all of the challenges outlined above. For example, Atom offers a standardised, extensible and widely accepted format to propagate changes; to take this further, Pubsubhubbub can be utilised to enable a standardised, distributed publisher-subscriber scheme (Fig. 3).

Fig. 3 Pubsubhubbub - a standard-based, distributed publisher-subscriber-hub system (Source: http://docs.google.com/present/view?id=ajd8t6gk4mh2_34dvbpchfs)
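To give an idea of how light-weight the Atom option can be on the consumer side, here is a small Python sketch that reads an Atom feed of changes and returns the resources updated after a given point in time. The feed content, the `example.org` URIs and the `changes_since` function are invented for this illustration. Only the Atom namespace and the `entry`, `id` and `updated` elements are taken from the Atom standard.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# A hypothetical change feed published by a dataset.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Dataset changes (hypothetical)</title>
  <entry>
    <id>http://example.org/resource/1</id>
    <updated>2010-04-20T10:00:00Z</updated>
  </entry>
  <entry>
    <id>http://example.org/resource/2</id>
    <updated>2010-04-25T10:00:00Z</updated>
  </entry>
</feed>"""


def changes_since(feed_xml, since):
    """Return the ids of all feed entries updated after the given datetime."""
    root = ET.fromstring(feed_xml)
    changed = []
    for entry in root.findall(ATOM + "entry"):
        stamp = entry.findtext(ATOM + "updated")
        # Rewrite the trailing "Z" so that datetime.fromisoformat accepts it.
        updated = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if updated > since:
            changed.append(entry.findtext(ATOM + "id"))
    return changed


last_sync = datetime(2010, 4, 22, tzinfo=timezone.utc)
print(changes_since(SAMPLE_FEED, last_sync))
```

With a hub in between, as in Pubsubhubbub, the consumer would receive such feed updates pushed to it instead of polling; the parsing step stays the same.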

As I’ve outlined above, it might still be too early for a conclusion on how to deal with dataset-level changes. However, people interested in this area have already gathered in the Dataset Dynamics group, where solutions are discussed and implemented, potentially leading to W3C standardisation work.

As an aside: in case you’re at the WWW2010 in Raleigh (NC, USA) these days, you may want to join the break-out meeting on Dataset Dynamics during the W3C Linked Open Data track on 29 April 2010.

(This blog post was written by Michael Hausenblas)


Sören Auer: “Establishing a network effect around linked data is the most important R&D goal for the near future.”

April 15, 2010 By: Tassilo Pellegrini Category: Conferences & Events, Linked Data & Open Data, Politics, Privacy & Information Ethics No Comments →

Leipzig is one of Germany’s Semantic Web hotspots. From May 5-6, 2010 the annual Semantic Web Day provides the opportunity to catch up with latest developments especially in the domain of Linked Data and the foundation of the German chapter of the Open Knowledge Foundation. Organizer Sören Auer gave us some background information.

From May 5 – 6, 2010 the 3rd Semantic Web Day in Leipzig will take place. What will be this year’s topics? Who should attend?

The Semantic Web Day targets IT people, software developers, decision makers and users interested in learning about the potential of semantic technologies. The language of the event is German, so primarily Austrians, Swiss and Germans will attend. Besides semantic technologies, a particular focus of this year’s event is open data in governments, public administrations and science. Although the programme is not yet finalized, we have already compiled an interesting range of talks and presentations, including talks about the open biodiversity database Fishbase, the European Digital Library Europeana, a Linked Data project of the German Umweltbundesamt, use case presentations from the pharma, publishing and telecommunication industries and many more (cf. http://aksw.org/LSWT). Also, in addition to AKSW, the Topic Maps Lab and the Web Data Integration Labs from Universität Leipzig will be present at LSWT.

One of the highlights of this year’s Semantic Web Day is the official institutionalization of the German Chapter of the Open Knowledge Foundation. How did this come about? What does this mean for the OKF as a whole?

OKFN started to work in 2006 and has since managed to successfully complete a number of projects facilitating open knowledge. In particular, the Comprehensive Knowledge Archive Network (CKAN), the OKCon conference series, the open knowledge definition and recently OKFN’s involvement in the launch of data.gov.uk are prominent examples of OKFN’s successful work. However, many of the OKFN activities were primarily driven by an active group of volunteers in the UK. With the official launch of the German OKFN branch we will strengthen the international dimension of OKFN’s work. Especially for Germany, where data privacy and security are perceived to be most important, raising awareness for enabling open, standards-compliant access to public information will be an important target of OKFN’s activities.

The InFAI has become one of the hotspots in Semantic Web development in Germany over the past few years. What are you working on at the moment? What are the most interesting research and development aspects for the near future?

From our point of view, establishing a network effect around the publishing and use of linked data is the most important research and development goal for the near future. We just completed a first draft and implementations of a semantically enabled pingback method (http://aksw.org/Projects/SemanticPingBack), which applies to linked data endpoints a peer notification mechanism similar to the one widely deployed in the blogosphere. Other important research issues we are tackling with our partners are closing the performance gap between RDF and relational data management, increasing the coherence and quality of linked data, and the provisioning of adaptive user interfaces for authoring and maintaining information on the data web.

About Sören Auer

Dr. Sören Auer leads the research group Agile Knowledge Engineering and Semantic Web (AKSW) at University of Leipzig. His research interests include Semantic Web technologies, knowledge representation, engineering and management, agile methodologies as well as databases and information systems. Sören is founder (respectively co-founder) of several high-impact research and community projects such as the Wikipedia semantification project DBpedia, the open-source innovation platform Cofundos.org or the social Semantic Web toolkit OntoWiki. Sören is author of over 50 peer-reviewed scientific publications, co-organiser of several workshops, chair of the Social Semantic Web conference 2007 and I-Semantics 2008, serves as an expert for industry, the European Commission, the W3C and is member of the advisory board of the Open Knowledge Foundation.


Interview with Juan Sequeda: “I believe Linked Data will enable new killer apps that are only possible thanks to Linked Data.”

April 14, 2010 By: Tassilo Pellegrini Category: Calls & Competitions, Linked Data & Open Data, Semantic Web Applications 1 Comment →

Juan Sequeda, co-chair of the Triplification Challenge 2010 and one of the core figures in the Linked Data movement, gives us his view how the Semantic Web might evolve. His central message: “Once there is an incentive to create quality links, these links will start to show up. And then users will start linking to the data hubs of their interest.”

Linked Data itself has grabbed a lot of attention inside the Semantic Web community recently. But what about the outside perspective? Could linked data be called the killer app for the Semantic Web?

I foresee two things happening with Linked Data. One is from the web development perspective (the so-called Web 2.0 developers) and the other is from the enterprise perspective. The web development community will sooner than later realize that Linked Data will enable easy integration of data and therefore will ease the pain of consuming data from different data sources. Thanks to big organizations such as BBC, New York Times, Reuters, Best Buy, etc. web developers will start paying attention to this “new thing” called Linked Data.

What we need is that the inside Semantic Web community starts to create applications on top of current Linked Data so when the outside web development community starts to pay attention, they have something to chew on. We (the semantic web community) need to start speaking the web development language. There is still a big gap. I have had personal experiences with people in the web development community who think that RDF is XML and because they hate XML, they will never consider it. This is false and this is something that we need to change.

From the enterprise perspective, Linked Data is another data integration solution. Data integration has been a problem since day one of relational databases. I believe enterprises will be open to consider new solutions with new technologies. I’m hoping to see new startups tackling the enterprise domain. Imagine being able to query “get all my clients from cities whose population is greater than 1 million” even though I don’t have the data about population of cities in my database.
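
The query in this answer amounts to a join between private records and a public dataset on a shared identifier. A minimal Python sketch of that idea, assuming a hypothetical in-memory client list and a hypothetical external population lookup keyed by city URIs (the names, URIs used as join keys and population figures are illustrative placeholders, not taken from the interview):

```python
# Illustrative only: client records and population figures are
# made-up placeholders for this sketch.

# Local data: the enterprise knows its clients and identifies each city by a URI.
CLIENTS = [
    {"name": "Acme GmbH", "city": "http://dbpedia.org/resource/Berlin"},
    {"name": "Graz Tools", "city": "http://dbpedia.org/resource/Graz"},
    {"name": "Big Apple Ltd", "city": "http://dbpedia.org/resource/New_York_City"},
]

# External data: stands in for what a Linked Data source might return
# when the city URIs are looked up. Figures are rough placeholders.
EXTERNAL_POPULATION = {
    "http://dbpedia.org/resource/Berlin": 3_400_000,
    "http://dbpedia.org/resource/Graz": 250_000,
    "http://dbpedia.org/resource/New_York_City": 8_000_000,
}


def clients_in_large_cities(clients, population_by_city, threshold=1_000_000):
    """Return names of clients whose city population exceeds the threshold."""
    result = []
    for client in clients:
        population = population_by_city.get(client["city"])
        # Cities missing from the external source are skipped, not guessed.
        if population is not None and population > threshold:
            result.append(client["name"])
    return result


print(clients_in_large_cities(CLIENTS, EXTERNAL_POPULATION))
```

The point of the sketch is the shared URI: the local database never stores population data, yet the join works because both sides name the city the same way.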

Is Linked Data the killer app for the Semantic Web? Before I answer that, I would like to ask, what was the killer app of the Web? Was it the browser? Was it e-commerce? Was it search? Was it Amazon or Ebay or Google? I believe Linked Data will enable new killer apps, apps that are only possible thanks to Linked Data. The browser was only possible because of HTML. So let’s ask ourselves what is possible because of Linked Data, and there we will find our killer app.

One of the core deficiencies of the young open data cloud is the small number of interlinks between datasets. Is it just a matter of time to overcome this or are there other measures needed to turn the existing datasets into a true giant global graph?

I like to remind myself that this new wave of semantic web technologies is an extension of the current web. Therefore we should analyze how the web evolved in the beginning. Initially, everything was a bunch of documents on the web in which people manually created links to other documents. When Google started, it created an incentive to offer quality links between documents. This also created data hubs. If you write a blog post about a book, most probably you will link to the web document of that book either on Amazon and/or Wikipedia. I believe that this will happen with Linked Data. Once there is an incentive to create quality links, these links will start to show up. And then users will start linking to the data hubs of their interest.

Open Governmental Data is a big issue at the moment. The US and UK governments have started to apply Linked Data principles to turn this vision into reality. Lots of other countries are following. What do you expect from this trend?

I believe that Linked Data will take off thanks to the initiative of governments. We always talk about the chicken and egg problem of the semantic web. Once we have organizations that don’t even think about it and are just interested in putting their data on the web, the semantic web will start to grow. If Bookstore ABC puts their data on the web, it may not be so meaningful. But if the US and UK governments put their data on the web, following the Linked Data principles, then people can wake up and say “ok, so this is for real. Let me start paying attention to this”.

You are one of the chairs of the Triplification Challenge 2010. Can you give us a brief insight into what to expect from this year’s challenge? What are the conditions to participate?

The Triplification Challenge this year has grown and is very exciting. For the first time, it is offering two different tracks.

The first track, the Open Track, will accept submissions in three areas: 1) new datasets that are published following the Linked Data principles and that show potential benefit, 2) generic methods, mechanisms and approaches for creating Linked Data from legacy datasets, and 3) applications that make use of Linked Data.

The second track is the New York Times track, which will accept submissions of applications that make use of the New York Times Linked Data and one or more government datasets. The objective is to create an application powered by Linked Data that would be of interest to any constituent of that government.

I personally believe that the year 2010 is the year of creating Linked Data applications and the Triplification Challenge is the way to be part of it.


Interview with Georgi Kobilarov: “I believe that data publishing must happen in a distributed style.”

March 26, 2010 By: Tassilo Pellegrini Category: Linked Data & Open Data, Mashups & Web services, Semantic Web Applications, Tools & Software 1 Comment →

Uberblic.org connects structured data from the web. The Berlin-based inventor Georgi Kobilarov gives a brief insight into the mashup service and talks about the challenges of building applications on linked data.

You have recently published the service uberblic.org, a Linked Data mashup editor. What was your motivation to develop this tool?

Uberblic.org provides an integrated view of web data. Our goal is to integrate all the structured data on the web, and give web developers a single point of access to that reconciled data. More than that, we will open up the tools we use to manage the data sources to the community, so that people can help us curate that repository of free data. We re-publish all the data we import as Linked Data, under the licenses of the original data publishers.

Some of the data sources we import are available in the Linked Open Data cloud as well, but many are not. Linked Data is an elegant way to publish data in a distributed way on the web, but consuming it from that distributed cloud is – at least – impractical. In every real-world application using linked data from the web I’ve seen, organizations built up internal copies of the cloud, and often even reconciled linked data sources. They build their own Linked Data proxies. Uberblic.org helps those users by providing one public proxy for data from the web. Many of our sources get monitored for data changes, and the corresponding data in uberblic is updated in real-time.


Can you give us a brief insight into how the tool works? What technology is it built on?

My company, Uberblic Labs, has developed a data integration platform that we use to power uberblic.org. We call it the Uberblic Platform (the name uberblic is derived from the German “Überblick” – English “overview”). This platform enables us to do the full process of “data fusion”: Importing and converting external data sources, mapping the data schemas to a central ontology, filtering out data errors, automatically suggesting duplicates to the user, and merging data from different sources into a single, reconciled representation.
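
The stages named in this answer (import, schema mapping, error filtering, duplicate suggestion, merging) can be pictured as a small pipeline. The following Python sketch is hypothetical: none of the function names, the record layout or the mapping table come from the Uberblic Platform, and the merge step here runs automatically, whereas the interview describes duplicates being suggested to a user first.

```python
# Hypothetical sketch of the stages named above; nothing here reflects
# the actual Uberblic Platform code or data model.

# Assumed mapping from source-specific field names to a central schema.
SCHEMA_MAPPING = {
    "source_a": {"title": "name", "pop": "population"},
    "source_b": {"label": "name", "inhabitants": "population"},
}


def map_schema(source, record):
    """Rename source fields to the central schema, dropping unmapped ones."""
    mapping = SCHEMA_MAPPING[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def is_valid(record):
    """Filter obvious data errors: a name is required, population must be >= 0."""
    population = record.get("population")
    return bool(record.get("name")) and (population is None or population >= 0)


def suggest_duplicates(records):
    """Group records by normalized name; groups with >1 entry are candidates."""
    groups = {}
    for record in records:
        groups.setdefault(record["name"].strip().lower(), []).append(record)
    return groups


def merge(group):
    """Merge one group into a single record; earlier records win on conflict."""
    merged = {}
    for record in group:
        for key, value in record.items():
            merged.setdefault(key, value)
    return merged


def fuse(imported):
    """Run the whole pipeline over (source, record) pairs."""
    mapped = [map_schema(source, record) for source, record in imported]
    clean = [record for record in mapped if is_valid(record)]
    return [merge(group) for group in suggest_duplicates(clean).values()]


IMPORTED = [
    ("source_a", {"title": "Graz", "pop": -5}),   # rejected: negative value
    ("source_a", {"title": "Vienna"}),             # no population given
    ("source_b", {"label": "vienna ", "inhabitants": 1_700_000}),
]
print(fuse(IMPORTED))
```

In a curated workflow, the groups returned by `suggest_duplicates` would be shown to an editor for confirmation before `merge` is applied.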

Structured and semi-structured data from the web is an excellent use case for our software platform, since there we come across all the interesting cases of real-world data heterogeneity. But what I think is especially powerful and yet missing in other Linked Data projects I know, is the ability to subscribe to update-feeds. We do that extensively, fetching updates in real-time from Wikipedia and the like.

Our platform is built in Scala and runs on a cluster of machines, with workers communicating through a messaging system. We developed an RDF storage layer on top of a distributed key-value store for storing all provenance information used in the extraction process, currently around 100 million named graphs for uberblic.org. That storage layer does not directly provide SPARQL access, so we push all the output data into a SPARQL endpoint hosted by Talis as well.

What have been the biggest challenges in tackling the integration issues of dispersed data?

It was quite a steep learning curve to do Linked Data not only in an academic environment, but in a reliable, industry-strength set-up. In academia, there was always the excuse that things are just research prototypes. Now that excuse is gone. That’s also where it becomes necessary to manually clean up data. And there are two ways to do that: Either you enable the users to change facts directly in your repository after you have imported the external data (that is what Freebase does), or you facilitate clean-up cycles in the original data source and fetch these updates in real-time. That is what we do.

I believe that data publishing must happen in a distributed style, because then each data source gets taken care of by a specialized group of people using specialized tools. And it’s what you see not only on the web, but also inside organizations and enterprises. But consuming data through centralized APIs is more than just convenient. We all use Google or another search engine as a central access point to web pages which are published in a distributed way all over the web, don’t we? Can you imagine today researching a topic on the web without the centralization power of search engines, just by following links across web sites, like in the old days?

When we built the Uberblic Platform, some of the things I imagined to be large headaches, like schema mapping, turned out to work really well. Those pathologic cases you often see in academic “challenges” are – well – pathologic. It’s not necessary to solve them fully automatically through super-intelligent algorithms. Much more important than the sophistication of your algorithms are well-designed workflows, so that the user becomes a part of the solution. And that’s not about crowd-sourcing or swarm intelligence: the editorial curating of schema mappings and object reconciliation can be done by just a small team of people, if they have the right set of tools.

What are the next plans with uberblic.org? Where will the journey go?

Uberblic.org will continue to integrate more interesting and useful data sources from the web, and we will start making more APIs available to web developers to build their applications on top. We are also looking for partners who are interested in developing applications and have been struggling in the past to get the cross-source data from the web they need.

The work on improving uberblic.org will also benefit our Uberblic Platform, and hence our clients who use that same software for integrating organizational data sources with each other and with the web of data.

About Georgi Kobilarov

Georgi is founder and managing director of Uberblic Labs, a company based in Berlin specialized in Linked Data integration. He worked as a research associate in the Web-based Systems Group at Freie Universität Berlin and as a visiting researcher at Hewlett Packard Labs Bristol. As co-founder and lead developer of DBpedia, he was also a day-one contributor to the Linking Open Data project. Georgi is consulting with the BBC on several Linked Data related projects. He organizes the Web of Data Meetup London, a bi-yearly gathering of the UK Linked Data community. Georgi graduated with a Diplom in business administration from Freie Universität Berlin and has many years of work experience as a software developer. Visit his blog: http://blog.georgikobilarov.com


Interview with Marco Neumann: “It’s definitely an exciting time to be on the Semantic Web!”

March 25, 2010 By: Tassilo Pellegrini Category: Linked Data & Open Data, Miscellaneous, Semantic Web Applications, Software Development No Comments →

Marco Neumann is an Information Scientist and CEO of KONA, a consulting and technology service company based in New York City. The Semantic Web activist is an invited expert to the W3C HTML 5 working group. He recently started a discussion on the challenges and difficulties in bringing the Semantic Web into business. SWC asked him for some additional comments.

Marco, you recently initiated a discussion in a Google Group on the difficulty of changing Semantic Web standards. What was the background of the discussion? Where do you perceive a need for action?

It’s not so much about changing these existing standards as about the challenge of bringing them into the world of practitioners and standards developers. The language used in W3C recommendations quite frequently requires advanced topic knowledge and familiarity with the jargon of the discussion about the respective technologies. I recently discussed this with a senior standards maven at the W3C and got the answer that the recommendations can’t be changed retrospectively and that they are intended to be used primarily by vendors for implementation purposes.

Well, this might be the case, but I also got the impression that Tim Berners-Lee’s objective for the W3C is primarily to meet the needs of a larger community. And the W3C took this into account for most of the Semantic Web recommendations in the past. Something I still find amazing is the fact that the work process at the W3C is partially and the recommendations are entirely publicly accessible. Though we definitely still need more and better tools to work with semantic web data, higher quality documentation and last but not least more user adoption on the web.

Critics of the Semantic Web often refer to the slow uptake of Semantic Web standards by industry. Is standards adoption actually a valid and sufficient metric to evaluate the maturity of a standard? What would be needed to accelerate the uptake?

I think we might see a scenario similar to the uptake of HTML in the early 90s: a relatively small number of technology mavens will pave the way towards making the Semantic Web more attractive as a technology solution for a wide range of applications and will successfully publish open data before we see business application developers make use of Semantic Web standards.

The availability of trustable and quality approved RDF data is crucial for the success of the Semantic Web. Given the fact that the aggregation business on the WWW is highly concentrated the corresponding formula is simple: If Google just consumes but does not give back RDF the Semantic Web won’t scale. Do you agree?

Yes and no. Yes, we need better and more semantic data on the Web, but we will also need better ways to deal with trust in a lightweight and web-friendly fashion. I currently see a number of semi-automated approaches emerging that could scale on the web. Examples are distributed user-based recommendation systems to validate authenticity, open Wikipedia-style community evaluation and content curation à la Freebase. Increased public accountability for data producers might be an interesting avenue as well. With regard to Google, I’d say web search engines will go where the web goes. A problem I might see arising is that web search engines will initially develop their own standards to deal with the emerging Semantic Web and confuse users on the web or might pursue a time-consuming power play with the W3C. I see a little bit of that in the current discussion in the HTML 5 working group.

As we know from social sciences technological standards are necessary but always incomplete and unsatisfactory. From a standards design and outreach perspective: What would it need to make the Semantic Web flourish?

I’m not sure if we really know all that much about the laws of innovation and the evolution of technology standards at this point. If we draw from the short experience with the World Wide Web, I would come to the conclusion that innovation takes place in small to medium size teams that pursue an independent vision of how services should be delivered and how the technology should be designed. In addition, Tim Berners-Lee encourages the production of lots and lots of data to bootstrap the Semantic Web and create a pull for services in the industry. And indeed we really see some traction, for example with the Linked Open Data and Open Government initiatives. It’s definitely an exciting time to be on the Semantic Web!

About Marco Neumann

Marco Neumann is an Information Scientist and CEO of KONA, a consulting and technology service company based in New York City. KONA provides semantic technologies to business solutions and adds value to products and services in a highly networked economy. In addition, Marco currently acts as an Invited Expert to the W3C on the HTML 5 working group and is the director of the global semantic social network lotico.com.


Linking Open Data to Thesaurus Management

February 16, 2010 By: Tassilo Pellegrini Category: Corporate Semantic Web, Knowledge Management, Linked Data & Open Data, Search Engines, Semantic Web Applications, Software Development 2 Comments →

The Vienna-based company punkt. netServices is just about to release a demo version of their PoolParty service, a SKOS-based thesaurus management tool with linked data capabilities. I had the chance to pre-read a white paper and test their service. Here is a brief overview. You can also try a demo.

Purpose

PoolParty was conceived to facilitate various applications like

  • Semantic search engines
  • Recommender systems (similarity search)
  • Corporate bookmarking
  • Annotation- & tag recommender systems
  • Autocomplete services and facetted browsing.

These use cases can be either achieved by using PoolParty stand-alone or by integrating it with existing Enterprise Search Engines and Document Management Systems or Enterprise Wikis.

Thesaurus Management

PoolParty aims to be easy to use for people without a strong Semantic Web background or special technical skills. The GUI is entirely web-based and utilizes AJAX so the user can e.g. quickly merge two concepts via drag & drop. An overview of the thesaurus can be gained with a tree or a graph view on the concepts.


PoolParty also helps to semi-automatically add concepts to a thesaurus as it can be used to analyse documents (e.g. web pages or PDF files) relevant to a thesaurus’ domain in order to glean candidate terms. This is done by the key-phrase extractor of KEA. The extracted terms can be selected by the user, thereby becoming “free concepts” which later can be integrated into the thesaurus, turning them into “approved concepts”.

Documents can be searched in various ways: by keyword search in the full text, by searching for their tags, or by semantic search and similarity search. Semantic search takes into account not only a concept’s preferred label but also its synonyms and the labels of its related concepts. The user can manually remove query terms used in semantic search, and the boost values for the various relations considered may also be adjusted. The recommendation mechanism for document similarity calculation works in the same way.
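
To make the idea of query expansion with adjustable boosts concrete, here is a minimal Python sketch. It is an assumption-laden illustration: the thesaurus layout, the boost values and the scoring function are all hypothetical, since the white paper summary does not describe PoolParty's actual data model or ranking.

```python
# Hypothetical thesaurus and boost values; PoolParty's actual data model
# and scoring are not described in the source.
THESAURUS = {
    "cocktail": {
        "prefLabel": "cocktail",
        "altLabels": ["mixed drink"],
        "related": ["bartender"],
    },
    "bartender": {"prefLabel": "bartender", "altLabels": ["barkeeper"], "related": []},
}

# Adjustable weight per relation type considered in semantic search.
BOOSTS = {"prefLabel": 1.0, "altLabel": 0.8, "related": 0.5}


def expand_query(concept_id, thesaurus, boosts, removed=()):
    """Return {term: boost} for a concept, its synonyms and related labels."""
    concept = thesaurus[concept_id]
    terms = {concept["prefLabel"]: boosts["prefLabel"]}
    for label in concept["altLabels"]:
        terms.setdefault(label, boosts["altLabel"])
    for related_id in concept["related"]:
        terms.setdefault(thesaurus[related_id]["prefLabel"], boosts["related"])
    # The user may manually remove expansion terms.
    return {t: b for t, b in terms.items() if t not in removed}


def score(document_text, weighted_terms):
    """Sum the boosts of all query terms that occur in the document."""
    text = document_text.lower()
    return sum(boost for term, boost in weighted_terms.items() if term in text)


query = expand_query("cocktail", THESAURUS, BOOSTS)
print(query)
print(score("A bartender shakes a mixed drink.", query))
```

A document that never mentions "cocktail" still scores here because it matches a synonym and a related concept, which is the behaviour the paragraph above describes.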

PoolParty by default also publishes a Semantic Wiki version of its thesauri, which provides an alternative way to browse and edit concepts. Through this feature anyone can get read access to a thesaurus, and optionally also edit, add or delete labels of concepts. Search and autocomplete functions are available here as well. The Wiki’s XHTML source is also enriched with RDFa, thereby exposing all RDF metadata associated with a concept to be picked up by RDF search engines and crawlers. (See two examples: Cocktail thesaurus, Standard Thesaurus for Economics)

PoolParty also supports the import of thesauri in SKOS (including several consistency checks) or Zthes format. Those functionalities can also be consumed as stand-alone web services via PoolParty SKOS Services. Additionally, lists of concepts and their labels can be imported via CSV files.

Linked (Open) Data

PoolParty not only publishes its thesauri as Linked Open Data (in addition to a SPARQL endpoint), but it also consumes LOD in order to expand thesauri with information from LOD sources.

Concepts in the thesaurus can be linked to e.g. DBpedia via a service like Georgi Kobilarov’s DBpedia lookup service, which takes the label of a concept and returns possible matching candidates. The system suggests relevant resources from DBpedia and the user can select the one that matches the concept from the thesaurus, thereby creating a skos:exactMatch relation between the concept URI in PoolParty and the DBpedia URI. The same approach can be used to link to other SKOS thesauri available as Linked Data.
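
The linking step described here boils down to three moves: look up candidates by label, let the user choose, and record a skos:exactMatch triple. The sketch below illustrates this in Python under stated assumptions: `lookup_candidates` is a hypothetical stand-in for a lookup service (it searches a small local index rather than calling DBpedia), and the concept and candidate URIs are illustrative.

```python
# The SKOS property URI is the standard one; everything else (index
# contents, concept URI, function names) is a hypothetical illustration.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Stand-in for the data a lookup service would search.
CANDIDATE_INDEX = {
    "http://dbpedia.org/resource/Mojito": "Mojito",
    "http://example.org/resource/Mojito_Club": "Mojito Club",
    "http://dbpedia.org/resource/Margarita": "Margarita",
}


def lookup_candidates(label, index):
    """Return candidate URIs whose label contains the concept label."""
    needle = label.lower()
    return [uri for uri, candidate_label in index.items()
            if needle in candidate_label.lower()]


def link_concept(concept_uri, chosen_uri):
    """Build the triple recorded once the user has picked a candidate."""
    return (concept_uri, SKOS_EXACT_MATCH, chosen_uri)


def to_ntriples(triple):
    """Serialize one URI-only triple as an N-Triples line."""
    return "<{}> <{}> <{}> .".format(*triple)


candidates = lookup_candidates("Mojito", CANDIDATE_INDEX)
# In an interactive tool the user would choose; here the first is taken.
triple = link_concept("http://example.org/thesaurus/mojito", candidates[0])
print(to_ntriples(triple))
```

Keeping the user's choice as an explicit step matters because label matching alone returns false candidates, such as the second entry in the index above.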


Other triples can also be retrieved from the target data source, e.g. the DBpedia abstract can become a skos:definition and geographical coordinates can be imported and be used to display the location of a concept on the map, where appropriate. The DBpedia category information may also be used to retrieve additional concepts of that category as siblings of the concept in focus, in order to populate the thesaurus.

PoolParty is capable of importing a SKOS thesaurus from a Linked Data server, and may also receive updates to thesauri imported this way. This feature has been implemented in the course of the KiWi project funded by the European Commission. KiWi also contains SKOS thesauri and exposes them as LOD. Both systems can read a thesaurus via the other’s LOD interfaces and may write it to their own store. This is facilitated by special Linked Data URIs that return e.g. all the top-concepts of a thesaurus, with pointers to the URIs of their narrower concepts, which allow other systems to retrieve a complete thesaurus through iterative dereferencing of concept URIs.
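
Iterative dereferencing is essentially a graph traversal that starts at the top-concept URI and follows pointers to narrower concepts. A minimal Python sketch, assuming a hypothetical `dereference` function that returns a concept description with a `narrower` list; here it reads from an in-memory dictionary instead of issuing HTTP requests, and all URIs are made up:

```python
from collections import deque

# FAKE_WEB stands in for HTTP: looking up a URI returns its description.
# URIs and the description layout are assumptions for this sketch.
FAKE_WEB = {
    "http://example.org/thesaurus/top": {
        "label": "Top concepts",
        "narrower": ["http://example.org/concept/drinks"],
    },
    "http://example.org/concept/drinks": {
        "label": "Drinks",
        "narrower": [
            "http://example.org/concept/cocktails",
            "http://example.org/concept/beer",
        ],
    },
    "http://example.org/concept/cocktails": {"label": "Cocktails", "narrower": []},
    "http://example.org/concept/beer": {"label": "Beer", "narrower": []},
}


def dereference(uri):
    """Hypothetical replacement for an HTTP GET on a Linked Data URI."""
    return FAKE_WEB[uri]


def retrieve_thesaurus(start_uri, dereference_fn):
    """Breadth-first retrieval of every concept reachable via 'narrower'."""
    retrieved = {}
    queue = deque([start_uri])
    while queue:
        uri = queue.popleft()
        if uri in retrieved:
            continue  # already fetched; also guards against cycles
        description = dereference_fn(uri)
        retrieved[uri] = description
        queue.extend(description.get("narrower", []))
    return retrieved


thesaurus = retrieve_thesaurus("http://example.org/thesaurus/top", dereference)
print(sorted(d["label"] for d in thesaurus.values()))
```

The visited check keeps the traversal finite even if two concepts point at each other, which real thesauri occasionally do.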

Additionally KiWi and PoolParty publish lists of concepts created, modified, merged or deleted within user specified time-frames. With this information the systems can learn about updates to one of their thesauri in an external system. They then can compare the versions of concepts in both stores and may write according updates to their own store.

This means each system decides autonomously which data it accepts and there is no risk of a system pushing data that might lead to inconsistencies into an external store. Data transfer and communication are achieved using REST/HTTP; no other protocols or middleware are necessary. Also, no rights management for each external system is needed, which otherwise would have to be configured separately for each source.
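
The pull-based update model can be sketched as follows: the consuming system reads a published change list and applies only the entries its own policy accepts. This Python sketch is hypothetical throughout. The change-list layout, the `newer_than_local` policy and the URIs are assumptions, and it handles created, modified and deleted entries only, leaving out merges.

```python
# Hypothetical data layout; neither KiWi's nor PoolParty's actual
# change-list format is given in the source.

def newer_than_local(change, local_store):
    """Example policy: accept a change only if it is newer than the local copy."""
    local = local_store.get(change["uri"])
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return local is None or change["modified"] > local["modified"]


def apply_remote_changes(local_store, change_list, accept):
    """Pull changes and write only those the local policy accepts."""
    applied = []
    for change in change_list:
        if not accept(change, local_store):
            continue  # the local system stays in control of its own store
        if change["action"] == "deleted":
            local_store.pop(change["uri"], None)
        else:  # "created" or "modified"
            local_store[change["uri"]] = change["concept"]
        applied.append(change["uri"])
    return applied


LOCAL = {
    "http://example.org/c/1": {"label": "Cocktail", "modified": "2010-01-10"},
    "http://example.org/c/2": {"label": "Bar", "modified": "2010-02-01"},
}
CHANGES = [
    {"uri": "http://example.org/c/1", "action": "modified", "modified": "2010-02-01",
     "concept": {"label": "Cocktails", "modified": "2010-02-01"}},
    {"uri": "http://example.org/c/2", "action": "modified", "modified": "2010-01-15",
     "concept": {"label": "Pub", "modified": "2010-01-15"}},  # stale, rejected
    {"uri": "http://example.org/c/3", "action": "created", "modified": "2010-02-05",
     "concept": {"label": "Mocktail", "modified": "2010-02-05"}},
]
applied = apply_remote_changes(LOCAL, CHANGES, newer_than_local)
print(applied)
```

Because the remote system only publishes the list and never writes, swapping in a stricter `accept` policy requires no coordination with the other side.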

Technology

The software is written in Java and utilizes the SAIL API, so it can be used with various triple stores. The thesaurus management itself (viewing, creating and editing SKOS concepts and their relationships) can be done in an AJAX Frontend based on Yahoo User Interface (YUI). Editing of labels can alternatively be done in a Wiki style HTML frontend. For key-phrase extraction from documents PoolParty uses a modified version of the KEA 5 API, which is extended for the use of controlled vocabularies stored in a SAIL Repository (this module is available under GNU GPL). The analysed documents can be stored and indexed in Lucene/Solr or any other (enterprise) search system along with extracted and semantically related concepts.
