Thomas Thurner

Hjalmar Gislason: “What I call the emerging field of Data Market.”

Open Government Data, and Open Data provided by the corporate sector, stimulate an upcoming market segment: Commercial Open Data Services. The islandic StartUp datamarket.com is on of the emerging companies in this field. Thomas Thurner from Semantic Web Company had the chance to talk to Hjalmar Gislason, founder and CEO of datamarket.com.

Semantic Puzzle: What’s the business idea behind datamarket.com? Whom do you expect to pay for what?
Hjalmar Gislason: From the end-user perspective its easiest to describe datamarket.com as a search engine for statistical data, a “Google for statistics” if you will. Any data that is already available open and for free out there will still be open and free on DataMarket, just easier to find, use, compare and download from a single source. While the audience for a search engine for statistical content is obviously way smaller than for text content, a significant part of that audience is business users, looking for data for business reasons. This means that there are more direct and lucrative methods to monetize the usage than simply contextual ads – especially in reselling access to premium data. This is a market that already turns over billions of dollars annually, but is as far from any of the “2.0 world” as one could possibly imagine (think Bloomberg, ReutersFactSet). We believe there is an opportunity to disrupt a part of their business with a freemium approach, and furthermore open up the data market by reaching a business audience outside the narrowly defined financial user base that these companies cater to. There is data out there – free and premium alike – that can help almost any business make better plans and decisions. Connecting people and businesses to the data that they need will release phenomenal value. Tapping into just a fraction of that will be a hugely successful business for those that get it right.

Semantic Puzzle: Can you tell me a bit about the technological framework behind datamarket.com? How is the content from third parties is feeded
into the system, and which APIs do you use? As you provide mainly XLS and CSV, have you thought, to provide data also als XML in future?
Hjalmar Gislason: The backend system is written in Python. We read data from the sources in various different formats, ranging from Excel files and even scraping of web pages to proprietary APIs and Web Services. The data is then stored in a normalized format in a Postgres database that we’re using in a pretty unique way to be able to efficiently store the billions of time series and fact values that the system will eventually hold (currently at around 100 million time series and 600 million fact values). The web site is also written in Python, using the Django framework, but also making use of a lot of javascript libraries (and a bunch of our own code) to allow for an exciting user experience. We’re currently using a Flash-based solution called amCharts for the charts, but have already taken some steps to replace that with our own solution that we’ve written on top of the excellent Protovis visualization library. While you are right that the export formats we provide for end users are XLS, CSV and images (for exporting the graphs), our REST-ful API actually supports XML and JSON formats as well. So we already provide data as XML.

Semantic Puzzle: As you for sure know Tim Berners-Lee’s 5-stars scheme for OGD-Providers. Where do you se your own service in this framework?
Hjalmar Gislason: Any fact value, time series and data set on DataMarket is “addressable” with a direct URL using our API. In that sense, all the data on DataMarket is four-star data according to Berners-Lee’s definition. In many cases we’re integrating to data that is only one or two star data, so just by integrating it into our system we’ve moved it a few notches up that ladder. In some cases we’ve even been helping organizations publishing data for the first time, taking the data from 0 to 4 stars in one go. We’ve been toying around with several ideas that would take – or enable users to take – the data all the way to 5-star status, but that’s still just on the drawing table.

Semantic Puzzle: You re-use a lot of Open Data comming from the Island Government. Is there also a state-owned Data Portal for Island, or is
your service a “commercial replacement” for such a public effort?
Hjalmar Gislason: There is no government-operated data portal in Iceland, and to my knowledge there are no plans for implementing one yet. Sadly there are several more pressing issues in terms of eGovernment here that take higher priority. We don’t see our efforts as a replacement for such a portal, but we have managed to fulfill a little part of that role when it comes to statistical data. We’ve also been really vocal about the benefits of open data and among other things been influential in launching an open data wiki - opingogn.net (Icelandic only) – that exmplains the concepts with examples and use cases and attempts to list in a directory listing as many sources of government data as possible. There is some movement, but as an open data enthusiast I’d really like to see things happening faster. As a matter of fact I think there are reasons for Iceland to be extra enthusiastic about open data to increase transparency and restore trust after the crash of the banks and the economic system in 2008.

Semantic Puzzle: A lot of commercial Open Data Services (Socrata, Factual, Google …) are evolving at the moment. What do you think, which development this market segment will face in the next month and years, and are you able to list your sight on the crucial factors for such business?
Hjalmar Gislason: I’ve been writing quite a lot up on the developments in this industry on our blog. One of the things I’ve written the most about is what I call the Emerging field of Data Market“. I define “data markets” as “Services that make it easy to find data from a range of secondary data sources, then consume or acquire the data in a usable – and often unified – format.” Many of these services are trying to create marketplaces for data, envisioning that data providers can offer their data sets for sale to data seekers. As there are several players in this space already, I believe we’ll see many of them try to differentiate themselves in 2011 by focusing on specific types of data. There are definitely opportunities in building specialized data markets for geospatial data, for statistics and for enormous scientific data sets – to name a few types – and each comes with their own challenges, target audiences and preferred approaches. In the spirit of doing one thing and doing it well, I think most of these projects will want to see success in one such segment of the market before generalizing – or consolidating.


The interviewee: Hjalmar is a successful entrepreneur, founder of three startups in the gaming, mobile and web sectors since 1996. Prior to launching DataMarket, Hjalmar worked on new media and business development for companies in the Skipti Group (owners of Iceland Telecom) after their acquisition of his search startup – Spurl. Hjalmar offers a mix of business, strategy and technical expertise.

Thomas Schandl

Transforming spreadsheets into SKOS with Google Refine

Looking for high quality enterprise vocabularies we recently turned our attention to the Global Industry Classification Standard (GICS), which is an industry taxonomy designed to categorize any private company. It was developed by Morgan Stanley Capital International and Standard & Poor’s and is mainly used by the global financial community to aid in the investment research process.

It is available for download as .xls spreadsheet files in several languages. Of course it would be much better to have this valuable taxonomy in a standard and machine-readable format. The Simple Knowledge Organization System SKOS is a perfect fit for a taxonomy like GICS. But how to turn a spreadsheet into SKOS with minimal manual effort?

I chose to try Google Refine for this task, as recently a promising RDF extension had been released by DERI‘s Fadi Maali and Richard Cyganiak.

Google Refine is “a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases”. Previously it was known as Freebase Gridworks which is now further developed by Google since its acquisition of Metaweb.

Refine

Google Refine UI

Refine is a very useful tool to filter and consequently transform rows, colums and cells according to customizable patterns.

After applying all necessary transformations to the spreadsheet one can edit the “RDF Skeleton”, where the columns can be mapped to literals, RDF properties and RDF classes (which can be imported from their namespaces).

RDF Sekeleton

Editing the RDF Sekeleton

Once you got your valid SKOS model ready you can export it in RDF/XML or Turtle format. Then you may want to load it into an ontology editor like Protégé or a thesaurus management tool like PoolParty in order to build upon it or connect it to other knowledge models. With PoolParty the GICS taxonomy can also be utilized to tag and categorize documents, provide semantic search and facetted navigation and it can be published as Linked Data without further effort.

GICS in PoolParty screenshot

GICS loaded in PoolParty

Working with Refine and its RDF extension was easy and fun. It’s even possible to isolate and save the transformation steps done with Refine, so one can re-apply them on similar structured spreadsheets. This came in very handy as GICS is published in nine languages and as many separate, identically structured spreadsheets.

Thomas Schandl

Drupal and the Semantic Web – Interview with Stéphane Corlosquet

Stéphane Corlosquet has been the main driving force in incorporating Semantic Web capabilities into Drupal. In the recent release of Drupal 7, Semantic Web technologies became part of the core of this popular CMS, which is used to power at least 1% of all the world’s web sites.

Drupal is the leading CMS when it comes to implementing Semantic Web standards. What are the reasons for this, what makes Drupal such a good fit for Semantic Web technologies?

Historically, Drupal is known to be web standard compliant. It supported the RDF-based aggregation format known as RSS 1.0 as early as in 2001, which was later upgraded to RSS 2.0. The Drupal community prides itself in valid HTML code, not only for the code generated by Drupal, but also by taking the extra step of automatically fixing faulty HTML entered by its users. Drupal has been using XHTML since its version 4.0 in 2002. The next logical step beyond XHTML was to add a layer of semantics with the RDFa standard, a W3C recommendation published in 2008.

There are definitely many reasons that contributed to the addition of RDFa into Drupal 7. The first comes from the Drupal project lead, Dries Buytaert, who is passionate about the web and open source. Secondly, the growing Drupal community is very web savvy and includes many experts from different backgrounds in accessilibity, CSS, HTML, security etc. As a result, every release of Drupal includes many latest standards. The community meets twice a year at conferences (DrupalCons), thes events play a great role in hashing out what technologies or designs will be incorporated into the next version of Drupal. Because of the flexibility of its internal architecture, Drupal is able to keep up with the latest web standards. Content in Drupal is very structured and provides site administrators with a user interface to build the site structure they want, using entity types, content types, fields and taxonomies for categorization. When it comes to other CMSs, Joomla!’s community appears to be more fragmented with a core software that is not as extensible as Drupal and WordPress is more of a blogging platform, so turning it into a full blown CMS can be challenging. Both WordPress and Joomla! are in fact adapting the concept of Drupal’s Content Construction Kit (CCK) to their software but they have not yet reached the same level of maturity as Drupal.

A common objection to the adoption of Semantic Web technologies is that the learning curve is steep and that it is too complicated for many web developers to get into it. How can Drupal 7 change that? Which features accessible for the average web site operator will it offer?

Semantic Web technologies don’t have to be complicated when applied to simple use cases! We purposely chose only of a subset of semantic web technologies to integrate into the core of Drupal, keeping the learning curve for the Drupal developers and users as low as possible. The main technology is RDFa which includes the notions of vocabularies (a schema, or collection of attributes) as well as Compact URIs (CURIEs) which make the authoring of RDFa easier. In fact, some web developers might have come across these notions before when working with Dublin Core in the meta tags as such dc:title or dc:date.

Which benefits will web site owners get when they switch to a semantics enabled Drupal 7?

Google and Bing increasingly rely on machine-readable structured data from the websites that they crawl. The design of Drupal 7 embeds semantic meta data that makes machine-to-machine (M2M) search native for a Drupal 7 website. RDFa can add value by giving search engines more details such as the latitude and longitude of a venue for display on a map; or providing the ISO date format for localization and proper display in the search results for different countries.

What are your hopes regarding the development of other applications that either provide or consume data from D7 sites? Which improvements of standards, best practices or (lightweight) ontologies in the Semantic Web community would you like to see?

Services like Sig.ma are already able to collect semantic data from different sources and display it in new ways in the form of mash-ups. Eventually, these services that consume semantic data will not be just Drupal specific, as more platforms jump on the semantic web band wagon. What I hope to see as improvements or best practices in the future are more well-maintained vocabularies. Many of the existing vocabularies are over engineered, some fail to de-reference properly. Their is also some work to be done in order to improve the tooling made available to web developers as well as introducing the simple concepts of Linked Data to web developers via easy to read documentation.

Thank you for this interview, Stéphane!

Thomas Thurner

EU-Report on the requirements for a paneuropean Open Government Data Portal

The recently published report on a hearing of an experts in Luxembourg this November, provides a snap-shoot on the discussion if a central open data infrastructure may make sense. The experts group list several positive effects like union-wide comparability of some government data set, as well as the role of being motor for national and regional initiatives. It is stressed several times, that a swift progress, in coming those plans reality, is crucial for success.

Read more at: Report – Technical workshop on the goals and requirements for a pan-European data portal