The Semantic Puzzle

Timea Turdean

Ready to connect to the Semantic Web – now what?

As an open data fan or as someone who is just looking to learn how to publish data on the Web and distribute it through the Semantic Web you will be facing the question “How to describe the dataset that I want to publish?” The same question is asked also by people who apply for a publicly funded project at the European Commission and want to have a Data Management plan. Next we are going to discuss possibilities which help describe the dataset to be published. 

The goal of publishing the data should be to make it available for access or download and to make it interoperable. One of the big benefits is to make the data available for software applications which in turn means the datasets have to be machine-readable. From the perspective of a software developer some additional information than just name, author, owner, date…  would be helpful:

  • the condition for re-use (rights, licenses)
  • the specific coverage of the dataset (type of data, thematic coverage, geographic coverage)
  • technical specifications to retrieve and parse an instance (a distribution) of the dataset (format, protocol)
  • the features/dimensions covered by the dataset (temperature, time, salinity, gene, coordinates)
  • the semantics of the  features/dimensions (unit of measure, time granularity, syntax, reference taxonomies)

To describe a dataset the best is always to look first at existing standards and existing vocabularies. The answer is not found looking only at one vocabulary but at several.

Data Catalog VocabularyDCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. (DCATDCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web.)

DCAT is an RDF Schema vocabulary for representing data catalogs. It is an RDF vocabulary for describing any dataset, which can be standalone or part of a catalog.

Vocabulary of Interlinked DatasetsVocabulary and a set of instructions that enables the discovery and usage of linked datasets. (http://vocab.deri.ie/void) (VoIDVocabulary and a set of instructions that enables the discovery and usage of linked datasets. (http://vocab.deri.ie/void))

VoID is an RDF vocabulary, and a set of instructions, that enable the discovery and usage of linked data sets. VOID is an RDF vocabulary for expressing metadata about RDF datasets.

Data CubeThere are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using RDF. ... vocabulary

Data Cube vocabulary is focused purely on the publication of multi-dimensional data on the web. It is an RDF vocabulary for describing statistical datasets.

Asset Description Metadata SchemaADMS, the Asset Description Metadata Schema, is a profile of DCAT for describing so-called Semantic Assets (or just 'Assets'), that is, highly reusable metadata (e.g. xml schemata, generic data models) and reference data (e.g. code lists, taxonomies, dictionaries, vocabularies) that are used for ... (ADMSADMS, the Asset Description Metadata Schema, is a profile of DCAT for describing so-called Semantic Assets (or just 'Assets'), that is, highly reusable metadata (e.g. xml schemata, generic data models) and reference data (e.g. code lists, taxonomies, dictionaries, vocabularies) that are used for ...)

ADMS is a W3CThe World Wide Web Consortium (W3C) is the main international standards organization for the World Wide Web (abbreviated WWW or W3). Founded and currently led by Tim Berners-Lee, the consortium is made up of member organizations which maintain full-time staff for the purpose of working together ... standard developed in 2013 and is a profile of DCAT, used to describe semantic assets.

You will find only partial answers of how to describe your dataset in existing vocabularies while some aspects are missing or complicated to express.

  1. Type of data – there is no specific property for the type of data covered in a dataset. This value should be machine readable which means it should be standardized, possibly to an URI which can be de-reference-able to a thing. And this ‘thing’ should be part of an authority list/taxonomyTaxonomy is the practice and science of classification. The word finds its roots in the Greek τάξις, taxis (meaning 'order' or 'arrangement') and νόμος, nomos (meaning 'law' or 'science'). Taxonomy uses taxonomic units, known as taxa. In addition, the word is also used as a count noun: ... which is not existing yet. However one can use the adms:representationTechnique, which gives more information about the format in which a dataset is released. This points only to dcterms:format and dcat:mediaType.
  2. Technical properties like – format, protocol etc.
    • There is no property for protocol and again these values should be machine-readable, standardized possibly to an URI.
    • VoID can help with the protocol metadata but only for RDF datasets: dataDump, sparqlEndpoint.
  3. Dimensions of a dataset.
    • SDMX defines a dimension as “A statistical concept used, in combination with other statistical concepts, to identify a statistical series or single observations.” Dimensions in a dataset can therefore be called features, predictors, or variables (depending on the domain). One can use dc:conformsTo and use a dc:Standard if the dataset dimensions can be defined by a formalized standard. Otherwise statistical vocabularies can help with this aspect which can become quite complex. One can use the Data Cube vocabulary specificallyqd:DimensionProperty, qd:AttributeProperty, qd:MeasureProperty, qd:CodedProperty in combination with skos:Concept and sdmx:ConceptRole.
      image2015-11-4 12
  4. Data provenance – there is the dc:source that can be used at dataset level but there is no solution if we want to specify the source at data record level.

In the end one needs to combine different vocabularies to best describe a dataset.

DescribeAdataset 2

The tools out there used for helping in publishing data seem to be missing one or more of the above mentioned parts.

  • CKANSystem for the creation of a registry of open knowledge packages and projects that enables to find, share and reuse open content and data, especially in ways that are machine automatable. (http://www.ckan.net) maintained by the Open Knowledge Foundation uses most of DCAT and doesn’t describe dimensions.
  • Dataverse created by Harvard University uses a custom vocabulary and doesn’t describe dimensions.
  • CIARD RING uses full DCAT AP with some extended properties (protocol, data type) and local taxonomies with URIs mapped when possible to authorities.
  • OpenAIRE, DataCite (using re3data to search repositories) and Dryad use their own vocabularies.

The solution to these existing issues seem to be in general, introducing custom vocabularies.

References: