The Semantic Puzzle

Christian Blaschke

SPARQL analytics proves boxers live dangerously

You have always thought that SPARQL is only a query language for RDF data? Then think again, because SPARQL can also be used to implement some cool analytics. I show here two queries that demonstrate that principle.

For simplicity we use a publicly available dataset of DBpediaDBpedia is a project aiming to extract structured information from the information created as part of the Wikipedia project. This structured information is then made available on the World Wide Web. DBpedia allows users to query relationships and properties associated with Wikipedia resources, ... on an open SPARQL endpoint: http://live.dbpedia.org/sparql (execute with default graphIn mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges. ...http://dbpedia.org).

Mean life expectancy for different sports

The query shown here starts from the class dbp:Athlete and retrieves sub classes thereof that cover different sports. With that athletes of that areas are obtained and their birth and death dates (i.e. we only take into account deceased individuals). From the dates the years are extracted. Here a regular expression is used because the SPARQL function to extract years from a literal of a date type returned errors and could not be used. From the birth and death years the age is calculated (we filter for a range of 20 to 100 years because in data sources like this erroneous entries have always to be accounted for). Then the data is simply grouped and we count for each sport the number of athletes that were selected and the average age they reached.

prefix dbp:<http://dbpedia.org/ontology/>
select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
where {
filter(?age >= 20 && ?age <= 100) .
{
select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age)
where {
?subOfAthlete rdfs:subClassOf dbp:Athlete .
?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") .
bind(str(?athleteGroup) as ?athleteGroupEN)
?athlete a ?subOfAthlete .
?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
}
}
} group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg

The results are not unexpected and show that athletes in the area of motor sports, wresting and boxing die at younger age. On the other hand horse riders, but also tennis and golf players live on average clearly longer.

athleteGroupEN
count
ageAvg
wrestler 693 58.962481962481962
winter sport Player 1775 66.60169014084507
tennis player 577 71.483535528596187
table tennis player 45 68.733333333333333
swimmer 402 68.674129353233831
soccer player 6572 63.992391965916007
snooker player 25 70.12
rugby player 1452 67.272038567493113
rower 69 63.057971014492754
poker player 30 66.866666666666667
national collegiate athletic association athlete 44 68.090909090909091
motorsport racer 1237 58.117219078415521
martial artist 197 67.157360406091371
jockey (horse racer) 139 65.992805755395683
horse rider 181 74.651933701657459
gymnast 175 65.805714285714286
gridiron football player 4247 67.713680244878738
golf player 400 71.13
Gaelic games player 95 70.589473684210526
cyclist 1370 67.469343065693431
cricketer 4998 68.420368147258904
chess player 45 70.244444444444444
boxer 869 60.352128883774453
bodybuilder 27 52
basketball player 822 66.165450121654501
baseball player 9207 68.611382643640708
Australian rules football player 2790 69.52831541218638

This is especially relevant when that data is large and one would have to extract it from the database and import it into another tool to do the counting and calculations.

Simple statistical measures over life expectancy

Another standard statistical measure is the standard deviation. A good description about how to calculate it can be found for example here. We start again with the class dbp:Athlete and calculate the ages they reached (this time for the entire class dbp:Athlete not its sub classes). Another thing we need are the squares of the ages that we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result, and calculate the average age, the square of the sums and the sum of the squares. With those values we can calculate in the next step the standard deviation of the ages in our data set. Note that SPARQL does not specify a function for calculating square roots but RDF stores like VirtuosoVirtuoso Universal Server is a middleware and database engine hybrid that combines the functionality of a traditional RDBMS, ORDBMS, virtual database, RDF, XML, free-text, web application server and file server functionality in a single system. Rather than have dedicated servers for each of the ... (that hosts the DBpedia data) provide additional functions like bif:sqrt for calculating the square root of a value.

prefix dbp:<http://dbpedia.org/ontology/>
select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) / ?count)) / (?count - 1)) as ?standDev)
where {
{
select (count(?athlete) as ?count) (avg(?age) as ?ageAvg) (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum)
where {
{
select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare)
where {
filter(?age >= 20 && ?age <= 100) .
{
select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age)
where {
?subOfAthlete rdfs:subClassOf dbp:Athlete .
?athlete a ?subOfAthlete .
?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
}
}
}
}
}
}
}

 

count
ageAvg
standDev
38542 66.876290799647138 17.6479

These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data but that it is possible to implement basic statistical methods directly at the level of the triple store without the need to extract the data and import it into another tool.