This page describes a SPARQL service that exposes the statistics the query engine uses to optimize user queries over your data. The statistics summarize common patterns in the data, from single predicates to star-shaped subgraphs and binary chains, together with their occurrences in the graph.
The Stardog query engine collects rich statistics about the structure of the data in order to better optimize user queries. Some of those statistics are useful not only to the query optimizer but also to users, particularly those less familiar with the dataset. They can reveal, for example, which predicates are used in the data and how they compose into more complex patterns (such as stars or chains). They can also serve as input to other tools, such as ML models, that need to understand the data to perform well, for example to generate SPARQL queries.
The service at <tag:stardog:api:statistics> exposes the statistics as an RDF graph.
The client is expected either to process the graph on the application side or to add it to a named graph in Stardog (in the same or a different database) and query it with SPARQL.
Here is a small example of how statistics can be obtained:

    prefix stardog: <tag:stardog:api:>

    CONSTRUCT { ?s ?p ?o } WHERE {
      service stardog:statistics {
        [] stardog:rdf:subject ?s ;
           stardog:rdf:predicate ?p ;
           stardog:rdf:object ?o .
      }
    }

This query returns an RDF graph representing the summary statistics. It can be added to a named graph simply by replacing the CONSTRUCT part with insert { graph <urn:statistics> { ?s ?p ?o } } where ...
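
Written out in full, the substitution described above gives the update below. It is a minimal sketch that reuses the graph name <urn:statistics> from the text and must be run as a SPARQL update rather than as a query.

```sparql
prefix stardog: <tag:stardog:api:>

# Copy every statement produced by the statistics service
# into the named graph <urn:statistics>.
INSERT { graph <urn:statistics> { ?s ?p ?o } }
WHERE {
  service stardog:statistics {
    [] stardog:rdf:subject ?s ;
       stardog:rdf:predicate ?p ;
       stardog:rdf:object ?o .
  }
}
```

Since the update only adds triples, the target graph may need to be cleared before re-running it to avoid mixing old and new statistics.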
Stardog collects summary statistics on the following patterns of RDF data:

- Predicates: for each predicate, the number of edges, the sizes of its domain and range, and its average in-degree and out-degree.
- Binary chains: pairs of predicates that are connected in the data, including undirected chains in which the second predicate's direction is inverted, with the number of occurrences of each pair.
- Star-shaped subgraphs (SSGs): sets of predicates around a central node, for example rdfs:label, :producer, :price, etc. SSGs are often direct representations of tables in RDF. The number of distinct SSGs in enterprise datasets is often small (less than a thousand), and the statistics capture the number of central nodes for each SSG as well as counts for each predicate in it.
- Types: classes as given by rdf:type triples. The most frequent types are reported together with their estimated number of instances and the list of predicates that those instances have in the data. By default, Stardog only exposes up to 100 types that have at least 1000 instances in the data. This can be configured using the predicates stardog:statistics:types:limit and stardog:statistics:types:instances:min in the service pattern.

Instances of the same type can have different sets of predicates. For example, one person's data may contain only their name and social security properties, while another's may also include employment information. In general, a single type can therefore be associated with multiple sets of predicates, i.e. SSGs. For the sake of simplicity, however, Stardog reports a single set of predicates per type, in which some predicates can be missing for some instances and some rarely used predicates can be omitted.
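
The type-reporting thresholds can be sketched as follows. The text only states that the two predicates go in the service pattern. Attaching them to the same blank node as the other arguments is an assumption, and the values 20 and 500 are arbitrary illustrations, so check the behavior against your Stardog version.

```sparql
prefix stardog: <tag:stardog:api:>

CONSTRUCT { ?s ?p ?o } WHERE {
  service stardog:statistics {
    # Assumption: the options are given on the same node as the output bindings.
    [] stardog:rdf:subject ?s ;
       stardog:rdf:predicate ?p ;
       stardog:rdf:object ?o ;
       stardog:statistics:types:limit 20 ;           # illustrative: report at most 20 types
       stardog:statistics:types:instances:min 500 .  # illustrative: only types with at least 500 instances
  }
}
```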
The statistics are collected over all graphs (named and default) in the data that resides in Stardog, i.e. not in Virtual Graphs.
The service uses the following RDF vocabulary for exposing the statistics.
The stardog prefix stands for <tag:stardog:api:>.
- stardog:statistics:PredicateStatistics: type for predicate statistics.
  - stardog:rdf:predicate: specifies the predicate IRI.
  - stardog:statistics:count: specifies the number of edges with this predicate in the data.
  - stardog:statistics:domain: specifies the size of the predicate's domain.
  - stardog:statistics:range: specifies the size of the predicate's range.
  - stardog:statistics:in-degree: specifies the average in-degree for the predicate.
  - stardog:statistics:out-degree: specifies the average out-degree for the predicate.
  - stardog:statistics:chains: links the predicate statistics instance to the resource describing chains (including undirected ones):
    - stardog:rdf:predicate: links the chain instance to the other predicate's IRI.
    - stardog:statistics:count: specifies the number of occurrences of the chain in the data.
    - stardog:statistics:inverted: if true, the chain is undirected, i.e. the other predicate's direction is inverted.
- stardog:statistics:SubjectStarStatistics: type for SSG statistics.
  - stardog:rdf:predicate: links an SSG resource to a resource representing an outgoing predicate.
  - stardog:statistics:name: links a predicate resource to its IRI in the data.
  - stardog:statistics:count: specifies both the number of occurrences of the SSG in the data and the count of each predicate within the SSG.
- Types are represented as instances of rdfs:Class in the output. The type entity additionally uses the stardog:statistics:count and stardog:rdf:predicate predicates to link to the estimated number of instances and the list of predicates in the data.

Here are examples of querying the RDF representation of the statistics after saving it in <urn:statistics>:
The following query returns a comma-separated list of predicates for each SSG along with the number of occurrences in the data:

    prefix stardog: <tag:stardog:api:>

    select (group_concat(?pname; separator = ",") as ?predicates) ?subjects
    from <urn:statistics> {
      select * {
        ?s a stardog:statistics:SubjectStarStatistics ;
           stardog:rdf:predicate/stardog:statistics:name ?p ;
           stardog:statistics:count ?subjects .
        bind(localname(?p) as ?pname)
      } order by ?pname
    }
    group by ?s ?subjects

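
The vocabulary also attaches a count to each predicate within an SSG. As a sketch, the variant below lists those per-predicate counts next to the SSG's own count instead of concatenating the names. The variable names ?member and ?edges are illustrative, and the query assumes every predicate resource carries both a name and a count.

```sparql
prefix stardog: <tag:stardog:api:>

select ?s ?subjects ?p ?edges
from <urn:statistics> {
  ?s a stardog:statistics:SubjectStarStatistics ;
     stardog:statistics:count ?subjects ;      # occurrences of the SSG
     stardog:rdf:predicate ?member .           # resource for one outgoing predicate
  ?member stardog:statistics:name ?p ;         # the predicate's IRI in the data
          stardog:statistics:count ?edges .    # count of that predicate within the SSG
}
order by desc(?subjects) ?s ?p
```

Each SSG then appears as one row per predicate, which is easier to post-process on the application side than a concatenated string.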
The following query returns the chain statistics ordered by the left predicate. For each chain, it returns the pair of predicates and the number of occurrences in the data:

    prefix stardog: <tag:stardog:api:>

    select ?p ?q ?count from <urn:statistics> {
      [] a stardog:statistics:PredicateStatistics ;
         stardog:rdf:predicate ?p ;
         stardog:statistics:chains [ stardog:rdf:predicate ?q ;
                                     stardog:statistics:count ?count ;
                                     stardog:statistics:inverted false ]
    }
    order by ?p

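
The chain query above keeps only directed chains by matching stardog:statistics:inverted false. A possible variant, sketched here, binds the flag to a variable so that directed and undirected chains are listed together, and keeps only the ten most frequent ones. The limit of 10 is an arbitrary choice for illustration.

```sparql
prefix stardog: <tag:stardog:api:>

select ?p ?q ?inverted ?count from <urn:statistics> {
  [] a stardog:statistics:PredicateStatistics ;
     stardog:rdf:predicate ?p ;
     stardog:statistics:chains [ stardog:rdf:predicate ?q ;
                                 stardog:statistics:count ?count ;
                                 stardog:statistics:inverted ?inverted ]  # true for undirected chains
}
order by desc(?count)
limit 10
```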
The following query returns types with their estimated number of instances and the set of properties:

    prefix stardog: <tag:stardog:api:>
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    select ?type ?count (group_concat(?pname; separator = ", ") as ?predicates)
    from <urn:statistics> {
      select ?type ?pname ?count {
        ?type a rdfs:Class ;
              stardog:statistics:count ?count ;
              stardog:rdf:predicate ?p .
        bind(localname(?p) as ?pname)
      } order by ?pname
    }
    group by ?type ?count
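
The per-predicate figures from the vocabulary can be read back in the same way. The sketch below lists each predicate with its edge count, domain and range sizes, and average degrees, most frequent first. It assumes every PredicateStatistics resource carries all five properties; if some are absent in your data, the corresponding triple patterns would need to be wrapped in optional blocks. The variable names are illustrative.

```sparql
prefix stardog: <tag:stardog:api:>

select ?p ?count ?domain ?range ?inDegree ?outDegree
from <urn:statistics> {
  [] a stardog:statistics:PredicateStatistics ;
     stardog:rdf:predicate ?p ;
     stardog:statistics:count ?count ;
     stardog:statistics:domain ?domain ;
     stardog:statistics:range ?range ;
     stardog:statistics:in-degree ?inDegree ;
     stardog:statistics:out-degree ?outDegree .
}
order by desc(?count)
```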