Statistics service

This page describes a SPARQL service that exposes the statistics of your data that the query engine uses to optimize user queries. The statistics summarize common patterns in the data, from single predicates to star-shaped subgraphs and binary chains, along with their occurrences in the graph.


Introduction

The Stardog query engine collects rich statistics about the structure of the data in order to better optimize user queries. Some of those statistics can be useful not only to the query optimizer but also to users, particularly those less familiar with the dataset. For example, the statistics can reveal which predicates are used in the data and how they compose with each other into more complex patterns (such as stars or chains). They can also be used as input to other tools, such as ML models, which need to understand the data to perform well, for example, to generate SPARQL queries.

The service at <tag:stardog:api:statistics> exposes the statistics as an RDF graph. The client is expected either to process the graph on the application side or to add it to a named graph in Stardog (in the same or a different database) and query it with SPARQL. Here is a small example of how the statistics can be obtained:

prefix stardog: <tag:stardog:api:>

CONSTRUCT { ?s ?p ?o } WHERE {
    service stardog:statistics {
        [] stardog:rdf:subject ?s ;
           stardog:rdf:predicate ?p ;
           stardog:rdf:object ?o .
    }
}

This query returns an RDF graph representing the summary statistics. It can be added to a named graph simply by replacing the CONSTRUCT part with insert { graph <urn:statistics> { ?s ?p ?o } } where ....
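As a sketch of that substitution, the complete statement might look as follows. The graph name <urn:statistics> is the one used in the sentence above; note that this is a SPARQL Update rather than a read query, so it has to be run as an update against the target database.

```sparql
prefix stardog: <tag:stardog:api:>

# Copy every triple returned by the statistics service
# into the named graph <urn:statistics>.
INSERT { GRAPH <urn:statistics> { ?s ?p ?o } }
WHERE {
    service stardog:statistics {
        [] stardog:rdf:subject ?s ;
           stardog:rdf:predicate ?p ;
           stardog:rdf:object ?o .
    }
}
```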

Statistics overview

Stardog collects summary statistics on the following patterns of RDF data:

Predicates. For each predicate Stardog computes the number of edges in the graph, the domain and range size, and average in- and out-degrees.
Star-shaped subgraphs (SSG). Each SSG is identified by a set of predicate IRIs which emanate from a single node in the data. Such patterns are a common way to represent business objects with properties, for example, a product can be represented as a single node with outgoing predicates like rdfs:label, :producer, :price, etc. SSGs are often direct representations of tables in RDF. The number of distinct SSGs in enterprise datasets is often small (less than a thousand) and the statistics capture the number of central nodes for each as well as counts for each predicate in an SSG.
Types. As far as the statistics are concerned, types are entities used in the object position of rdf:type triples. The most frequent types are reported together with their estimated number of instances and the list of predicates that those instances have in the data. By default, Stardog only exposes up to 100 types that have at least 1000 instances in the data. These limits can be configured using the predicates stardog:statistics:types:limit and stardog:statistics:types:instances:min in the service pattern.

Instances of the same type could have different sets of predicates. For example, one person's data can have only their name and social security properties while another may also include employment information. In general, a single type can therefore be associated with multiple sets of predicates, i.e. SSGs. However, for the sake of simplicity, Stardog only reports a single set of predicates per type where some predicates can be missing for some instances and some rarely-used predicates can be omitted.

Binary chains. Binary chains are paths in the graph with a length of two. They're characterized by the left predicate and the right predicate. For each sufficiently frequent chain in the graph, Stardog maintains the number of occurrences.
Undirected binary chains. Similar to the binary chains but both predicates point to the same node. In other words, the pattern also represents a star-shaped subgraph but for incoming rather than outgoing predicates and is restricted to two predicates.
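To make the two chain shapes concrete, the following sketch counts matches of one binary chain and one undirected binary chain directly in the data. The namespace and the predicates :worksFor, :locatedIn, and :supplies are hypothetical, chosen only for illustration. The query shows the shape of the data each pattern describes; it is not a description of how Stardog computes or estimates its statistics.

```sparql
prefix : <http://example.org/>   # hypothetical namespace

select ?chains ?undirectedChains {
  # Binary chain: a path of length two, left predicate :worksFor,
  # right predicate :locatedIn, joined on the middle node ?b.
  { select (count(*) as ?chains) {
      ?a :worksFor ?b .
      ?b :locatedIn ?c .
  } }
  # Undirected binary chain: both predicates point to the same node ?b.
  { select (count(*) as ?undirectedChains) {
      ?a :worksFor ?b .
      ?c :supplies ?b .
  } }
}
```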

The statistics are collected over all graphs (named and default) in the data that resides in Stardog itself, i.e. not over data in Virtual Graphs.

RDF data model for statistics

The service uses the following RDF vocabulary for exposing the statistics. The stardog prefix stands for <tag:stardog:api:>.

stardog:statistics:PredicateStatistics: type for predicate statistics.
- stardog:rdf:predicate: specifies the predicate IRI.
- stardog:statistics:count: specifies the number of edges with this predicate in the data.
- stardog:statistics:domain: specifies the size of the predicate's domain.
- stardog:statistics:range: specifies the size of the predicate's range.
- stardog:statistics:in-degree: specifies the average in-degree for the predicate.
- stardog:statistics:out-degree: specifies the average out-degree for the predicate.
- stardog:statistics:chains: links the predicate statistics instance to the resource describing chains (including undirected ones):
  - stardog:rdf:predicate: links the chain instance to the other predicate's IRI.
  - stardog:statistics:count: specifies the number of occurrences of the chain in the data.
  - stardog:statistics:inverted: if true, the chain is undirected, i.e. the other predicate's direction is inverted.
stardog:statistics:SubjectStarStatistics: type for SSG statistics.
- stardog:rdf:predicate: links an SSG resource to a resource representing an outgoing predicate.
- stardog:statistics:name: links a predicate resource to its IRI in the data.
- stardog:statistics:count: specifies the number of occurrences of the SSG in the data when used on the SSG resource, and the count of each predicate within the SSG when used on a predicate resource.
Each type is an instance of rdfs:Class in the output. The type entity additionally uses stardog:statistics:count and stardog:rdf:predicate predicates to link to the estimated number of instances and the list of predicates in the data.
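As a minimal sketch of how this vocabulary maps to a query, the following lists the per-predicate figures described above, most frequent predicates first. It assumes the statistics have already been saved in the named graph <urn:statistics> and that every PredicateStatistics resource carries all five properties; if some may be absent, the corresponding triple patterns would need to be wrapped in optional blocks.

```sparql
prefix stardog: <tag:stardog:api:>

select ?p ?count ?domain ?range ?inDegree ?outDegree
from <urn:statistics> {
  [] a stardog:statistics:PredicateStatistics ;
     stardog:rdf:predicate ?p ;                    # predicate IRI
     stardog:statistics:count ?count ;             # number of edges
     stardog:statistics:domain ?domain ;           # domain size
     stardog:statistics:range ?range ;             # range size
     stardog:statistics:in-degree ?inDegree ;      # average in-degree
     stardog:statistics:out-degree ?outDegree .    # average out-degree
}
order by desc(?count)
```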

Here are examples of querying the RDF representation of the statistics after saving it in <urn:statistics>:

Querying the subject-star subgraph statistics

This query returns a comma-separated list of predicates for each SSG along with the number of occurrences in the data:

prefix stardog: <tag:stardog:api:>

select (group_concat(?pname; separator = ",") as ?predicates) ?subjects from <urn:statistics> {
  select * {
    ?s a stardog:statistics:SubjectStarStatistics ;
       stardog:rdf:predicate/stardog:statistics:name ?p ;
       stardog:statistics:count ?subjects .
    bind(localname(?p) as ?pname)
  } order by ?pname
}
group by ?s ?subjects

Querying the chain statistics

This query returns the chain statistics ordered by the left predicate. For each chain, it returns the pair of predicates and the number of occurrences in the data.

prefix stardog: <tag:stardog:api:>

select ?p ?q ?count from <urn:statistics> {
  [] a stardog:statistics:PredicateStatistics ;
     stardog:rdf:predicate ?p ;
     stardog:statistics:chains [ stardog:rdf:predicate ?q ; stardog:statistics:count ?count ; stardog:statistics:inverted false ]
}
order by ?p
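The same pattern can be adapted for undirected chains. Going by the description of stardog:statistics:inverted in the vocabulary, setting the flag to true should select the chains in which both predicates point to the same node; the sketch below again assumes the statistics are stored in <urn:statistics>.

```sparql
prefix stardog: <tag:stardog:api:>

# Undirected chains only: the other predicate's direction is inverted.
select ?p ?q ?count from <urn:statistics> {
  [] a stardog:statistics:PredicateStatistics ;
     stardog:rdf:predicate ?p ;
     stardog:statistics:chains [ stardog:rdf:predicate ?q ;
                                 stardog:statistics:count ?count ;
                                 stardog:statistics:inverted true ]
}
order by desc(?count)
```

Ordering by descending count puts the most frequent undirected chains first; order by ?p instead to match the listing above.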

Querying for types

The following query returns types with their estimated number of instances and the set of properties:

prefix stardog: <tag:stardog:api:>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?type ?count (group_concat(?pname; separator = ", ") as ?predicates)
from <urn:statistics> {
  select ?type ?pname ?count {
    ?type a rdfs:Class ;
          stardog:statistics:count ?count ;
          stardog:rdf:predicate ?p .
    bind(localname(?p) as ?pname)
  } order by ?pname
}
group by ?type ?count