Sampling service

This page discusses the Sampling Service, which allows random sampling without replacement from the set of results matched by a SPARQL triple pattern.


Overview

The Sampling Service allows for random sampling from the set of results matched by a particular SPARQL triple pattern. Sampling without replacement is useful for training and testing ML models, data exploration and visualization.

An example of using the sampling service follows below.

The sample is always a subset of the results matched by the enclosed triple pattern, i.e. ?resource a ?resourceType in the following example.
We do not guarantee any particular sampling distribution, e.g. uniform or Gaussian.
Returned sample size may be smaller than the value of smp:size.

prefix smp: <tag:stardog:api:sample:>
SELECT ?resource {
    service <tag:stardog:api:sample> {
          ?resource a ?resourceType .
          [] smp:size 10000 ;
    }
}

There are some preconditions to using the sampling service in a sensible way.

The data is well compacted on disk, i.e. stardog-admin db optimize has been run or the data was imported during database creation.
The sample size is much smaller than the total amount of data matched by the triple pattern.
Sampling only needs to read a subset of underlying data files to produce results.

Best results are obtained when the query results are unlikely to change whether the full triple pattern or just a sample of it is used. For example, consider getting the distinct outgoing predicates for :Product instances. We assume the number of distinct predicates is low, so a good sample of :Product instances is enough to start from.

prefix smp: <tag:stardog:api:sample:>
SELECT DISTINCT ?predicate {
    service <tag:stardog:api:sample> {
          ?resource a :Product .
          [] smp:size 10000 ;
    }
    ?resource ?predicate []
}

Another example, on a LUBM-generated dataset:

prefix lubm: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
prefix smp: <tag:stardog:api:sample:>

select distinct ?p1 ?p2 where {
  service <tag:stardog:api:sample> {
    ?start a lubm:GraduateStudent .
    [] smp:size 1000
  }
  ?start ?p1 ?middle . ?middle ?p2 ?end .
  ?end a lubm:University
}
limit 100

Internals

There are several parameters which may influence how sampling is performed. Sampling occurs in one of two modes: random or fast. By default the random mode is used.

The sampling service is designed so that it can be an order of magnitude faster than a full index scan with reservoir sampling. It achieves this by scanning only a fraction of the actual dataset: it may skip over entire data files created by the Stardog storage engine. As a result, the statistical quality of the sample may depend on how the indexed triples are distributed across the data files, and it can be improved by tuning the default parameters.
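For comparison, the baseline mentioned above (a full scan combined with reservoir sampling) can be sketched as follows. This is the textbook single-pass algorithm (Algorithm R) written in Python, not Stardog code; the function name reservoir_sample is purely illustrative.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Uniformly sample up to k items without replacement in a single pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

The loop has to visit every matched item, which is exactly the cost the sampling service tries to avoid by skipping data files.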

The sampling service first selects a random subset from the underlying data files. The number of included data files can be controlled via the ratio parameter. This parameter specifies the probability with which each underlying file is included. The valid range for this parameter is (0.0, 1.0].

In the default random mode, every triple in the selected data files is included in the sample with probability p(include) = sampleSize / totalTriples. Here totalTriples is an estimate of the number of triples matched by the enclosed triple pattern. It is derived from internal statistics unless a hint is provided via the total parameter. Alternatively, in fast mode, only the first k triples of each data file are included, such that the sum over all k equals the specified sample size. The quality of such a sample may be poor, but it can still be sufficient for some applications.
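The two stages described above can be modelled with a small Python simulation. This is a conceptual sketch under stated assumptions, not the actual implementation: data files are modelled as plain lists, all function names are hypothetical, and the even split of the sample size across files in fast mode is an assumption (the text only says that the k values sum to the sample size).

```python
import random
from typing import List, Sequence

Triple = str  # stand-in for a matched triple


def select_files(files: Sequence[Sequence[Triple]], ratio: float,
                 rng: random.Random) -> List[Sequence[Triple]]:
    """Stage 1: keep each data file independently with probability `ratio`."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    # rng.random() is in [0.0, 1.0), so ratio == 1.0 keeps every file.
    return [f for f in files if rng.random() < ratio]


def sample_random(files: Sequence[Sequence[Triple]], size: int, total: int,
                  rng: random.Random) -> List[Triple]:
    """Random mode: include each triple with p = size / total (total > 0)."""
    p = min(1.0, size / total)
    return [t for f in files for t in f if rng.random() < p]


def sample_fast(files: Sequence[Sequence[Triple]], size: int) -> List[Triple]:
    """Fast mode: take the first k triples per file; the k values sum to `size`."""
    if not files:
        return []
    # Assumption: split the sample size as evenly as possible across files.
    base, extra = divmod(size, len(files))
    sample: List[Triple] = []
    for i, f in enumerate(files):
        k = base + (1 if i < extra else 0)
        sample.extend(f[:k])  # a short file contributes fewer than k triples
    return sample


if __name__ == "__main__":
    rng = random.Random(42)
    files = [[f"t{n}_{i}" for i in range(1000)] for n in range(8)]
    chosen = select_files(files, ratio=0.75, rng=rng)
    print(len(sample_random(chosen, size=100, total=8000, rng=rng)))
    print(len(sample_fast(chosen, size=100)))
```

In this model the random mode touches every triple of the selected files, while the fast mode reads only a prefix of each file, which mirrors the speed versus quality trade-off between the two modes.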

Example:

prefix smp: <tag:stardog:api:sample:>
SELECT ?resource {
    SERVICE <tag:stardog:api:sample> {
          ?resource a ?resourceType .
          [] smp:size 100000 ;
             smp:ratio 0.75 ;
             smp:mode "fast".
    }
}

Limitations

Only single triple patterns are supported inside service <tag:stardog:api:sample>; arbitrary Basic Graph Patterns (BGPs) are not supported at this time.

Parameter Table

| Option name | Default | Description |
|-------------|---------|-------------|
| size | required | Desired sample size to return |
| ratio | 0.5 | Ratio of underlying storage files to include. A value of 1.0 includes all files |
| mode | `random` | `random` scans all entries, `fast` scans just the beginning of the underlying files |
| total | 0 | Hint for the total number of triples matched. A zero or negative value makes the service use internal statistics |
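As a convenience, the options in the table can be validated on the client side before a query is sent. The following Python helper is a hypothetical sketch, not part of any Stardog client library. It assumes every option is written as a predicate in the smp: namespace; smp:size, smp:ratio and smp:mode appear in the examples on this page, while the spelling smp:total is inferred from the option name in the table and should be treated as an assumption.

```python
SAMPLE_SERVICE = "tag:stardog:api:sample"  # service IRI used on this page


def build_sample_query(pattern: str, variable: str, size: int,
                       ratio: float = 0.5, mode: str = "random",
                       total: int = 0) -> str:
    """Render a sampling query for a single triple pattern (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size is required and must be positive")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    if mode not in ("random", "fast"):
        raise ValueError("mode must be 'random' or 'fast'")

    options = [f"smp:size {size}", f"smp:ratio {ratio}", f'smp:mode "{mode}"']
    if total > 0:
        # Assumption: the hint is passed as smp:total; when it is omitted,
        # the service falls back to its internal statistics.
        options.append(f"smp:total {total}")
    option_block = " ;\n             ".join(options)

    return (
        f"prefix smp: <{SAMPLE_SERVICE}:>\n"
        f"SELECT {variable} {{\n"
        f"    SERVICE <{SAMPLE_SERVICE}> {{\n"
        f"          {pattern} .\n"
        f"          [] {option_block} .\n"
        f"    }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    print(build_sample_query("?resource a ?resourceType", "?resource",
                             size=100000, ratio=0.75, mode="fast"))
```

The defaults mirror the table, so calling the helper with only a pattern, a variable and a size produces a query equivalent to relying on the service defaults.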