This page discusses the Sampling Service, which allows for random sampling without replacement from the set of result triples matched by a SPARQL triple pattern.
Sampling without replacement is useful for training and testing ML models, data exploration, and visualization.
An example of using the sampling service follows. The query samples triples matching the pattern `?resource a ?resourceType`, and `smp:size` sets the desired sample size.
prefix smp: <tag:stardog:api:sample:>
SELECT ?resource {
service <tag:stardog:api:sample> {
?resource a ?resourceType .
[] smp:size 10000 ;
}
}
There are some preconditions to using the sampling service in a sensible way.
The database should have been optimized with `stardog-admin db optimize`, or the data should have been imported during database creation.
Best results are obtained when the query results are unlikely to change whether the full set of matches for the triple pattern is used
or just a sample. For example, consider getting the distinct outgoing predicates of `:Product` instances. We assume the
number of distinct predicates is low, so a good sample of `:Product` instances is enough to start from.
prefix smp: <tag:stardog:api:sample:>
SELECT DISTINCT ?predicate {
service <tag:stardog:api:sample> {
?resource a :Product .
[] smp:size 10000 ;
}
?resource ?predicate []
}
Another example, on a LUBM-generated dataset:
prefix lubm: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
prefix smp: <tag:stardog:api:sample:>
select distinct ?p1 ?p2 where {
service <tag:stardog:api:sample> {
?start a lubm:GraduateStudent .
[] smp:size 1000
}
?start ?p1 ?middle . ?middle ?p2 ?end .
?end a lubm:University
}
limit 100
Several parameters influence how sampling is performed. Sampling occurs
in one of two modes: `random` or `fast`. By default, the `random` mode is used.
The sampling service is designed so that it can be an order of magnitude faster than a full index scan with reservoir sampling. It achieves this by scanning only a fraction of the actual dataset: the service may skip over data files created by the Stardog storage engine. As a result, the statistical quality of the sample may depend on how the indexed triples are distributed across the data files, and it can be improved by tuning the parameters away from their defaults.
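For context, the baseline mentioned above (a full scan with reservoir sampling) can be sketched in a few lines. This is the textbook Algorithm R, shown only to illustrate why that baseline must touch every matching triple; it is not Stardog's code, and the function name `reservoir_sample` is made up for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items without replacement, in a single pass.

    Textbook Algorithm R. Every item of `stream` is visited, which is
    the cost a file-skipping sampler tries to avoid.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))
```

The loop runs once per input item regardless of `k`, so the cost grows with the number of matched triples rather than with the sample size.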
The sampling service first selects a random subset from the underlying data files.
The number of included data files can be controlled via the ratio parameter. This parameter
specifies the probability with which each underlying file is included.
The valid range for this parameter is (0.0, 1.0].
In the default `random` mode, every triple in the selected data files
is included in the sample with probability p(include) = sampleSize / totalTriples.
Here totalTriples is an estimate of the number of triples matched by the enclosed triple pattern.
It is estimated from internal statistics, but a hint may be provided via the `total` option.
Alternatively, in `fast` mode, only the first k triples of each data file are included, such that
the sum over all k equals the specified sample size. The quality of the sample may be poor in this mode, but
it can still be sufficient for some applications.
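The two-stage procedure described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Stardog's implementation: data files are modeled as plain Python lists, all function names are hypothetical, and the even split of k across files in fast mode is an assumption (this page only says the k values sum to the sample size).

```python
import random


def select_files(files, ratio, rng):
    """Stage 1: keep each data file independently with probability `ratio`."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    return [f for f in files if rng.random() < ratio]


def sample_random(files, size, total, ratio, rng):
    """Random mode: scan every triple of the selected files and keep
    each one with probability p = size / total."""
    p = min(1.0, size / total)
    return [t for f in select_files(files, ratio, rng)
            for t in f if rng.random() < p]


def sample_fast(files, size, ratio, rng):
    """Fast mode: read only the first k triples of each selected file.

    ASSUMPTION: k is split evenly across the selected files. A file
    shorter than its k simply contributes fewer triples.
    """
    chosen = select_files(files, ratio, rng)
    if not chosen:
        return []
    base, extra = divmod(size, len(chosen))
    out = []
    for i, f in enumerate(chosen):
        k = base + (1 if i < extra else 0)
        out.extend(f[:k])
    return out


# Ten toy "files" of 100 triples each, labeled (file index, position).
files = [[(fi, pos) for pos in range(100)] for fi in range(10)]
rng = random.Random(42)
print(len(sample_random(files, size=200, total=1000, ratio=0.5, rng=rng)))
print(len(sample_fast(files, size=200, ratio=0.5, rng=rng)))
```

In the model, fast mode only ever returns triples from the front of each file, which is why its sample quality depends on how the triples are laid out.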
Example:
prefix smp: <tag:stardog:api:sample:>
SELECT ?resource {
SERVICE <tag:stardog:api:sample> {
?resource a ?resourceType .
[] smp:size 100000 ;
smp:ratio 0.75 ;
smp:mode "fast" .
}
}
Only single triple patterns are supported inside service <tag:stardog:api:sample>;
arbitrary Basic Graph Patterns (BGPs) are not supported at this time.
| Option name | Default | Description |
|---|---|---|
| size | required | Desired sample size to return |
| ratio | 0.5 | Ratio of underlying storage files to include. Value 1.0 includes all files |
| mode | random | `random` scans all entries; `fast` scans just the beginning of the underlying files |
| total | 0 | Hint for the total number of triples matched. A zero or negative value makes the service use internal statistics |
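As a usage aid, the options in the table can be assembled into a query string programmatically. The helper below is a hypothetical sketch, not part of any Stardog client library; it only builds SPARQL text and checks the ranges stated on this page. It assumes the `total` option is written as `smp:total`, following the `smp:` naming of the other options.

```python
def build_sample_query(pattern, var, size, ratio=None, mode=None, total=None):
    """Build a sampling-service query for a single triple pattern.

    Hypothetical helper: it validates the documented ranges and emits
    only the options that were explicitly given.
    """
    if not isinstance(size, int) or size <= 0:
        raise ValueError("size is required and must be a positive integer")
    if ratio is not None and not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    if mode is not None and mode not in ("random", "fast"):
        raise ValueError('mode must be "random" or "fast"')

    options = [f"smp:size {size}"]
    if ratio is not None:
        options.append(f"smp:ratio {ratio}")
    if mode is not None:
        options.append(f'smp:mode "{mode}"')
    if total is not None:
        # ASSUMPTION: the `total` hint uses the same smp: prefix.
        options.append(f"smp:total {total}")
    option_lines = " ;\n       ".join(options)

    return (
        "prefix smp: <tag:stardog:api:sample:>\n"
        f"SELECT {var} {{\n"
        "  SERVICE <tag:stardog:api:sample> {\n"
        f"    {pattern} .\n"
        f"    [] {option_lines} .\n"
        "  }\n"
        "}"
    )


query = build_sample_query("?resource a ?resourceType", "?resource",
                           size=1000, ratio=0.75, mode="fast")
print(query)
```

Rejecting out-of-range values before the query is sent keeps mistakes such as a `ratio` of 0 from reaching the server.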