Aggregate Functions
ZomboDB exposes nearly all of Elasticsearch's aggregates as type-checked SQL functions that return tables and discreet values, as opposed to json blobs.
In all cases, unless explicitly otherwise noted, the results returned from all of the below aggregate functions are MVCC-correct. This means that the functions only operate against records that are considered visible to the current transaction.
Arbitrary Aggregate Support
FUNCTION zdb.arbitrary_agg(
index regclass,
query zdbquery,
agg_json json)
RETURNS json
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
This function is a direct-path for executing any arbitrary aggregate search request that Elasticsearch supports.
The result is a json blob that can be processed in your application code or otherwise manipulated using Postgres json support functions.
Single-Value Aggregates
FUNCTION zdb.avg(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.cardinality(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document.
FUNCTION zdb.count(
index regclass,
query zdbquery)
RETURNS bigint
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html
Not an aggregate per se, this function is mapped to Elasticsearch's _count
endpoint and simply returns the number of
documents that match the provided query. The result is MVCC-correct.
FUNCTION zdb.raw_count(
index regclass,
query zdbquery)
RETURNS bigint SET zdb.ignore_visibility = true
Similar to zdb.count()
above, but it ignores MVCC visibility rules, and the result is the actual count of documents
matching the query, including deleted documents, documents from aborted transactions, old versions of documents from an
UPDATE statement, and new/updated docs from in-flight transactions.
Generally you'll want to use zdb.count()
instead.
FUNCTION zdb.max(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A single-value metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.min(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A single-value metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.missing(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set).
FUNCTION zdb.sum(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.value_count(
index regclass,
field text,
query zdbquery)
RETURNS numeric
A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. These values can be extracted either from specific fields in the documents.
Multi-Row/Column Aggregates
The following aggregates transform the results from Elasticsearch into "tables", and should all be queried as such. For example:
SELECT * FROM zdb.terms('idxproducts', 'tags', dsl.match_all());
FUNCTION zdb.adjacency_matrix(
index regclass,
labels text[],
filters zdbquery[])
RETURNS TABLE (
key text,
doc_count bigint)
A bucket aggregation returning a form of adjacency matrix. The request provides a collection of named filter expressions, similar to the filters aggregation request. Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.
The labels
and filters
arguments must have the same bounds.
FUNCTION zdb.adjacency_matrix_2x2(
index regclass,
labels text[],
filters zdbquery[])
RETURNS TABLE (
"-" text,
"1" text,
"2" text)
This is a table-based version of zdb.adjacency_matrix()
that outputs a 2x2 matrix.
The labels
and filters
arguments must have the same bounds.
FUNCTION zdb.adjacency_matrix_3x3(
index regclass,
labels text[],
filters zdbquery[])
RETURNS TABLE (
"-" text,
"1" text,
"2" text,
"3" text)
This is a table-based version of zdb.adjacency_matrix()
that outputs a 3x3 matrix.
The labels
and filters
arguments must have the same bounds.
FUNCTION zdb.adjacency_matrix_4x4(
index regclass,
labels text[],
filters zdbquery[])
RETURNS TABLE (
"-" text,
"1" text,
"2" text,
"3" text,
"4" text)
This is a table-based version of zdb.adjacency_matrix()
that outputs a 4x4 matrix.
The labels
and filters
arguments must have the same bounds.
FUNCTION zdb.adjacency_matrix_5x5(
index regclass,
labels text[],
filters zdbquery[])
RETURNS TABLE (
"-" text,
"1" text,
"2" text,
"3" text,
"4" text,
"5" text)
This is a table-based version of zdb.adjacency_matrix()
that outputs a 5x5 matrix.
The labels
and filters
arguments must have the same bounds.
FUNCTION zdb.date_histogram(
index regclass,
field text,
query zdbquery,
"interval" text,
format text DEFAULT 'yyyy-MM-dd')
RETURNS TABLE (
key numeric,
key_as_string text,
doc_count bigint)
A multi-bucket aggregation similar to the histogram except it can only be applied on date values. Since dates are represented in Elasticsearch internally as long values, it is possible to use the normal histogram on dates as well, though accuracy will be compromised. The reason for this is in the fact that time based intervals are not fixed (think of leap years and on the number of days in a month). For this reason, we need special support for time based data. From a functionality perspective, this histogram supports the same features as the normal histogram. The main difference is that the interval can be specified by date/time expressions.
FUNCTION zdb.date_range(
index regclass,
field text,
query zdbquery,
date_ranges_array json)
RETURNS TABLE (
key text,
"from" numeric,
from_as_string timestamp with time zone,
"to" numeric,
to_as_string timestamp with time zone,
doc_count bigint)
A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned. Note that this aggregation includes the from value and excludes the to value for each range.
FUNCTION zdb.extended_stats(
index regclass,
field text,
query zdbquery,
sigma int DEFAULT 0)
RETURNS TABLE (
count bigint,
min numeric,
max numeric,
avg numeric,
sum numeric,
sum_of_squares numeric,
variance numeric,
stddev numeric,
stddev_upper numeric,
stddev_lower numeric)
A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.filters(
index regclass,
labels text[],
filters zdbquery[])
RETURNS TABLE (
label text,
doc_count bigint)
Similar to zdb.count()
except multiple queries (filters) are supported.
The labels
and filters
arguments must have the same bounds.
FUNCTION zdb.histogram(
index regclass,
field text,
query zdbquery,
"interval" float8)
RETURNS TABLE (
key numeric,
doc_count bigint)
A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5).
FUNCTION zdb.ip_range(
index regclass,
field text,
query zdbquery,
ip_ranges_array json)
RETURNS TABLE (
key text,
"from" inet,
"to" inet,
doc_count bigint)
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP typed fields.
FUNCTION zdb.matrix_stats(
index regclass,
fields text[],
query zdbquery)
RETURNS TABLE (
name text,
count bigint,
mean numeric,
variance numeric,
skewness numeric,
kurtosis numeric,
covariance json,
correlation json)
The matrix_stats aggregation is a numeric aggregation that computes various statistics over a set of document fields.
FUNCTION zdb.percentile_ranks(
index regclass,
field text,
query zdbquery,
"values" text DEFAULT '')
RETURNS TABLE (
percentile numeric,
value numeric)
A multi-value metrics aggregation that calculates one or more percentile ranks over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.percentiles(
index regclass,
field text,
query zdbquery,
percents text DEFAULT '')
RETURNS TABLE (
percentile numeric,
value numeric)
A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.range(
index regclass,
field text,
query zdbquery,
ranges_array json)
RETURNS TABLE (
key text,
"from" numeric,
"to" numeric,
doc_count bigint)
A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and "bucket" the relevant/matching document. Note that this aggregation includes the from value and excludes the to value for each range.
FUNCTION zdb.significant_terms(
index regclass,
field text,
query zdbquery,
include text DEFAULT '.*'::text,
size_limit integer DEFAULT 2147483647,
min_doc_count integer DEFAULT 3)
RETURNS TABLE (
term text,
doc_count bigint,
score numeric,
bg_count bigint)
An aggregation that returns interesting or unusual occurrences of terms in a set.
FUNCTION zdb.significant_terms_two_level(
index regclass,
first_field text,
second_field text,
query zdbquery,
size bigint DEFAULT 0)
RETURNS TABLE (
first_term text,
second_term text,
doc_count bigint,
score numeric,
bg_count bigint,
doc_count_error_upper_bound bigint,
sum_other_doc_count bigint)
An adaption of zdb.significant_terms()
where it uses zdb.terms()
for the terms from first_field
and
zdb.significant_terms()
for the terms for second_field
.
FUNCTION zdb.significant_text(
index regclass,
field text,
query zdbquery,
sample_size int DEFAULT 0,
filter_duplicate_text boolean DEFAULT true)
RETURNS TABLE (
term text,
doc_count bigint,
score numeric,
bg_count bigint)
An aggregation that returns interesting or unusual occurrences of free-text terms in a set. It is like the significant terms aggregation but differs in that:
- It is specifically designed for use on type text fields
- It does not require field data or doc-values
- It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.
This aggregate is only supported by Elasticsearch 6+ clusters.
FUNCTION zdb.suggest_terms(
index regclass,
field_name text,
suggest test,
query zdbquery,
) RETURNS TABLE (
term text,
offset bigint,
length bigint,
suggestion text,
score double precision,
frequency bigint,
)
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html
While not necessarily an aggregate, zdb.terms_suggester
will tokenize an input textual suggestion string and provide
suggestions for each token that contains suggestions.
Useful for correcting misspellings -- ie, "Did you mean?"-style queries
FUNCTION zdb.stats(
index regclass,
field text,
query zdbquery)
RETURNS TABLE (
count bigint,
min numeric,
max numeric,
avg numeric,
sum numeric)
A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents.
FUNCTION zdb.tally(
index regclass,
field_name text,
[ is_nested bool],
stem text,
query ZDBQuery,
size_limit integer DEFAULT '2147483647',
order_by TermsOrderBy DEFAULT 'count',
shard_size integer DEFAULT '2147483647',
count_nulls bool DEFAULT 'true'
) RETURNS TABLE (term text, count bigint)
index
: The name of the a ZomboDB index to query\
field)_name
: The name of a field from which to derive
terms\
is_nested
: Optional argument to indicate that the terms should only come from matching nested object
sub-elements. Default is false
\
stem
: a Regular expression by which to filter returned terms, or a date interval if
the specified fieldname
is a date or timestamp\
query
: a ZomboDB query\
size_limit
: maximum number of terms to
return. A NULL value means "all terms". order_by
: how to sort the results. one of 'count'
(descending), 'term'
,
'reverse_count'
(ascending), 'reverse_term'
\
shard_size
: optional parameter that tells Elasticsearch how many
terms to return from each shard. Default is zero, which means all terms\
count_nulls
: should a row containing the
count of NULL (ie, missing) values be included in the results?
This function provides direct access to Elasticsearch's
terms aggregate
and cannot be used with fields of type fulltext
. The results are MVCC-safe. Returned terms are forced to upper-case.
If a stem is not specified, no results will be returned. To match all terms, use a stem of ^.*
The order_by
parameter defaults to count
, which sorts documents by the occurrence count, largest to smallest. A
value of reverse_count
will sort them smallest to largest.
If the specified fieldname
is a date/timestamp, then one of the following values are allowed for aggregating values
into histogram buckets of the specified interval: year, quarter, month, week, day, hour, minute, second
. In all cases,
an optional offset value can be specified. For example: week:-1d
will offset the dates by one day so that the first
day of the week will be considered to be Sunday (instead of the default of Monday).
Example:
SELECT * FROM zdb.tally('products', 'keywords', '^.*', 'base* or distance', 5000, 'term');
term | count
>---------------+-------
BASEBALL | 1
COMMUNICATION | 1
PRIMITIVE | 1
SPORTS | 1
THOMAS EDISON | 1
Regarding the is_nested
argument, consider data like this:
row #1: contributor_data=[
{ "name": "John Doe", "age": 42, "location": "TX", "tags": ["active"] },
{ "name": "Jane Doe", "age": 36, "location": "TX", "tags": ["nice"] }
>]
>
>row #2: contributor_data=[
{ "name": "Bob Dole", "age": 92, "location": "KS", "tags": ["nice", "politician"] },
{ "name": "Elizabth Dole", "age": 79, "location": "KS", "tags": ["nice"] }
>]
And a query where is_nested
is false:
SELECT * FROM zdb.tally('idxproducts', 'contributor_data.name', false, '^.*', 'contributor_data.location:TX AND contributor_data.tags:nice', 5000, 'term');
returns:
term | count
----------+-------
JANE DOE | 1
JOHN DOE | 1
(2 rows)
Whereas, if
is_nested
is true, only "JANE DOE" is returned because it's the only subelement ofcontributor_data
that matched the query:
SELECT * FROM zdb.tally('idxproducts', 'contributor_data.name', true, '^.*', 'contributor_data.location:TX WITH contributor_data.tags:nice', 5000, 'term');
returns:
term | count
----------+-------
JANE DOE | 1
(1 row)
CREATE TYPE TermsOrderBy AS ENUM (
'count',
'term',
'reverse_count',
'reverse_term');
FUNCTION zdb.terms(
index regclass,
field text,
query zdbquery,
size_limit bigint DEFAULT 0,
order_by TermsOrderBy DEFAULT 'count')
RETURNS TABLE (
term text,
doc_count bigint)
A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
Note that the order_by
argument defines how to sort the results:
'count'
(ascending),'reverse_count'
(ascending),'term'
(ascending),'reverse_term'
(descending)
FUNCTION zdb.terms_array(
index regclass,
field text,
query zdbquery,
size_limit bigint DEFAULT 0,
order_by TermsOrderBy DEFAULT 'count')
RETURNS text[]
A version of zdb.terms()
that instead returns only the terms as a text[]
.
FUNCTION zdb.terms_two_level(
index regclass,
first_field text,
second_field text,
query zdbquery,
order_by TwoLevelTermsOrderBy DEFAULT 'count',
size bigint DEFAULT 0)
RETURNS TABLE (
first_term text,
second_term text,
doc_count bigint)
Similar to zdb.significant_terms_two_level()
, this is an adaption of zdb.terms()
to provide a two-level nested
hierarchy of terms from two different fields.
FUNCTION zdb.top_hits(
index regclass,
fields text[],
query zdbquery,
size int)
RETURNS TABLE (
ctid tid,
score float4,
source json)
A top_hits metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket.
FUNCTION zdb.top_hits_with_id(
index regclass,
fields text[],
query zdbquery,
size int)
RETURNS TABLE (
_id text,
score float4,
source json)
Similar to zdb.top_hits()
above, but returns the Elasticsearch document _id
value for each hit rather than the
corresponding Postgres ctid
value.