Skip to content

ES Agg pushdown v2 #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 21, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions pg_es_fdw/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -319,6 +319,12 @@ def _handle_aggregation_response(self, query, response, aggs, group_clauses):
result = {}

for agg_name in aggs:
if agg_name == "count.*":
# COUNT(*) is a special case, since it doesn't have a
# corresponding aggregation primitice in ES
result[agg_name] = response["hits"]["total"]["value"]
continue

result[agg_name] = response["aggregations"][agg_name]["value"]
yield result
else:
Expand All @@ -331,6 +337,12 @@ def _handle_aggregation_response(self, query, response, aggs, group_clauses):

if aggs is not None:
for agg_name in aggs:
if agg_name == "count.*":
# In general case with GROUP BY clauses COUNT(*)
# is taken from the bucket's doc_count field
result[agg_name] = bucket["doc_count"]
continue

result[agg_name] = bucket[agg_name]["value"]

yield result
Expand Down
40 changes: 25 additions & 15 deletions pg_es_fdw/_es_query.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
"min": "min",
"sum": "sum",
"count": "value_count",
"count.*": None # not mapped to a particular function
}


Expand All @@ -43,7 +44,7 @@ def _base_qual_to_es(col, op, value, column_map=None):
return {"bool": {"must_not": {"term": {col: value}}}}

if op == "~~":
return {"match": {col: value.replace("%", "*")}}
return {"wildcard": {col: value.replace("%", "*")}}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would handle underscores (single-character match) and escapes (e.g. \% matches %), so something like:

  • % (when not preceded by a single \) becomes *
  • _ (when not preceded by a single \) becomes ?
  • \% becomes % (since it has no special meaning in ES wildcards)
  • \_ becomes _ (same)

It does become complex though. I'd be happy with us returning false positives (since PG will recheck and filter them out) but I don't know how to process the input LIKE pattern to get to a pattern that definitely over-matches instead of possibly under-matching (e.g. right now the x LIKE 'a_ple' won't match apple AFAIU)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think this should do it, let me add a couple of tests in the engine repo.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think of it, I need to handle case like \\%(becomes \*), and \\\% (becomes \%?).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • \\% in PG patterns means "backslash + any characters", so it'd become \\*
  • \\\% means "backslash + %", so it'd become \\% (since % has no special meaning)

Copy link
Author

@gruuya gruuya Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, this should now be covered as well. Here are the corresponding tests.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test / check that this escapes special ES pattern characters, e.g.:

  • * becomes \*
  • ? becomes \?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, added a fix and a test for these cases too.


# For unknown operators, get everything
return {"match_all": {}}
Expand Down Expand Up @@ -82,6 +83,18 @@ def quals_to_es(
"""Convert a list of Multicorn quals to an ElasticSearch query"""
ignore_columns = ignore_columns or []

query = {
"query": {
"bool": {
"must": [
_qual_to_es(q, column_map)
for q in quals
if q.field_name not in ignore_columns
]
}
}
}

# Aggregation/grouping queries
if aggs is not None:
aggs_query = {
Expand All @@ -91,10 +104,18 @@ def quals_to_es(
}
}
for agg_name, agg_props in aggs.items()
if agg_name != "count.*"
}

if group_clauses is None:
return {"aggs": aggs_query}
if "count.*" in aggs:
# There is no particular COUNT(*) equivalent in ES, instead
# for plain aggregations (e.g. no grouping statements), we need
# to enable the track_total_hits option in order to get an
# accuate number of matched docs.
query["track_total_hits"] = True

query["aggs"] = aggs_query

if group_clauses is not None:
group_query = {
Expand All @@ -111,17 +132,6 @@ def quals_to_es(
if aggs is not None:
group_query["group_buckets"]["aggregations"] = aggs_query

return {"aggs": group_query}
query["aggs"] = group_query

# Regular query
return {
"query": {
"bool": {
"must": [
_qual_to_es(q, column_map)
for q in quals
if q.field_name not in ignore_columns
]
}
}
}
return query