ESQL: Add MATCHING_ROW and VALUE_AT #106152

nik9000 · 2024-03-09T20:19:23Z

This adds two functions: MATCHING_ROW and VALUE_AT. MATCHING_ROW takes pairs of values, where the second value in each pair must be a constant, and matches the variable value to its offset in the constant value. It looks like:

  FROM inventory
| EVAL r=MATCHING_ROW(size, ["XS", "S", "M", "L", "XL"])

That'd generate these hypothetical results:

     Cool-Shirt |  20.00 | XL | 4
Expensive-Shirt | 120.00 | XL | 4
     Cool-Shirt |  20.00 |  S | 1
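
To make the semantics concrete, here is a minimal Python sketch of what MATCHING_ROW computes per row. It models the behavior described above and is not the PR's implementation; the name `matching_row` is hypothetical, and returning `None` for a value that isn't in the constant is an assumption, since the description doesn't say what a miss produces.

```python
def matching_row(value, constants):
    """Return the zero-based offset of value in constants, or None if absent.

    Returning None for a miss is an assumption; the PR text does not specify it.
    """
    # A real engine would build this table once per constant list;
    # the sketch rebuilds it on every call to stay short.
    offsets = {c: i for i, c in enumerate(constants)}
    return offsets.get(value)


sizes = ["XS", "S", "M", "L", "XL"]
rows = [
    ("Cool-Shirt", 20.00, "XL"),
    ("Expensive-Shirt", 120.00, "XL"),
    ("Cool-Shirt", 20.00, "S"),
]
# Append the matched offset as a new column, like EVAL r=MATCHING_ROW(size, ...).
with_r = [row + (matching_row(row[2], sizes),) for row in rows]
```

The last element of each tuple in `with_r` corresponds to the `r` column in the hypothetical results above.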

VALUE_AT takes an index and an array of values and returns the value at that offset. So:

  FROM employees
| EVAL languages_word = VALUE_AT(languages, ["zero", "one", "two", "three", "four"])
| SORT emp_no
| LIMIT 4
| KEEP first_name, languages, languages_word

Would make:

Georgi             |                 2 | two
Bezalel            |                 5 | null
Parto              |                 4 | four
Chirstian          |                 5 | null
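
A similar Python sketch for VALUE_AT, again only a model of the behavior shown in the table above. The out-of-range case returning null comes from the example (5 maps to null); treating a null or negative index the same way is an assumption, and `value_at` is a hypothetical name.

```python
def value_at(index, values):
    """Return values[index], or None when the index is missing or out of range."""
    # Null and negative indexes yielding None is an assumption of this sketch.
    if index is None or index < 0 or index >= len(values):
        return None
    return values[index]


words = ["zero", "one", "two", "three", "four"]
languages = [2, 5, 4, 5]  # the `languages` column from the four rows above
languages_word = [value_at(n, words) for n in languages]
```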

You can combine them together:

  FROM inventory
| EVAL r=MATCHING_ROW(size, ["XS", "S", "M", "L", "XL"])
| EVAL avg_price=VALUE_AT(r, [null, 20.00, null, null, 70.00])
| DROP r
| WHERE price > avg_price

Which would yield:

Expensive-Shirt | 120.00 | XL | 70.00

If that looks familiar then you've been paying close attention! It's another join strategy, specifically one that makes sense when the data you are joining against is small. That is precisely what should happen for the INLINESTATS command, which we added to the grammar a long time ago but never implemented in the engine.
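
Why the combination amounts to a join against a small table can be sketched in Python. This is an illustration under assumptions, not the engine's code: the two parallel lists play the role of the small side of the join, `lookup_avg_price` is a hypothetical helper, and rows whose looked-up value is null are dropped on the assumption that a comparison against null is not true.

```python
# The "small side" of the join, stored as two parallel constant arrays.
sizes = ["XS", "S", "M", "L", "XL"]
avg_prices = [None, 20.00, None, None, 70.00]

inventory = [
    {"name": "Cool-Shirt", "price": 20.00, "size": "XL"},
    {"name": "Expensive-Shirt", "price": 120.00, "size": "XL"},
    {"name": "Cool-Shirt", "price": 20.00, "size": "S"},
]


def lookup_avg_price(size):
    """MATCHING_ROW followed by VALUE_AT: key -> offset -> joined value."""
    if size not in sizes:
        return None  # no matching row on the small side (assumed behavior)
    r = sizes.index(size)  # MATCHING_ROW step
    return avg_prices[r]   # VALUE_AT step


result = []
for row in inventory:
    avg = lookup_avg_price(row["size"])
    # WHERE price > avg_price; a null avg_price never passes the filter.
    if avg is not None and row["price"] > avg:
        result.append({**row, "avg_price": avg})
```

Only the Expensive-Shirt row survives the filter, matching the result shown above.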


nik9000

A guide for those looking: MATCHING_ROW is presently only implemented for a few types, but it's designed to delegate to BlockHash, which already has code to turn values into ordinals and to resolve values to those ordinals. BlockHash is designed to solve the harder problem of turning streams of blocks into ordinals. I just needed to add the ability to look up values instead of adding them.

VALUE_AT abuses our syntax for multivalue fields to get its constants parsed as arrays. I then convert them to Blocks when building the executor factory. That makes it easy to copy values.
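
The difference between adding values and looking them up can be sketched with a toy class in Python. `OrdinalHash` is a hypothetical stand-in and does not reflect the real BlockHash API; it only shows that `add` assigns new ordinals while `lookup` resolves against existing ones and leaves the table untouched. Returning `None` on a miss is an assumption.

```python
class OrdinalHash:
    """Toy ordinal-assigning hash; a stand-in, not the real BlockHash API."""

    def __init__(self):
        self._ords = {}

    def add(self, values):
        """Assign the next free ordinal to each unseen value; return all ordinals."""
        return [self._ords.setdefault(v, len(self._ords)) for v in values]

    def lookup(self, values):
        """Resolve values to existing ordinals without adding; None on a miss."""
        return [self._ords.get(v) for v in values]


h = OrdinalHash()
h.add(["XS", "S", "M", "L", "XL"])      # seed the table from the constant
found = h.lookup(["XL", "S", "XXL"])    # "XXL" is not added by a lookup
```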

Now! Problems:

1. I'm abusing the multivalue parsing syntax.
2. I don't perform memory tracking on the blocks.
3. I'd like to be able to push blocks in those parameters rather than a List. That'd save a lot of memory.
4. Obviously, I've not implemented INLINESTATS, just the data-node side of it.
5. For this to work for INLINESTATS the MATCHING_ROW function needs to match all values of all columns, so MATCHING_ROW([1, 2], [1, 2, 3]) will return [0, 1]. But this will get multiplicative when combining more than one field. How do we make sure not to make huge Blocks?
6. Do we want to expose these to people as functions or hide them as details of the INLINESTATS command? I could hide them behind a pragma for now so we don't have to make a choice. I do want to test them as individual functions, kind of like I've done here.
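
The multiplicative growth raised above for multivalued fields can be illustrated with a short Python sketch. It assumes that matching several multivalued fields means producing every combination of their matched offsets; `matching_rows` is a hypothetical name, and skipping values that don't match is also an assumption.

```python
from itertools import product


def matching_rows(values, constants):
    """Offsets of every value of a multivalued field (misses skipped, an assumption)."""
    return [constants.index(v) for v in values if v in constants]


a = matching_rows([1, 2], [1, 2, 3])  # the example from the list above
b = matching_rows(["S", "M", "L"], ["XS", "S", "M", "L", "XL"])

# Combining fields: one entry per pair of offsets, so sizes multiply.
combos = list(product(a, b))
```

With two values in one field and three in the other, `combos` already holds 2 × 3 entries; each extra multivalued field multiplies the count again.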

nik9000 · 2024-03-11T15:54:12Z

5. For this to work for INLINESTATS the MATCHING_ROW function needs to match all values of all columns, so MATCHING_ROW([1, 2], [1, 2, 3]) will return [0, 1]. But this will get multiplicative when combining more than one field. How do we make sure not to make huge Blocks?

I think this isn't true. At least for now, I think we're better off doing our standard thing and only supporting single-valued fields for MATCHING_ROW; the INLINESTATS implementation can add an MV_EXPAND operation. Those are free for single-valued fields, and they protect against the combinatorial explosion.
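
A rough Python sketch of that alternative: expand multivalued rows first, then run a single-valued match on each resulting row. `mv_expand` here is a simplified model of the MV_EXPAND idea and `matching_row` is a hypothetical helper; neither is the engine's code.

```python
def mv_expand(rows, column):
    """Yield one row per value in `column`; single-valued rows pass through as-is."""
    for row in rows:
        value = row[column]
        if isinstance(value, list):
            for v in value:
                yield {**row, column: v}
        else:
            yield row  # nothing to copy for a single-valued row


def matching_row(value, constants):
    """Single-valued match: offset of value in constants, or None (assumed) on a miss."""
    return constants.index(value) if value in constants else None


rows = [{"id": "a", "x": [1, 2]}, {"id": "b", "x": 3}]
expanded = list(mv_expand(rows, "x"))
ordinals = [matching_row(r["x"], [1, 2, 3]) for r in expanded]
```

After expansion every row carries exactly one value, so the match yields exactly one offset per row.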

nik9000 · 2024-04-30T16:12:41Z

Replaced by the hash lookup and column lookup operators I've recently added.

elasticsearchmachine added the v8.14.0 label Mar 9, 2024

nik9000 mentioned this pull request Mar 19, 2024

Optimize request fetching for filters and filter aggs elastic/kibana#136796

Open

nik9000 mentioned this pull request Apr 10, 2024

ESQL: Support provided table for enrich #107306

Closed

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024

nik9000 removed the v8.15.0 label Apr 30, 2024

nik9000 closed this Jul 2, 2024
