ESQL: Add MATCHING_ROW and VALUE_AT #106152

Closed

nik9000
Member

@nik9000 nik9000 commented Mar 9, 2024

This adds two functions: MATCHING_ROW and VALUE_AT. MATCHING_ROW takes pairs of values; the second value in each pair must always be a constant, and the function matches the variable value to its offset in the constant value. It looks like:

  FROM inventory
| EVAL r=MATCHING_ROW(size, ["XS", "S", "M", "L", "XL"])

That'd generate these hypothetical results:

     Cool-Shirt |  20.00 | XL | 4
Expensive-Shirt | 120.00 | XL | 4
     Cool-Shirt |  20.00 |  S | 1
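
To make the lookup semantics concrete, here is a minimal Python sketch. It is not the PR's Java implementation. The name `matching_row` and the choice to return `None` (standing in for ESQL `null`) when a value is missing from the constant list are assumptions; the description above does not say what happens to unmatched values.

```python
from typing import Optional, Sequence


def matching_row(value: Optional[str], constants: Sequence[str]) -> Optional[int]:
    """Return the offset of `value` within `constants`, or None if absent.

    Assumption: null or unmatched inputs map to None.
    """
    # Map each constant to its offset. This sketch rebuilds the table on
    # every call; because the second argument is constant, an engine could
    # build it once per query instead.
    offsets = {constant: index for index, constant in enumerate(constants)}
    if value is None:
        return None
    return offsets.get(value)


sizes = ["XS", "S", "M", "L", "XL"]
inventory = [
    ("Cool-Shirt", 20.00, "XL"),
    ("Expensive-Shirt", 120.00, "XL"),
    ("Cool-Shirt", 20.00, "S"),
]
# Equivalent of: EVAL r=MATCHING_ROW(size, [...])
with_r = [(name, price, size, matching_row(size, sizes)) for name, price, size in inventory]
```

The `r` column of `with_r` carries the same offsets as the hypothetical results above.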

VALUE_AT takes an index and an array of values and returns the value at that offset. So:

  FROM employees
| EVAL languages_word = VALUE_AT(languages, ["zero", "one", "two", "three", "four"])
| SORT emp_no
| LIMIT 4
| KEEP first_name, languages, languages_word

Would make:

Georgi             |                 2 | two
Bezalel            |                 5 | null
Parto              |                 4 | four
Chirstian          |                 5 | null
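
The same behavior can be sketched in Python. This is an illustration only, not the PR's code: `value_at` is a hypothetical name, and returning `None` for negative or null indexes is an assumption. The out-of-range case follows the `null` rows shown above.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def value_at(index: Optional[int], values: Sequence[Optional[T]]) -> Optional[T]:
    """Return values[index], or None when the index is null or out of range."""
    # Assumption: negative indexes are treated as out of range rather than
    # wrapping around the way Python list indexing normally does.
    if index is None or index < 0 or index >= len(values):
        return None
    return values[index]


words = ["zero", "one", "two", "three", "four"]
languages = [2, 5, 4, 5]  # the `languages` column of the four rows above
languages_word = [value_at(n, words) for n in languages]
```

An index of 5 falls past the end of the five-element list, which is why two of the four rows come back as `null`.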

You can combine them together:

  FROM inventory
| EVAL r=MATCHING_ROW(size, ["XS", "S", "M", "L", "XL"])
| EVAL avg_price=VALUE_AT(r, [null, 20.00, null, null, 70.00])
| DROP r
| WHERE price > avg_price

Which would yield:

Expensive-Shirt | 120.00 | XL | 70.00

If that looks familiar then you've been paying close attention! It's another join strategy, specifically one that makes sense when the data you are joining against is small. That is precisely what should happen for the INLINESTATS command, which we added to the grammar a long time ago but never implemented in the engine.
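
A self-contained Python sketch of the combined query shows why this amounts to a join against a small table. The helper names are hypothetical and the code only models the described semantics. It assumes that a `null` `avg_price` makes the `WHERE` comparison fail, so such rows are dropped.

```python
from typing import Optional, Sequence

sizes = ["XS", "S", "M", "L", "XL"]                # join keys, in offset order
avg_price_by_offset = [None, 20.00, None, None, 70.00]  # joined column, same order


def matching_row(value: str, constants: Sequence[str]) -> Optional[int]:
    # Offset of the key in the constant list, or None if it is not present.
    return constants.index(value) if value in constants else None


def value_at(index: Optional[int], values: Sequence[Optional[float]]) -> Optional[float]:
    # Value at the offset, or None when the offset is null or out of range.
    if index is None or index < 0 or index >= len(values):
        return None
    return values[index]


inventory = [
    ("Cool-Shirt", 20.00, "XL"),
    ("Expensive-Shirt", 120.00, "XL"),
    ("Cool-Shirt", 20.00, "S"),
]

result = []
for name, price, size in inventory:
    r = matching_row(size, sizes)                 # EVAL r=MATCHING_ROW(...)
    avg_price = value_at(r, avg_price_by_offset)  # EVAL avg_price=VALUE_AT(r, ...)
    # DROP r, then WHERE price > avg_price (a null avg_price never matches).
    if avg_price is not None and price > avg_price:
        result.append((name, price, size, avg_price))
```

The two constant lists together play the role of the small right-hand table: one holds the keys and the other holds the joined column, aligned by offset.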

Member Author

@nik9000 nik9000 left a comment

A guide for those looking: MATCHING_ROW is presently only implemented for a few types, but it's designed to delegate to BlockHash, which already has code for turning values into ordinals and resolving values to those ordinals. BlockHash is designed to solve the harder problem of turning streams of blocks into ordinals. I just needed to add the ability to look up values instead of adding them.
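
The difference between adding and looking up can be sketched with a toy ordinal table in Python. `OrdinalHash` and its methods are hypothetical names for illustration only. This is not the BlockHash API, and it works on single values rather than on blocks.

```python
from typing import Dict, Hashable, Optional


class OrdinalHash:
    """Toy value -> ordinal table with an add path and a read-only lookup path."""

    def __init__(self) -> None:
        self._ordinals: Dict[Hashable, int] = {}

    def add(self, value: Hashable) -> int:
        # Assign the next ordinal the first time a value is seen;
        # return the existing ordinal on later calls.
        return self._ordinals.setdefault(value, len(self._ordinals))

    def lookup(self, value: Hashable) -> Optional[int]:
        # Read-only: never grows the table. Unknown values yield None.
        return self._ordinals.get(value)

    def __len__(self) -> int:
        return len(self._ordinals)


table = OrdinalHash()
# Load the constant side once with add()...
constant_ordinals = [table.add(size) for size in ["XS", "S", "M", "L", "XL"]]
# ...then resolve incoming values with lookup(), which leaves the table unchanged.
incoming_ordinals = [table.lookup(size) for size in ["XL", "XL", "S", "XXL"]]
```

In this sketch the constant argument is loaded through `add`, and each row's value goes through `lookup`, so values that appear only in the data never enlarge the table.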

VALUE_AT abuses our syntax for multivalue fields to get them parsed as arrays. I then convert them to Blocks when building the executor factory. That makes it easy to copy values.

Now! Problems:

  1. I'm abusing the multivalue parsing syntax.
  2. I don't perform memory tracking on the blocks.
  3. I'd like to be able to push blocks in those parameters rather than List - that'd save a lot of memory.
  4. Obviously, I've not implemented INLINESTATS, just the data-node side of it.
  5. For this to work for INLINESTATS the MATCHING_ROW function needs to match all values of all columns - so MATCHING_ROW([1, 2], [1, 2, 3]) will return [0, 1] - but this will get multiplicative when combining more than one field. How do we make sure not to make huge Blocks?
  6. Do we want to expose these to people as functions or hide them as details of the INLINESTATS command? I could hide them behind a pragma for now so we don't have to make a choice. I do want to test them as individual functions kind of like I've done here.

Member Author

nik9000 commented Mar 11, 2024

5. For this to work for INLINESTATS the MATCHING_ROW function needs to match all values of all columns - so MATCHING_ROW([1, 2], [1, 2, 3]) will return [0, 1] - but this will get multiplicative when combining more than one field. How do we make sure not to make huge Blocks?

I think this isn't true. I think, at least for now, we're better off doing our standard stuff and only supporting single-valued fields for MATCHING_ROW and the INLINESTATS implementation can add an MV_EXPAND operation. Those are free for single-valued fields and they protect against the combinatorial explosion.

Member Author

nik9000 commented Apr 30, 2024

Replaced by the hash lookup and column lookup operators I've recently added.

@nik9000 nik9000 closed this Jul 2, 2024