sql: enable indexing and ordering on arrays of orderable and indexable types #48045
Conversation
❌ The GitHub CI (Cockroach) build has failed on 7811ade7. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Reviewed 2 of 11 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)
pkg/sql/opt/optbuilder/testdata/orderby, line 391 at r1 (raw file):
error (42P01): no data source matches prefix: a build
Can you also add a test here for arrays of type json?
We don't currently support arrays of json (#23468)
Nice, this will be a great improvement!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)
pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):
# TODO (rohany): We have always done this for array comparisons, so I think
# it would be a breaking change + opposite with our other null behavior to
# change it suddenly...
I agree. Can you check that Postgres changes the null sorting behavior within arrays when you change NULLS FIRST / NULLS LAST?
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):
// encodeArrayKey generates an ordered key encoding of an array.
// The encoding format for an array [a, b] is as follows:
// [arrayMarker, elementMarker, enc(a), elementMarker, enc(b), terminator]
I don't understand why we need the markers. Wouldn't the sorting work just as well if we simply did [enc(a), enc(b)]
in your example?
It looks like arrays are supposed to sort lexicographically - it's element by element comparison, and longer arrays lose. But I think that that would already be the case with normal key encoding:
[enc(a), enc(b)]
would be less than [enc(a), enc(b), anything else]
just by the normal lexicographical sorting that keys use.
Does NULL somehow break this?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)
pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I agree. Can you check that Postgres changes the null sorting behavior within arrays when you change NULLS FIRST / NULLS LAST?
It doesn't seem to:
rohany=# create table t (x int[]);
CREATE TABLE
rohany=# insert into t values (array[1]), (array[null::int]);
INSERT 0 2
rohany=# select * from t order by x
rohany-# ;
x
--------
{1}
{NULL}
(2 rows)
rohany=# select * from t order by x NULLS first;
x
--------
{1}
{NULL}
(2 rows)
rohany=# select * from t order by x NULLS last;
x
--------
{1}
{NULL}
(2 rows)
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I don't understand why we need the markers. Wouldn't the sorting work just as well if we simply did [enc(a), enc(b)] in your example?
It looks like arrays are supposed to sort lexicographically - it's element by element comparison, and longer arrays lose. But I think that that would already be the case with normal key encoding: [enc(a), enc(b)] would be less than [enc(a), enc(b), anything else] just by the normal lexicographical sorting that keys use.
Does NULL somehow break this?
Yeah, the issue is NULLs. NULLs are encoded as the byte `0x00`. If we don't have the markers, then NULL would conflict with the array terminator. But we can't just add a marker for NULL only, because then it wouldn't sort correctly against non-null elements. Am I missing something straightforward here?
❌ The GitHub CI (Cockroach) build has failed on e9ad56d9. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)
pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):
Previously, rohany (Rohan Yadav) wrote…
It doesn't seem to:
rohany=# create table t (x int[]);
CREATE TABLE
rohany=# insert into t values (array[1]), (array[null::int]);
INSERT 0 2
rohany=# select * from t order by x;
x
--------
{1}
{NULL}
(2 rows)
rohany=# select * from t order by x NULLS first;
x
--------
{1}
{NULL}
(2 rows)
rohany=# select * from t order by x NULLS last;
x
--------
{1}
{NULL}
(2 rows)
Ok. That's a little bit unfortunate, as it might mean that we're supposed to do this no matter what. But I guess I agree that we can't change our behavior now anyway.
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):
Previously, rohany (Rohan Yadav) wrote…
Yeah, the issue is NULLs. NULLs are encoded as the byte `0x00`. If we don't have the markers, then NULL would conflict with the array terminator. But we can't just add a marker for NULL only, because then it wouldn't sort correctly against non-null elements. Am I missing something straightforward here?
I'm not sure yet. By the way, how would this interact with multi-column keys? Maybe my lexicographic ordering comment was wrong, since of course keys can be longer than just a single column. How does the ordering distinguish between short and long columns when there could be a second column afterward?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)
pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Ok. That's a little bit unfortunate, as it might mean that we're supposed to do this no matter what. But I guess I agree that we can't change our behavior now anyway.
We'd have to do something a little wonky, like switching the NULL encoding direction when encoding arrays, to support this.
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I'm not sure yet. By the way how would this interact with multi-column keys? Maybe my lexicographic ordering comment was wrong, since of course keys can be longer than just a single column. How does the ordering distinguish between short and long columns when there could be a second column afterward?
There is an array terminator byte, so that's how we know. I'll add some tests for it though.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):
Previously, rohany (Rohan Yadav) wrote…
There is an array terminator byte, so that's how we know. I'll add some tests for it though.
I see.
Could we encode NULL as a byte that's greater than `0x00` but less than all other values? It looks like we could use `0x01`, which is reserved for `encodedNotNull` and is never actually used in a key.
PTAL, updated the encoding strategy to not use the extra marker bytes between elements, and instead use a different NULL encoding within the array.
❌ The GitHub CI (Cockroach) build has failed on 5ac7528b. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
❌ The GitHub CI (Cockroach) build has failed on 2ee17393. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)
pkg/sql/logictest/testdata/logic_test/array, line 1319 at r2 (raw file):
subtest array_indexes
Other testing ideas:
- arrays w/ composite types (decimal)
- arrays w/ strings? cuz their encoding is escape based - if we changed our decoder to try to search for an endpoint it would fail
- more than 1 array in a single index
- group by, distinct, join on array (or hmm we already added those, right?)
- foreign key to array
- lookup join to array? i'm not sure if these are necessary... just thinking
- can you add execbuilder tests for the array scans that show the spans that get generated for array range queries?
pkg/util/encoding/encoding.go, line 1294 at r2 (raw file):
TimeTZ Type = 19
Geo Type = 20
ArrayKeyAsc Type = 21 // Array key encoding
But what about the existing `Array` byte? I still don't understand why we can't reuse it.
Also, how come most of these other fields don't have `Desc` equivalents? What makes array/string/etc special?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)
pkg/sql/logictest/testdata/logic_test/array, line 1319 at r2 (raw file):
> arrays w/ composite types (decimal)

Already included.

> arrays w/ strings? cuz their encoding is escape based - if we changed our decoder to try to search for an endpoint it would fail

Done, added bytes too.

> more than 1 array in a single index

Done.

> group by, distinct, join on array (or hmm we already added those, right?)

Group by and distinct are tested; added join tests.

> foreign key to array

Done -- this helped to catch a bug!

> lookup join to array? i'm not sure if these are necessary... just thinking

Done.

> can you add execbuilder tests for the array scans that show the spans that get generated for array range queries?

Done.
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I see.
Could we encode NULL as a byte that's greater than `0x00` but less than all other values? It looks like we could use `0x01`, which is reserved for `encodedNotNull` and is never actually used in a key.
Done.
pkg/util/encoding/encoding.go, line 1294 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
But what about the existing `Array` byte? I still don't understand why we can't reuse it.
Also, how come most of these other fields don't have `Desc` equivalents? What makes array/string/etc special?
RE the existing Array byte -- I wanted to get rid of this, but there are some places where we lose information when we do that. For example, when attempting to pretty print an array key, we need to know whether it is a key or value encoding of an array, where if we just use the existing Array type we won't have that information.
RE the others, I think the decoding logic can just figure out what to do based on the encodings themselves. If anything, I think that these type bytes are mostly exported views of the internal marker bytes.
Force-pushed from e2e9cfc to 1196f7f (Compare)
❌ The GitHub CI (Cockroach) build has failed on 1196f7f5. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
❌ The GitHub CI (Cockroach) build has failed on c6cc1315. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
I'm excited about this work!
Ok looks good but let's just get rid of the desc tag and type everywhere - we can do as the other types do when they need descending encoding.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)
pkg/sql/rowexec/version_history.txt, line 114 at r3 (raw file):
- Version: 29 (MinAcceptedVersion: 29)
- The hashing of DArray datums was changed and is incompatible with the existing method of hasing DArrays.
nit: s/hasing/hashing/
pkg/sql/sqlbase/column_type_encoding_test.go, line 213 at r3 (raw file):
// Also run the property on arrays possibly containing NULL values.
// The random generator in the property above does not generate NULLs.
properties.Property("order-preserving-arrays", prop.ForAll(
Heh how do you like using this Gopter thing
pkg/util/encoding/encoding.go, line 88 at r3 (raw file):
// Markers and terminators for key encoding Datum arrays in sorted order.
arrayKeyMarker = geoMarker + 1
arrayKeyDescendingMarker = arrayKeyMarker + 1
I think that we don't need these descending markers. See discussion here: https://cockroachlabs.slack.com/archives/C0HM2DZ34/p1588718172047500?thread_ts=1588717530.044900&cid=C0HM2DZ34
The tl;dr is that the descending thing isn't necessary since we no longer have a self-describing key format. The fact that bytesDesc exists was an early attempt and isn't strictly necessary.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)
pkg/sql/sqlbase/column_type_encoding_test.go, line 213 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Heh how do you like using this Gopter thing
lol it's pretty nice -- tho it's definitely super wonky to start with.
pkg/util/encoding/encoding.go, line 88 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I think that we don't need these descending markers. See discussion here: https://cockroachlabs.slack.com/archives/C0HM2DZ34/p1588718172047500?thread_ts=1588717530.044900&cid=C0HM2DZ34
The tl;dr is that the descending thing isn't necessary since we no longer have a self-describing key format. The fact that bytesDesc exists was an early attempt and isn't strictly necessary.
Added a comment about this.
❌ The GitHub CI (Cockroach) build has failed on 08ac641a. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
❌ The GitHub CI (Cockroach) build has failed on fdae19e4. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
…e types

Fixes cockroachdb#17154. Fixes cockroachdb#35707.

This PR enables arrays to be ordered and indexed by introducing an ordered key encoding for arrays. Once this exists, the rest of the SQL infrastructure is ready to handle indexing and ordering on arrays.

To encode an array of elements `ARRAY[a, b]`, we create the following encoding. Let `AM` = a marker byte for arrays and let `AT` be a terminator byte.

enc(ARRAY[a, b]) = [AM, enc(a), enc(b), AT]

The key is that the terminator is less than the element marker. This allows for the "prefix matching" style comparison that arrays support.

Release note (sql change): This PR adds support for indexing and ordering of arrays of indexable and orderable inner types.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)
pkg/sql/sqlbase/column_type_encoding.go, line 927 at r4 (raw file):
// encodeArrayKey generates an ordered key encoding of an array.
// The encoding format for an array [a, b] is as follows:
// [arrayMarker, enc(a), enc(b), terminator].
I'm afraid this is still going to cause problems because of string encoding.
In fact, I'm confused how it didn't already break your string test cases. Encoded strings are terminated with the sequence \x00\x01. But our terminator is also \x00, isn't it? What am I missing...
The important thing is the context of what is being decoded. When we are decoding a string, we know we are looking for \x00\x01 as the terminator. Once we return from decoding a string, we go back to the context of an array and interpret \x00 as a terminator. Additionally, the string encodings escape \x00 themselves.
Oh yeah. True! And our Skip/GetLength functions actually iterate over the whole thing - not searching for the end directly. I do feel like that's not ideal... it would be better to have a fixed end so skipping / finding length could be just a simple bytes search. But maybe it's fine, since we don't even expect that we'll be seeing so many array keys.
Well they can't skip, because we can't encode the length of the bytes and strings at the front of the string/byte sequence. It seems they use some faster assembly implemented function to do it (bytes.IndexByte or something).
Right, I'm saying we could use
Ohh, I see what you mean. It would be tricky if that byte lived within other encoded data though -- we'd need our own escaping as well.
Right. I guess it's not worth it. LGTM
bors r+
Build failed (retrying...)
Build failed (retrying...)
bors r-
Canceled
bors r+
Build succeeded
Fixes #17154.
Fixes #35707.
This PR enables arrays to be ordered and indexed by
introducing an ordered key encoding for arrays.
Once this exists, the rest of the SQL infrastructure
is ready to handle indexing and ordering on arrays.
To encode an array of elements `ARRAY[a, b]`, we create the following encoding. Let `AM` = a marker byte for arrays and let `AT` be a terminator byte.

enc(ARRAY[a, b]) = [AM, enc(a), enc(b), AT]

The key is that the terminator is less than the element marker. This allows for the "prefix matching" style comparison that arrays support.
Release note (sql change): This PR adds support for indexing
and ordering of arrays of indexable and orderable inner types.