Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: enable indexing and ordering on arrays of orderable and indexable types #48045

Merged
merged 1 commit into from
May 9, 2020

Conversation

rohany
Copy link
Contributor

@rohany rohany commented Apr 26, 2020

Fixes #17154.
Fixes #35707.

This PR enables arrays to be ordered and indexed by
introducing an ordered key encoding for arrays.
Once this exists, the rest of the SQL infrastructure
is ready to handle indexing and ordering on arrays.

To encode an array of elements ARRAY[a, b],
we create the following encoding.

Let AM = a marker byte for arrays and let AT be a terminator byte.

enc(ARRAY[a, b]) = [AM, enc(a), enc(b), AT]

The key is that the terminator is less than the element marker.
This allows for the "prefix matching" style comparison that
arrays support.

Release note (sql change): This PR adds support for indexing
and ordering of arrays of indexable and orderable inner types.

@rohany rohany requested review from jordanlewis, otan and a team April 26, 2020 03:41
@rohany rohany requested a review from a team as a code owner April 26, 2020 03:41
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@blathers-crl
Copy link

blathers-crl bot commented Apr 26, 2020

❌ The GitHub CI (Cockroach) build has failed on 7811ade7.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)


pkg/sql/opt/optbuilder/testdata/orderby, line 391 at r1 (raw file):

error (42P01): no data source matches prefix: a

build

Can you also add a test here for arrays of type json?

@rohany
Copy link
Contributor Author

rohany commented Apr 26, 2020

We don't currently support arrays of json (#23468)

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this will be a great improvement!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)


pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):

# TODO (rohany): We have always done this for array comparisons, so I think
#  it would be a breaking change + opposite with our other null behavior to
#  change it suddenly...

I agree. Can you check that Postgres changes the null sorting behavior within arrays when you change NULLS FIRST / NULLS LAST?


pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):

// encodeArrayKey generates an ordered key encoding of an array.
// The encoding format for an array [a, b] is as follows:
// [arrayMarker, elementMarker, enc(a), elementMarker, enc(b), terminator]

I don't understand why we need the markers. Wouldn't the sorting work just as well if we simply did [enc(a), enc(b)] in your example?

It looks like arrays are supposed to sort lexicographically - it's element by element comparison, and longer arrays lose. But I think that that would already be the case with normal key encoding:

[enc(a), enc(b)] would be less than [enc(a), enc(b), anything else] just by the normal lexicographical sorting that keys use.

Does NULL somehow break this?

Copy link
Contributor Author

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)


pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I agree. Can you check that Postgres changes the null sorting behavior within arrays when you change NULLS FIRST / NULLS LAST?

It doesn't seem to:

rohany=# create table t (x int[]);
CREATE TABLE
rohany=# insert into t values (array[1]), (array[null::int]);
INSERT 0 2
rohany=# select * from t order by x
rohany-# ;
   x
--------
 {1}
 {NULL}
(2 rows)

rohany=# select * from t order by x NULLS first;
   x
--------
 {1}
 {NULL}
(2 rows)

rohany=# select * from t order by x NULLS last;
   x
--------
 {1}
 {NULL}
(2 rows)

pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I don't understand why we need the markers. Wouldn't the sorting work just as well if we simply did [enc(a), enc(b)] in your example?

It looks like arrays are supposed to sort lexicographically - it's element by element comparison, and longer arrays lose. But I think that that would already be the case with normal key encoding:

[enc(a), enc(b)] would be less than [enc(a), enc(b), anything else] just by the normal lexicographical sorting that keys use.

Does NULL somehow break this?

Yeah, the issue is NULLs. Nulls are encoded as the byte 0x00. If we don't have the markers, then null would conflict with the array terminator. But we can't just add a marker for NULL only, because then it wouldn't sort right against non-null elements. Am I missing something straightforward here?

@blathers-crl
Copy link

blathers-crl bot commented Apr 26, 2020

❌ The GitHub CI (Cockroach) build has failed on e9ad56d9.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)


pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

It doesn't seem to:

rohany=# create table t (x int[]);
CREATE TABLE
rohany=# insert into t values (array[1]), (array[null::int]);
INSERT 0 2
rohany=# select * from t order by x
rohany-# ;
   x
--------
 {1}
 {NULL}
(2 rows)

rohany=# select * from t order by x NULLS first;
   x
--------
 {1}
 {NULL}
(2 rows)

rohany=# select * from t order by x NULLS last;
   x
--------
 {1}
 {NULL}
(2 rows)

Ok. That's a little bit unfortunate as it might mean that we're supposed to do this no matter what. But I guess I agree that we cant change our behavior now anyway.


pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

Yeah, the issue is NULLs. Nulls are encoded as the byte 0x00. If we don't have the markers, then null would conflict with the array terminator. But we can't just add a marker for NULL only, because then it wouldn't sort right against non-null elements. Am I missing something straightforward here?

I'm not sure yet. By the way how would this interact with multi-column keys? Maybe my lexicographic ordering comment was wrong, since of course keys can be longer than just a single column. How does the ordering distinguish between short and long columns when there could be a second column afterward?

Copy link
Contributor Author

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/logictest/testdata/logic_test/array, line 1352 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Ok. That's a little bit unfortunate as it might mean that we're supposed to do this no matter what. But I guess I agree that we cant change our behavior now anyway.

We'd have to do something a little wonky like switch the null encoding direction when encoding arrays to support this


pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I'm not sure yet. By the way how would this interact with multi-column keys? Maybe my lexicographic ordering comment was wrong, since of course keys can be longer than just a single column. How does the ordering distinguish between short and long columns when there could be a second column afterward?

there is an array terminator byte, so thats how we know. I'll add some tests for it though.

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)


pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

there is an array terminator byte, so thats how we know. I'll add some tests for it though.

I see.

Could we encode null as a byte that's greater than 0x00, but less than all other values? It looks like we could use 0x01, which is reserved for encodedNotNull which is never actually used in a key.

@rohany
Copy link
Contributor Author

rohany commented Apr 28, 2020

PTAL, updated the encoding strategy to not use the extra marker bytes between elements, and instead use a different null encoding within the array

@blathers-crl
Copy link

blathers-crl bot commented Apr 28, 2020

❌ The GitHub CI (Cockroach) build has failed on 5ac7528b.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl
Copy link

blathers-crl bot commented Apr 28, 2020

❌ The GitHub CI (Cockroach) build has failed on 2ee17393.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)


pkg/sql/logictest/testdata/logic_test/array, line 1319 at r2 (raw file):


subtest array_indexes

Other testing ideas:

arrays w/ composite types (decimal)
arrays w/ strings? cuz their encoding is escape based - if we changed our decoder to try to search for an endpoint it would fail
more than 1 array in a single index
group by, distinct, join on array (or hmm we already added those, right?)
foreign key to array
lookup join to array? i'm not sure if these are necessary... just thinking

can you add execbuilder tests for the array scans that show the spans that get generated for array range queries?


pkg/util/encoding/encoding.go, line 1294 at r2 (raw file):

	TimeTZ       Type = 19
	Geo          Type = 20
	ArrayKeyAsc  Type = 21 // Array key encoding

But what about the existing Array byte? I still don't understand why we can't reuse it.

Also, how come most of these other fields don't have Desc equivalents? What makes array/string/etc special?

Copy link
Contributor Author

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/logictest/testdata/logic_test/array, line 1319 at r2 (raw file):

arrays w/ composite types (decimal)

already included

arrays w/ strings? cuz their encoding is escape based - if we changed our decoder to try to search for an endpoint it would fail

Done, added bytes too.

more than 1 array in a single index

Done

group by, distinct, join on array (or hmm we already added those, right?)

groupby and distinct are tested, added join tests

foreign key to array

Done -- this helped to catch a bug!

lookup join to array? i'm not sure if these are necessary... just thinking

Done

can you add execbuilder tests for the array scans that show the spans that get generated for array range queries?

done


pkg/sql/sqlbase/column_type_encoding.go, line 927 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I see.

Could we encode null as a byte that's greater than 0x00, but less than all other values? It looks like we could use 0x01, which is reserved for encodedNotNull which is never actually used in a key.

Done.


pkg/util/encoding/encoding.go, line 1294 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

But what about the existing Array byte? I still don't understand why we can't reuse it.

Also, how come most of these other fields don't have Desc equivalents? What makes array/string/etc special?

RE the existing Array byte -- I wanted to get rid of this, but there are some places where we lose information when we do that. For example, when attempting to pretty print an array key, we need to know whether it is a key or value encoding of an array, where if we just use the existing Array type we won't have that information.

RE the others, I think the decoding logic can just figure out what to do based on the encodings themselves. If anything, i think that these type bytes are mostly exported views of the internal marker bytes.

@rohany rohany force-pushed the array-encodings branch 2 times, most recently from e2e9cfc to 1196f7f Compare April 29, 2020 01:01
@blathers-crl
Copy link

blathers-crl bot commented Apr 29, 2020

❌ The GitHub CI (Cockroach) build has failed on 1196f7f5.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl
Copy link

blathers-crl bot commented Apr 29, 2020

❌ The GitHub CI (Cockroach) build has failed on 1196f7f5.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl
Copy link

blathers-crl bot commented Apr 29, 2020

❌ The GitHub CI (Cockroach) build has failed on c6cc1315.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@awoods187
Copy link
Contributor

I'm excited about this work!

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok looks good but let's just get rid of the desc tag and type everywhere - we can do as the other types do when they need descending encoding.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rohany)


pkg/sql/rowexec/version_history.txt, line 114 at r3 (raw file):

- Version: 29 (MinAcceptedVersion: 29)
    - The hashing of DArray datums was changed and is incompatible with the
      existing method of hasing DArrays.

nit: s/hasing/hashing/


pkg/sql/sqlbase/column_type_encoding_test.go, line 213 at r3 (raw file):

	// Also run the property on arrays possibly containing NULL values.
	// The random generator in the property above does not generate NULLs.
	properties.Property("order-preserving-arrays", prop.ForAll(

Heh how do you like using this Gopter thing


pkg/util/encoding/encoding.go, line 88 at r3 (raw file):

	// Markers and terminators for key encoding Datum arrays in sorted order.
	arrayKeyMarker                    = geoMarker + 1
	arrayKeyDescendingMarker          = arrayKeyMarker + 1

I think that we don't need these descending markers. See discussion here: https://cockroachlabs.slack.com/archives/C0HM2DZ34/p1588718172047500?thread_ts=1588717530.044900&cid=C0HM2DZ34

The tl;dr is that the descending thing isn't necessary since we no longer have a self-describing key format. The fact that bytesDesc exists was an early attempt and isn't strictly necessary.

Copy link
Contributor Author

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/sqlbase/column_type_encoding_test.go, line 213 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Heh how do you like using this Gopter thing

lol its pretty nice -- tho it's definitely super wonky to start with.


pkg/util/encoding/encoding.go, line 88 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think that we don't need these descending markers. See discussion here: https://cockroachlabs.slack.com/archives/C0HM2DZ34/p1588718172047500?thread_ts=1588717530.044900&cid=C0HM2DZ34

The tl;dr is that the descending thing isn't necessary since we no longer have a self-describing key format. The fact that bytesDesc exists was an early attempt and isn't strictly necessary.

added a comment about this

@rohany rohany force-pushed the array-encodings branch from 7206021 to 08ac641 Compare May 6, 2020 01:29
@blathers-crl
Copy link

blathers-crl bot commented May 6, 2020

❌ The GitHub CI (Cockroach) build has failed on 08ac641a.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@rohany rohany force-pushed the array-encodings branch from 08ac641 to fdae19e Compare May 6, 2020 02:41
@blathers-crl
Copy link

blathers-crl bot commented May 6, 2020

❌ The GitHub CI (Cockroach) build has failed on fdae19e4.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

…e types

Fixes cockroachdb#17154.
Fixes cockroachdb#35707.

This PR enables arrays to be ordered and indexed by
introducing an ordered key encoding for arrays.
Once this exists, the rest of the SQL infrastructure
is ready to handle indexing and ordering on arrays.

To encode an array of elements `ARRAY[a, b]`,
we create the following encoding.

Let `AM` = a marker byte for arrays and let `AT` be a terminator byte.

`enc(ARRAY[a, b]) = [AM, enc(a), enc(b), AT]`

The key is that the terminator is less than the element marker.
This allows for the "prefix matching" style comparison that
arrays support.

Release note (sql change): This PR adds support for indexing
and ordering of arrays of indexable and orderable inner types.
@rohany rohany force-pushed the array-encodings branch from fdae19e to 60cb535 Compare May 6, 2020 03:03
Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @rohany)


pkg/sql/sqlbase/column_type_encoding.go, line 927 at r4 (raw file):

// encodeArrayKey generates an ordered key encoding of an array.
// The encoding format for an array [a, b] is as follows:
// [arrayMarker, enc(a), enc(b), terminator].

I'm afraid this is still going to cause problems because of string encoding.

In fact, I'm confused how it didn't already break your string test cases. Encoded strings are terminated with the sequence \x00\x01. But our terminator is also \x00, isn't it? What am I missing...

@rohany
Copy link
Contributor Author

rohany commented May 7, 2020

The important thing is the context in what is attempting to be decoded. When we are decoding a string, we know we are looking for \x00\x01 as the terminator. once we return from decoding a string, we go back to the context of an array, and interpret \x00 as a terminator. Additionally, the string encodings escape \x00 themselves.

@rohany rohany requested a review from jordanlewis May 7, 2020 19:40
@jordanlewis
Copy link
Member

Oh yeah. True! And our Skip/GetLength functions actually iterate over the whole thing - not searching for the end directly. I do feel like that's not ideal... it would be better to have a fixed end so skipping / finding length could be just a simple bytes search. But maybe it's fine, since we don't even expect that we'll be seeing so many array keys.

@rohany
Copy link
Contributor Author

rohany commented May 7, 2020

Well they can't skip, because we can't encode the length of the bytes and strings at the front of the string/byte sequence. It seems they use some faster assembly implemented function to do it (bytes.IndexByte or something).

@jordanlewis
Copy link
Member

Right, I'm saying we could use IndexByte if we went with a scheme where there was a fixed end byte. But I guess this is probably not worth it.

@rohany
Copy link
Contributor Author

rohany commented May 7, 2020

Ohh, I see what you mean. It would be tricky if that byte lived within other encoded data though -- we'd need our own escaping as well.

@jordanlewis
Copy link
Member

Right. I guess it's not worth it.

LGTM :shipit:

@rohany
Copy link
Contributor Author

rohany commented May 8, 2020

bors r+

@craig
Copy link
Contributor

craig bot commented May 8, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented May 8, 2020

Build failed (retrying...)

@rohany
Copy link
Contributor Author

rohany commented May 9, 2020

bors r-

@craig
Copy link
Contributor

craig bot commented May 9, 2020

Canceled

@rohany
Copy link
Contributor Author

rohany commented May 9, 2020

bors r+

@craig
Copy link
Contributor

craig bot commented May 9, 2020

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: support ordering by array columns sql: allow ARRAYs in forward indexes
5 participants