Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UniqueIndex range_de #500

Merged
merged 43 commits into from
Nov 24, 2021
Merged

UniqueIndex range_de #500

merged 43 commits into from
Nov 24, 2021

Conversation

uint
Copy link
Contributor

@uint uint commented Oct 20, 2021

Deals with #461

@uint uint requested a review from maurolacy October 20, 2021 06:18
@uint uint changed the base branch from main to 461-indexedmap-range_de October 20, 2021 06:19
packages/storage-plus/src/indexes.rs Outdated Show resolved Hide resolved
packages/storage-plus/src/indexes.rs Outdated Show resolved Hide resolved
packages/storage-plus/src/indexes.rs Outdated Show resolved Hide resolved
packages/storage-plus/src/indexes.rs Outdated Show resolved Hide resolved
packages/storage-plus/src/indexes.rs Show resolved Hide resolved
packages/storage-plus/src/indexes.rs Outdated Show resolved Hide resolved
@uint uint mentioned this pull request Oct 20, 2021
Base automatically changed from 461-indexedmap-range_de to main October 20, 2021 17:24
@maurolacy maurolacy self-assigned this Oct 26, 2021
@maurolacy maurolacy marked this pull request as ready for review October 26, 2021 13:54
@maurolacy
Copy link
Contributor

maurolacy commented Oct 27, 2021

OK, there's now a working impl of range_de / prefix_de for UniqueIndex and MultiIndex.

These impls come with some shortcomings / gotchas:

UniqueIndex: The PK type needs to be specified, in order to deserialize the pk to it. It comes with a default of (), which means, no deserialization / data will be provided for the primary key. Which could be good / useful for performance. This still needs to be tested, but should work well.

MultiIndex: The last element of the index tuple must be specified with the type you want it to be deserialized. That is, the last tuple element serves as a marker for the deserialization type (in the same way PK does it in UniqueIndex). This is not so convenient. The index function has to return a K (index key) composed of the proper types, and this implies converting a byte slice to the proper type, etc. This is done here in this way just for simplicity.

Clearly (as @uint said to me in chat), these indexes (particularly MultiIndex) merit a re-design / serious refactoring. To name a few points:

  • First, let's refactor UniqueIndex and MultiIndex into their own, different files, for ease of modification and maintenance. (Refactor UniqueIndex and MultiIndex into their own files #530) (done)
  • In MultiIndex, remove the pk from the index key spec. That is, handle index key multiplicity internally, using the pk or another method (I think this is doable, and I plan to address it in another iteration). Alternatively, make the index function type-aware on the pk element. That is, instead of using a byte slice, use a proper type for the pk element. (Remove the primary key from the MultiIndex key specification #533)
  • For MultiIndex, avoid querying the store twice to get the value associated with a pk. That means, storing the value in the index key directly (like UniqueIndex does). This will be clostlier in terms of storage, but more efficient / performant for iterations. It will also be simpler to implement (and more robust / clear). (no. We want that to handle updated / removed values without re-indexing).
  • Signal the pk deserialization type of MultiIndex using a PK trait or similar, like in UniqueIndex. In passing, rename K to IK (Index Key) for clarity. Then, make those PK traits automatic / inherited from the IndexedMap specification. That means, better encapsulation / coupling between IndexedMap and the *Index impls. (Improve MultiIndex pk deserialization #531)
  • Unify the key returned by UniqueIndex and MultiIndex. Currently, UniqueIndex returns just the pk, whereas MultiIndex returns the remaining of the key, including the pk as last element. Either make both return just the pk, or make UniqueIndex behave like MultiIndex (better for usability / options). (UniqueIndex / MultiIndex key consistency #532)

Will create issues with these, and they can be addressed eventually.

@maurolacy
Copy link
Contributor

maurolacy commented Oct 27, 2021

Also, there are a number of impls / methods that are still missing, for IndexedMap, SnapshotMap, and IndexedSnapshotMap.

Let's address those in another PR, though. (see #461 (comment))

@maurolacy
Copy link
Contributor

maurolacy commented Oct 27, 2021

@uint I cannot put you as reviewer, but please take a look. Also, some documentation on this (besides tests) is still missing. Let's add it before merging. Along with more tests.

@maurolacy maurolacy force-pushed the 461-uniqueindex-range_de branch 2 times, most recently from f106b3b to 5fb6ec5 Compare October 27, 2021 13:34
@ethanfrey
Copy link
Member

Some comments here... let's not get too overboard with more issues.

TL;DR: there is some good cleanup here that should be done. There are some larger design changes I would put off to later, and do in a non-breaking way, and support multiple MultiIndex approaches. But we should document them better for sure.

OK, there's now a working impl of range_de / prefix_de for UniqueIndex and MultiIndex.

These impls come with some shortcomings / gotchas:

UniqueIndex: The PK type needs to be specified, in order to deserialize the pk to it. It comes with a default of (), which means, no deserialization / data will be provided for the primary key. Which could be good / useful for performance. This still needs to be tested, but should work well.

Fair enough for now. But usually there is some critical info in the pk that is not in the value (like owner's address). We should deal with this in the future.

MultiIndex: The last element of the index tuple must be specified with the type you want it to be deserialized. That is, the last tuple element serves as a marker for the deserialization type (in the same way PK does it in UniqueIndex). This is not so convenient. The index function has to return a K (index key) composed of the proper types, and this implies converting a byte slice to the proper type, etc. This is done here in this way just for simplicity.

Ah, the 1 we store to hold it's place... yeah, making this "work" is okay. We could try to make it prettier, but that will probably be state-breaking (forcing a migration of all indexes), so we should do it asap or never. (Before many people build 1.0 contracts and cannot update to next cw-plus.

Clearly (as @uint said to me in chat), these indexes (particularly MultiIndex) merit a re-design / serious refactoring. To name a few points:

  • First, let's refactor UniqueIndex and MultiIndex into their own, different files, for ease of modification and maintenance.

Sounds good.

  • In MultiIndex, remove the pk from the index key spec. That is, handle index key multiplicity internally, using the pk or another method (I think this is doable, and I plan to address it in another iteration). Alternatively, make the index function type-aware on the pk element. That is, instead of using a byte slice, use a proper type for the pk element.

Huh? We use the pk as a minimal viable differentiation between the different indexed values. The considered alternative was to have one index, and point to a Vec of all pk's that fill that index. For small number of items at one index it is better, but becomes almost unusable if there are 1000s at one index.

We could provide 2 different implementations, one that uses pk in the key, other that stores Vec<Vec<u8>> for all pks as value. I would only use the second when you are sure some malicious actor cannot feed 1000s of values in there.

  • For MultiIndex, avoid querying the store twice to get the value associated with a pk. That means, storing the value in the index key directly (like UniqueIndex does). This will be clostlier in terms of storage, but more efficient / performant for iterations. It will also be simpler to implement (and more robust / clear).

Storing the value directly in the index has an issue... every time you update the type, you must update all references. Currently, if you modify a value, but it doesn't change the indexes (you change different fields), those are not updated at all (at least that was the intention). Again, this is a design trade-off and I very consciously made this choice. Maybe the other one is valid as well.

The idea of doing some cleanup (multiple files) and then providing multiple MultiIndex implementations does seem okay.

  • Signal the pk deserialization type of MultiIndex using a PK trait or similar, like in UniqueIndex. In passing, rename K to IK (Index Key) for clarity. Then, make those PK traits automatic / inherited from the IndexedMap specification. That means, better encapsulation / coupling between IndexedMap and the *Index impls.

Makes sense

  • Unify the key returned by UniqueIndex and MultiIndex. Currently, UniqueIndex returns just the pk, whereas MultiIndex returns the remaining of the key, including the pk as last element. Either make both return just the pk, or make UniqueIndex behave like MultiIndex (better for usability / options).

Very good point.

@maurolacy
Copy link
Contributor

maurolacy commented Oct 27, 2021

TL;DR: there is some good cleanup here that should be done. There are some larger design changes I would put off to later, and do in a non-breaking way, and support multiple MultiIndex approaches. But we should document them better for sure.

Agreed.

Fair enough for now. But usually there is some critical info in the pk that is not in the value (like owner's address). We should deal with this in the future.

Just to be clear: You can get those pks deserialized, but you need to specify their type using the PK generic. It's just that we have a PK = () in the struct definition, for backwards compatibility.

MultiIndex: The last element of the index tuple must be specified with the type you want it to be deserialized. That is, the last tuple element serves as a marker for the deserialization type (in the same way PK does it in UniqueIndex). This is not so convenient. The index function has to return a K (index key) composed of the proper types, and this implies converting a byte slice to the proper type, etc. This is done here in this way just for simplicity.

Ah, the 1 we store to hold it's place... yeah, making this "work" is okay. We could try to make it prettier, but that will probably be state-breaking (forcing a migration of all indexes), so we should do it asap or never. (Before many people build 1.0 contracts and cannot update to next cw-plus.

OK. I think the best approach would be to separate the pk from the index, at the generic types level (i.e. having a proper PK generic in the multi index definition). Can work on this soon, as I have a good overview on how to do this now.

  • In MultiIndex, remove the pk from the index key spec. That is, handle index key multiplicity internally, using the pk or another method (I think this is doable, and I plan to address it in another iteration). Alternatively, make the index function type-aware on the pk element. That is, instead of using a byte slice, use a proper type for the pk element.

Huh? We use the pk as a minimal viable differentiation between the different indexed values. The considered alternative was to have one index, and point to a Vec of all pk's that fill that index. For small number of items at one index it is better, but becomes almost unusable if there are 1000s at one index.

Maybe I wasn't clear. I meant, removing the need to specify the pk in the index tuple. I.e. handling key multiplicity internally / opaquely to the user. I think this can be done, and have an idea on how to do it (related to my comment above).

We could provide 2 different implementations, one that uses pk in the key, other that stores Vec<Vec<u8>> for all pks as value. I would only use the second when you are sure some malicious actor cannot feed 1000s of values in there.

  • For MultiIndex, avoid querying the store twice to get the value associated with a pk. That means, storing the value in the index key directly (like UniqueIndex does). This will be clostlier in terms of storage, but more efficient / performant for iterations. It will also be simpler to implement (and more robust / clear).

Storing the value directly in the index has an issue... every time you update the type, you must update all references. Currently, if you modify a value, but it doesn't change the indexes (you change different fields), those are not updated at all (at least that was the intention). Again, this is a design trade-off and I very consciously made this choice. Maybe the other one is valid as well.

You are right, I hadn't thought of that. It's costlier in terms of updates / removals. And, it has to be done for all the multi indexes. I like the "pk indirection" approach more now.

The idea of doing some cleanup (multiple files) and then providing multiple MultiIndex implementations does seem okay.

  • Signal the pk deserialization type of MultiIndex using a PK trait or similar, like in UniqueIndex. In passing, rename K to IK (Index Key) for clarity. Then, make those PK traits automatic / inherited from the IndexedMap specification. That means, better encapsulation / coupling between IndexedMap and the *Index impls.

Makes sense

The last part about encapsulation / coupling is not fully clear to me yet, but it can probably be done with some add_index helper or so.

  • Unify the key returned by UniqueIndex and MultiIndex. Currently, UniqueIndex returns just the pk, whereas MultiIndex returns the remaining of the key, including the pk as last element. Either make both return just the pk, or make UniqueIndex behave like MultiIndex (better for usability / options).

Very good point.

I can work on this stuff next month, if there's not so much urgency with contracts work. I would really like to put this in a better / more consistent / more user friendly shape asap.

@uint
Copy link
Contributor Author

uint commented Oct 28, 2021

@maurolacy Just looked through this. I don't think I have any specific nitpicks or ideas. Nice to see it works now and is more complete, shame about the less than pretty stuff.

@maurolacy
Copy link
Contributor

maurolacy commented Oct 29, 2021

No worries. I'll document this a little when finding some time, and I think we can merge it. And handle improvements in follow-ups.

@ethanfrey
Copy link
Member

I agree this is mergeable.
Some follow up is clear.

Let's find time to discuss architecture for the larger follow-up issues before tackling them (but feel free to make issues in the meantime)

@ethanfrey
Copy link
Member

ethanfrey commented Nov 15, 2021

Oh, man, there are like 5 PRs built on top of this, right?

Can you merge main in sometime (no more rebase possible)

@maurolacy
Copy link
Contributor

Yes, I'll rebase everything once the int key changes are merged to main.

@maurolacy
Copy link
Contributor

Can you merge main in sometime (no more rebase possible)

Just rebased it without issues.

@ethanfrey
Copy link
Member

Cool. Want to merge this first, then the other branches on top?

@maurolacy
Copy link
Contributor

Sure.

@maurolacy
Copy link
Contributor

OK, all the follow-up issues / comments are addressed, are changes are merged / integrated here.

This is also rebased from master.

Let's merge this.

@uint
Copy link
Contributor Author

uint commented Nov 24, 2021

LGTM overall. Does what we need it to!

@maurolacy maurolacy merged commit bdc9965 into main Nov 24, 2021
@maurolacy maurolacy deleted the 461-uniqueindex-range_de branch November 24, 2021 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants