Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Reorganise collection description storage #1988

Merged

Conversation

AndrewSisley
Copy link
Contributor

@AndrewSisley AndrewSisley commented Oct 20, 2023

Relevant issue(s)

Resolves #1964 #1913

Description

Reorganises collection description storage so that it actually makes sense.

The existing key/value setup has been replaced with the following keys:

[Collection.ID] => json // Json should be stored against an immutable index, i.e. not against `Name`
[Collection.Name] => [Collection.ID]  // This means collection names must be unique

We also need the following for the short-term, before Collection.GlobalID gets added, as otherwise if a collection is updated to a new schema version, we lose track of what collection it could be. The locations that use this have been linked to #1085

[Collection.SchemaVersionId]/[Collection.ID] => nil

With this change the storage setup should fully support multiple collections from a single schema, however this is blocked off by setDefaultSchemaVersion (it doesn't allow users to explicitly specify a collection, so atm there is no way to try and do this). Later when we introduce patchCollection we will need to be mindful of this and block off user's ability to do this (if done before #1085).

Strongly recommend reviewing commit by commit, I've tried to add a fair bit of context in each commit message.

@AndrewSisley AndrewSisley added area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components action/no-benchmark Skips the action that runs the benchmark. labels Oct 20, 2023
@AndrewSisley AndrewSisley added this to the DefraDB v0.8 milestone Oct 20, 2023
@AndrewSisley AndrewSisley requested a review from a team October 20, 2023 23:13
@AndrewSisley AndrewSisley self-assigned this Oct 20, 2023
@AndrewSisley AndrewSisley force-pushed the 1964-col-desc-storage branch from a04298c to 6039528 Compare October 23, 2023 14:26
@codecov
Copy link

codecov bot commented Oct 23, 2023

Codecov Report

Attention: 118 lines in your changes are missing coverage. Please review.

Comparison is base (0f6acba) 74.16% compared to head (7ba9245) 73.99%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1988      +/-   ##
===========================================
- Coverage    74.16%   73.99%   -0.18%     
===========================================
  Files          246      247       +1     
  Lines        24426    24548     +122     
===========================================
+ Hits         18115    18162      +47     
- Misses        5097     5150      +53     
- Partials      1214     1236      +22     
Flag Coverage Δ
all-tests 73.99% <59.45%> (-0.18%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
net/peer_collection.go 55.08% <100.00%> (-2.95%) ⬇️
net/process.go 88.24% <100.00%> (ø)
db/schema.go 84.36% <66.67%> (-1.01%) ⬇️
core/key.go 88.34% <85.71%> (-0.62%) ⬇️
db/description/errors.go 0.00% <0.00%> (ø)
db/description/schema.go 57.98% <54.84%> (+1.67%) ⬆️
db/collection.go 67.80% <64.77%> (-1.57%) ⬇️
db/description/collection.go 50.00% <50.00%> (ø)

... and 5 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f6acba...7ba9245. Read the comment docs.

@AndrewSisley AndrewSisley force-pushed the 1964-col-desc-storage branch from 6039528 to 455692e Compare October 23, 2023 18:04
core/key.go Outdated
DATASTORE_DOC_VERSION_FIELD_ID = "v"
REPLICATOR = "/replicator/id"
P2P_COLLECTION = "/p2p/collection"
COLLECTION = "/collection/value"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I think COLLECTION_NUMBER = "/collection/number" would be a bit more representative in this case. I had mentionned previously that we could have collection.Index instead of the current collection.ID so that we could use collection.ID for the global ID. The problem is that it would then create confusion with the secondary indexes. This is why I'm suggesting to use collection.Number.

suggestion: I think having COLLECTION_ID = "/collection/id" would be important to reference the collection with it's global ID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why collection/number and not collection/index?
I assume because it would clash with the db indexes on collection (i.e. what Islam has been working on).

thought: colllection.Index will also be a confusing name for that reason. Maybe we can find something else when we get round to looking at that.

Then again, I think this key prefix should align with that property, which means it should either be left as collection/id for now, or we need to solve the collection.ID property name problem here/now. Worth bearing in mind is that changing this is a breaking change, so it is better if we don't change it every other day.

praise: Thanks for bringing this up, is a good spot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that it would then create confusion with the secondary indexes. This is why I'm suggesting to use collection.Number.

Did I misread, or did you edit? 😅

I missed this when replying, number feels a bit strange, but I have no better suggestions yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: colllection.Index will also be a confusing name for that reason. Maybe we can find something else when we get round to looking at that.

collection.Number. would be fine as noted above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then again, I think this key prefix should align with that property, which means it should either be left as collection/id for now, or we need to solve the collection.ID property name problem here/now. Worth bearing in mind is that changing this is a breaking change, so it is better if we don't change it every other day.

I really think collection.ID should be used for the global ID.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I misread, or did you edit? 😅

I did not edit :)

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in stand-up, we are undecided but we set a deadline for noon Tuesday Ottawa time to pick something. Number is still a bit weird, but otherwise received little criticism (unlike LocalID, which could be confusing for a few reasons. If no consensus is reached by the deadline I will probably rename to Number.

  • Wait/discuss/rename to Number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving as is for now, hopefully this can be removed in #1085

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do something about the actual key string and though? Value isn't representative of the key.

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, true sorry - I'll change it to collection/id

  • collection/id

Comment on lines +35 to +49
err = txn.Systemstore().Put(ctx, key.ToDS(), buf)
if err != nil {
return client.CollectionDescription{}, err
}

idBuf, err := json.Marshal(desc.ID)
if err != nil {
return client.CollectionDescription{}, err
}

nameKey := core.NewCollectionNameKey(desc.Name)
err = txn.Systemstore().Put(ctx, nameKey.ToDS(), idBuf)
if err != nil {
return client.CollectionDescription{}, err
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I would do the opposite here. I would store the description on the name key instead of the collection number key because when querying, users will do so using the collection name. This will prevent the extra key lookup on queries which is where speed is most important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do disagree with you here, for the following reasons:

  • Name is mutable, and if the description is stored against it, renaming the collection becomes more expensive and error prone as we have to move the definition and re-index everything.
  • The performance cost should be very very small, it is an extra single-point lookup.
  • I find it conceptually to make more sense to store the thing against it's fixed ID, and index that, than to store it against a mutable property on it, and then have everything else reference the mutable property.

I think the maintainability of storing against the ID far outweigh the performance gains here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name is mutable, and if the description is stored against it, renaming the collection becomes more expensive and error prone as we have to move the definition and re-index everything.

That is an excellent point. I should have thought about that. Alright let's keep it as is then :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I should have noted this one in the PR description I think.

Copy link
Member

@shahzadlone shahzadlone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enjoyed all the improvement nuggets, cheers! As for the decision on picking a name for the Sequence, Index, ID, Number situation will the final decision be done in a later PR?

@AndrewSisley
Copy link
Contributor Author

I enjoyed all the improvement nuggets, cheers! As for the decision on picking a name for the Sequence, Index, ID, Number situation will the final decision be done in a later PR?

Nice :)

Yes, that name can be revisited/removed when GlobalID gets implemented.

I think we may have been dropping unpatched collections from the GQL types here.

EDIT: I was incorrect, this was not a bug, it is working fine as we were passing in all the schema/cols anyway.  I do however think this change is an improvement anyway and should be kept.
These two tests fail for mock-related reasons in a later commit, so I am removing them.  Index related code has not changed.
- Some index uint test setup code was removed, it no longer compiles, and removing it does not cause any tests to fail.
- A P2P (un)subscribe error was improved, and the location that gererated the error was moved due to the change in how collection/schema are fetched.
- The order in which collections are returned for backup/export has changed, due to the changes in the collection store keys, and so the tests have changed - the only change to those tests is that User data is now returned/expected before Book - the data within each collection-set has not been modified.

The best explaination for how the storage has changed is either the PR description or the issue.
This should have been changed in an earlier PR really, but now it is extra visible/ugly.  These prefixes should now all make sense.
@AndrewSisley AndrewSisley force-pushed the 1964-col-desc-storage branch from 455692e to 355bd56 Compare October 24, 2023 15:04
"github.com/sourcenetwork/defradb/datastore"
)

func SaveCollection(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: would be nice to have a short documentation for this and other methods

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A was going to point that out too. All the public functions should have documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the public functions should have documentation.

These arent public, they are internal to defra

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are very much public at the moment though. They will be accessible to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not consider these to be public, and neither does at least one other dev. I also think that conversation is out of scope here too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly disagree with this one. It's either public or it's not. The functions are accessible by users and will be found in the go docs. They should be documented. If we want to keep them internal, they can be under an internal package.

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed over discord, the internal package conversation will be re-opened outside of this PR. I'll add some docs.

  • Add docs for funcs

txn datastore.Txn,
desc client.CollectionDescription,
) (client.CollectionDescription, error) {
buf, err := json.Marshal(desc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: buffer implies some kind of preallocated memory or something that deals with dynamic memory allocation, but this is just a slice of bytes.
Why not to call it descJSON or descBytes?
The same applies to idBuf

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A go standard is also just to use b in a case like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that short var names like this are only acceptable for very short functions (I'd say up to 6 lines of code) otherwise it doesn't do any good to readability.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me as long as you can see the whole function on a smallish screen. then it's fine. That probably means around 50 lines. But definitely more than 6. As long as there aren't many more byte arrays defined. That being said, it was only a option. Your suggestions are good as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use buf all over the place in our codebase, and I'm pretty sure I've seen it for stuff like this in a fair few other codebases. Islam, internally within json.Marshal the return value probably does fit your some kind of preallocated memory description and that's how I've seen this naming come from.

Let me know if either of you think this is important, as I don't, and without clear consensus between you guys I'll leave it as is otherwise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buf is misrepresenting what this is but we do have a few places where we use it probably by mistake including one that Islam wrote
Screenshot 2023-10-24 at 2 42 21 PM
but most were added by Andy in #1965. buf is usually used for bytes.Buffer as we do a lot in the code base.

My opinion is that it's the wrong name for the variable but I wouldn't say its important to me in this case. It was good to point it out though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but most were added by Andy in #1965.

nitpick: If you look at db/collection.go in that PR, you should see that these variables were fooBuf before :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough 🙂

PRIMARY_KEY = "/pk"
DATASTORE_DOC_VERSION_FIELD_ID = "v"
REPLICATOR = "/replicator/id"
P2P_COLLECTION = "/p2p/collection"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: You didn't want to add the following as well as mentioned on the issue?

SCHEMA_ROOT_COLLECTION            = "/schema/root/collection" // "{rootSchemaID}/{collectionName} -> someEmptyMarker
SCHEMA_VERSION_COLLECTION         = "/schema/version/collection" // "{schemaVersionID}/{collectionName} -> someEmptyMarker

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that is out of scope, and I think it would be dead code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a public utility so not dead code. But I agree that it can be out-of-scope.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that you're already doing the same as SCHEMA_VERSION_COLLECTION with SCHEMA_VERSION.

colIDs = append(colIDs, colSchemaVersionKey.CollectionID)
}

cols := make([]client.CollectionDescription, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: make([]client.CollectionDescription, 0, colIDs)

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cheers :)

  • make([]client.CollectionDescription, len(colIDs))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: it's great that you moved all these highly cohesive functions into a separate package. It makes the rest of code much cleaner.

Copy link
Contributor

@islamaliev islamaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great job, Andy!

Comment on lines +153 to +160
newCollections, err := db.getAllCollections(ctx, txn)
if err != nil {
return err
}

newCollections = append(newCollections, col.Definition())
definitions := make([]client.CollectionDefinition, len(newCollections))
for i, col := range newCollections {
definitions[i] = col.Definition()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tought: It would be nice to have a method db.getAllCollectionDefinitions that would return the definitions directly instead of having to loop through collections to get what we actually want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I think that would only be used here no? And if it wasn't it may end up coupling totally unrelated parts of the codebase together - is not like this loop is particularly complex and in need of standardisation.

Leaving as is unless significant push back received.


// The need for this key is temporary, we should replace it with the global collection ID
// https://github.com/sourcenetwork/defradb/issues/1085
schemaVersionKey := core.NewCollectionSchemaVersionKey(desc.SchemaVersionID, desc.ID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tought: In its current form, this should probably not be removed when adding global collection ID since it will enable some nice DB admin feature where one can check which collections are at a given schema version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replacement would handle that, there would be no loss of user/admin functionality.

@AndrewSisley AndrewSisley merged commit d8eaaef into sourcenetwork:develop Oct 24, 2023
29 checks passed
@AndrewSisley AndrewSisley deleted the 1964-col-desc-storage branch October 24, 2023 20:11
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
## Relevant issue(s)

Resolves sourcenetwork#1964 sourcenetwork#1913 

## Description

Reorganises collection description storage so that it actually makes
sense.

The existing key/value setup has been replaced with the following keys:
```
[Collection.ID] => json // Json should be stored against an immutable index, i.e. not against `Name`
[Collection.Name] => [Collection.ID]  // This means collection names must be unique
```
We also need the following for the short-term, before
Collection.GlobalID gets added, as otherwise if a collection is updated
to a new schema version, we lose track of what collection it could be.
The locations that use this have been linked to
sourcenetwork#1085
```
[Collection.SchemaVersionId]/[Collection.ID] => nil
```

With this change the storage setup should fully support multiple
collections from a single schema, however this is blocked off by
`setDefaultSchemaVersion` (it doesn't allow users to explicitly specify
a collection, so atm there is no way to try and do this). Later when we
introduce `patchCollection` we will need to be mindful of this and block
off user's ability to do this (if done before sourcenetwork#1085).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/no-benchmark Skips the action that runs the benchmark. area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Detangle collection storage
4 participants