-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: Reorganise collection description storage #1988
refactor: Reorganise collection description storage #1988
Conversation
a04298c
to
6039528
Compare
Codecov ReportAttention:
@@ Coverage Diff @@
## develop #1988 +/- ##
===========================================
- Coverage 74.16% 73.99% -0.18%
===========================================
Files 246 247 +1
Lines 24426 24548 +122
===========================================
+ Hits 18115 18162 +47
- Misses 5097 5150 +53
- Partials 1214 1236 +22
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 5 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
6039528
to
455692e
Compare
core/key.go
Outdated
DATASTORE_DOC_VERSION_FIELD_ID = "v" | ||
REPLICATOR = "/replicator/id" | ||
P2P_COLLECTION = "/p2p/collection" | ||
COLLECTION = "/collection/value" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I think COLLECTION_NUMBER = "/collection/number"
would be a bit more representative in this case. I had mentionned previously that we could have collection.Index
instead of the current collection.ID
so that we could use collection.ID
for the global ID. The problem is that it would then create confusion with the secondary indexes. This is why I'm suggesting to use collection.Number
.
suggestion: I think having COLLECTION_ID = "/collection/id"
would be important to reference the collection with it's global ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why collection/number
and not collection/index
?
I assume because it would clash with the db indexes on collection (i.e. what Islam has been working on).
thought: colllection.Index
will also be a confusing name for that reason. Maybe we can find something else when we get round to looking at that.
Then again, I think this key prefix should align with that property, which means it should either be left as collection/id
for now, or we need to solve the collection.ID
property name problem here/now. Worth bearing in mind is that changing this is a breaking change, so it is better if we don't change it every other day.
praise: Thanks for bringing this up, is a good spot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that it would then create confusion with the secondary indexes. This is why I'm suggesting to use collection.Number.
Did I misread, or did you edit? 😅
I missed this when replying, number
feels a bit strange, but I have no better suggestions yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: colllection.Index will also be a confusing name for that reason. Maybe we can find something else when we get round to looking at that.
collection.Number
. would be fine as noted above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then again, I think this key prefix should align with that property, which means it should either be left as collection/id for now, or we need to solve the collection.ID property name problem here/now. Worth bearing in mind is that changing this is a breaking change, so it is better if we don't change it every other day.
I really think collection.ID
should be used for the global ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I misread, or did you edit? 😅
I did not edit :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed in stand-up, we are undecided but we set a deadline for noon Tuesday Ottawa time to pick something. Number
is still a bit weird, but otherwise received little criticism (unlike LocalID
, which could be confusing for a few reasons. If no consensus is reached by the deadline I will probably rename to Number
.
- Wait/discuss/rename to
Number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving as is for now, hopefully this can be removed in #1085
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do something about the actual key string and though? Value isn't representative of the key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, true sorry - I'll change it to collection/id
-
collection/id
err = txn.Systemstore().Put(ctx, key.ToDS(), buf) | ||
if err != nil { | ||
return client.CollectionDescription{}, err | ||
} | ||
|
||
idBuf, err := json.Marshal(desc.ID) | ||
if err != nil { | ||
return client.CollectionDescription{}, err | ||
} | ||
|
||
nameKey := core.NewCollectionNameKey(desc.Name) | ||
err = txn.Systemstore().Put(ctx, nameKey.ToDS(), idBuf) | ||
if err != nil { | ||
return client.CollectionDescription{}, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I would do the opposite here. I would store the description on the name key instead of the collection number key because when querying, users will do so using the collection name. This will prevent the extra key lookup on queries which is where speed is most important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do disagree with you here, for the following reasons:
- Name is mutable, and if the description is stored against it, renaming the collection becomes more expensive and error prone as we have to move the definition and re-index everything.
- The performance cost should be very very small, it is an extra single-point lookup.
- I find it conceptually to make more sense to store the thing against it's fixed ID, and index that, than to store it against a mutable property on it, and then have everything else reference the mutable property.
I think the maintainability of storing against the ID far outweigh the performance gains here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name is mutable, and if the description is stored against it, renaming the collection becomes more expensive and error prone as we have to move the definition and re-index everything.
That is an excellent point. I should have thought about that. Alright let's keep it as is then :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I should have noted this one in the PR description I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I enjoyed all the improvement nuggets, cheers! As for the decision on picking a name for the Sequence, Index, ID, Number situation will the final decision be done in a later PR?
Nice :) Yes, that name can be revisited/removed when GlobalID gets implemented. |
I think we may have been dropping unpatched collections from the GQL types here. EDIT: I was incorrect, this was not a bug, it is working fine as we were passing in all the schema/cols anyway. I do however think this change is an improvement anyway and should be kept.
These two tests fail for mock-related reasons in a later commit, so I am removing them. Index related code has not changed.
- Some index uint test setup code was removed, it no longer compiles, and removing it does not cause any tests to fail. - A P2P (un)subscribe error was improved, and the location that gererated the error was moved due to the change in how collection/schema are fetched. - The order in which collections are returned for backup/export has changed, due to the changes in the collection store keys, and so the tests have changed - the only change to those tests is that User data is now returned/expected before Book - the data within each collection-set has not been modified. The best explaination for how the storage has changed is either the PR description or the issue.
This should have been changed in an earlier PR really, but now it is extra visible/ugly. These prefixes should now all make sense.
455692e
to
355bd56
Compare
"github.com/sourcenetwork/defradb/datastore" | ||
) | ||
|
||
func SaveCollection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: would be nice to have a short documentation for this and other methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A was going to point that out too. All the public functions should have documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the public functions should have documentation.
These arent public, they are internal to defra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are very much public at the moment though. They will be accessible to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not consider these to be public, and neither does at least one other dev. I also think that conversation is out of scope here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly disagree with this one. It's either public or it's not. The functions are accessible by users and will be found in the go docs. They should be documented. If we want to keep them internal, they can be under an internal
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed over discord, the internal
package conversation will be re-opened outside of this PR. I'll add some docs.
- Add docs for funcs
txn datastore.Txn, | ||
desc client.CollectionDescription, | ||
) (client.CollectionDescription, error) { | ||
buf, err := json.Marshal(desc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: buffer implies some kind of preallocated memory or something that deals with dynamic memory allocation, but this is just a slice of bytes.
Why not to call it descJSON
or descBytes
?
The same applies to idBuf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A go standard is also just to use b
in a case like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that short var names like this are only acceptable for very short functions (I'd say up to 6 lines of code) otherwise it doesn't do any good to readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me as long as you can see the whole function on a smallish screen. then it's fine. That probably means around 50 lines. But definitely more than 6. As long as there aren't many more byte arrays defined. That being said, it was only a option. Your suggestions are good as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use buf
all over the place in our codebase, and I'm pretty sure I've seen it for stuff like this in a fair few other codebases. Islam, internally within json.Marshal
the return value probably does fit your some kind of preallocated memory
description and that's how I've seen this naming come from.
Let me know if either of you think this is important, as I don't, and without clear consensus between you guys I'll leave it as is otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buf
is misrepresenting what this is but we do have a few places where we use it probably by mistake including one that Islam wrote
but most were added by Andy in #1965. buf
is usually used for bytes.Buffer
as we do a lot in the code base.
My opinion is that it's the wrong name for the variable but I wouldn't say its important to me in this case. It was good to point it out though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but most were added by Andy in #1965.
nitpick: If you look at db/collection.go
in that PR, you should see that these variables were fooBuf
before :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough 🙂
PRIMARY_KEY = "/pk" | ||
DATASTORE_DOC_VERSION_FIELD_ID = "v" | ||
REPLICATOR = "/replicator/id" | ||
P2P_COLLECTION = "/p2p/collection" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: You didn't want to add the following as well as mentioned on the issue?
SCHEMA_ROOT_COLLECTION = "/schema/root/collection" // "{rootSchemaID}/{collectionName} -> someEmptyMarker
SCHEMA_VERSION_COLLECTION = "/schema/version/collection" // "{schemaVersionID}/{collectionName} -> someEmptyMarker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that is out of scope, and I think it would be dead code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a public utility so not dead code. But I agree that it can be out-of-scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that you're already doing the same as SCHEMA_VERSION_COLLECTION
with SCHEMA_VERSION
.
db/description/collection.go
Outdated
colIDs = append(colIDs, colSchemaVersionKey.CollectionID) | ||
} | ||
|
||
cols := make([]client.CollectionDescription, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: make([]client.CollectionDescription, 0, colIDs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers :)
-
make([]client.CollectionDescription, len(colIDs))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: it's great that you moved all these highly cohesive functions into a separate package. It makes the rest of code much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great job, Andy!
newCollections, err := db.getAllCollections(ctx, txn) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
newCollections = append(newCollections, col.Definition()) | ||
definitions := make([]client.CollectionDefinition, len(newCollections)) | ||
for i, col := range newCollections { | ||
definitions[i] = col.Definition() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tought: It would be nice to have a method db.getAllCollectionDefinitions
that would return the definitions directly instead of having to loop through collections to get what we actually want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I think that would only be used here no? And if it wasn't it may end up coupling totally unrelated parts of the codebase together - is not like this loop is particularly complex and in need of standardisation.
Leaving as is unless significant push back received.
|
||
// The need for this key is temporary, we should replace it with the global collection ID | ||
// https://github.com/sourcenetwork/defradb/issues/1085 | ||
schemaVersionKey := core.NewCollectionSchemaVersionKey(desc.SchemaVersionID, desc.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tought: In its current form, this should probably not be removed when adding global collection ID since it will enable some nice DB admin feature where one can check which collections are at a given schema version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The replacement would handle that, there would be no loss of user/admin functionality.
## Relevant issue(s) Resolves sourcenetwork#1964 sourcenetwork#1913 ## Description Reorganises collection description storage so that it actually makes sense. The existing key/value setup has been replaced with the following keys: ``` [Collection.ID] => json // Json should be stored against an immutable index, i.e. not against `Name` [Collection.Name] => [Collection.ID] // This means collection names must be unique ``` We also need the following for the short-term, before Collection.GlobalID gets added, as otherwise if a collection is updated to a new schema version, we lose track of what collection it could be. The locations that use this have been linked to sourcenetwork#1085 ``` [Collection.SchemaVersionId]/[Collection.ID] => nil ``` With this change the storage setup should fully support multiple collections from a single schema, however this is blocked off by `setDefaultSchemaVersion` (it doesn't allow users to explicitly specify a collection, so atm there is no way to try and do this). Later when we introduce `patchCollection` we will need to be mindful of this and block off user's ability to do this (if done before sourcenetwork#1085).
Relevant issue(s)
Resolves #1964 #1913
Description
Reorganises collection description storage so that it actually makes sense.
The existing key/value setup has been replaced with the following keys:
We also need the following for the short-term, before Collection.GlobalID gets added, as otherwise if a collection is updated to a new schema version, we lose track of what collection it could be. The locations that use this have been linked to #1085
With this change the storage setup should fully support multiple collections from a single schema, however this is blocked off by
setDefaultSchemaVersion
(it doesn't allow users to explicitly specify a collection, so atm there is no way to try and do this). Later when we introducepatchCollection
we will need to be mindful of this and block off user's ability to do this (if done before #1085).Strongly recommend reviewing commit by commit, I've tried to add a fair bit of context in each commit message.