Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detangle collection storage #1964

Closed
Tracked by #1913
AndrewSisley opened this issue Oct 13, 2023 · 4 comments · Fixed by #1988
Closed
Tracked by #1913

Detangle collection storage #1964

AndrewSisley opened this issue Oct 13, 2023 · 4 comments · Fixed by #1988
Assignees
Labels
area/collections Related to the collections system code quality Related to improving code quality
Milestone

Comments

@AndrewSisley
Copy link
Contributor

AndrewSisley commented Oct 13, 2023

The way we store collections makes no sense at the moment, and needs to be sorted out.

It should probably be done after #1963 though.

There will be the following keys:

[Collection.ID] => json // Json should be stored against an immutable index, i.e. not against `Name`
[Collection.Name] => [Collection.ID]  // This means collection names must be unique

We also need the following for the short-term, before Collection.GlobalID gets added, as otherwise if a collection is updated to a new schema version, we lose track of what collection it could (could as, this only narrows it down) be.

[Collection.SchemaVersionId]/[Collection.ID] => nil
@AndrewSisley AndrewSisley added area/collections Related to the collections system code quality Related to improving code quality labels Oct 13, 2023
@jsimnz
Copy link
Member

jsimnz commented Oct 13, 2023

Can you add context here on what "no sense". The other issue regarding fetching has great additional info, let's spruce this one up too ;)

@AndrewSisley
Copy link
Contributor Author

Can you add context here on what "no sense". The other issue regarding fetching has great additional info, let's spruce this one up too ;)

I don't plan on letting this issue live for very long, and it is both tricky and inexact to explain, and in some sense already solved in the WIP branch 1963-collection-feat-detangle

The keys and their values used to persist and index the collection description make no sense. They are tangled, and with poor names and prefixes. They blend schema with collection very heavily.

@jsimnz
Copy link
Member

jsimnz commented Oct 13, 2023

sounds good!

@fredcarle
Copy link
Collaborator

Currently we have

COLLECTION                        = "/collection/names" // {collectionName} -> schemaVersionID
COLLECTION_SCHEMA                 = "/collection/schema" // {rootSchema} -> schemaVersionID
COLLECTION_SCHEMA_VERSION         = "/collection/version/v" // {schemaVersionID} -> client.CollectionDescription
COLLECTION_SCHEMA_VERSION_HISTORY = "/collection/version/h" // {rootSchemaID}/{previousSchemaVersionID} -> schemaVersionID

SCHEMA_VERSION                    = "/schema/version" // {schemaVersionID} -> client.SchemaDescription 

This is what I think we should have:

COLLECTION_NAME                   = "/collection/name" // {collectionName} -> client.CollectionDescription
COLLECTION_ID                     = "/collection/id" // {collectionID} -> collectionName

SCHEMA_VERSION                    = "/schema/version" // {schemaVersionID} -> client.SchemaDescription 
SCHEMA_VERSION_HISTORY            = "/schema/version/h" // {rootSchemaID}/{previousSchemaVersionID} -> schemaVersionID
// maybe also
SCHEMA_ROOT_COLLECTION            = "/schema/root/collection" // "{rootSchemaID}/{collectionName} -> someEmptyMarker
SCHEMA_VERSION_COLLECTION         = "/schema/version/collection" // "{schemaVersionID}/{collectionName} -> someEmptyMarker

User will interact most often with collections via their names so it makes sense to have COLLECTION_NAME point directly to the collection description.

The P2P system will interact with collections using their collection ID. In this case we have COLLECTION_ID pointing to the local collection name which we can use to get the collection description.

The schema description can be stored via SCHEMA_VERSION and as we have now, we can store the history via SCHEMA_VERSION_HISTORY

There are 2 other avenues that we may want to cover:
1- Getting all collections that were created with a given root schema.
2- Getting all collections that are at a given schema version.

AndrewSisley added a commit that referenced this issue Oct 17, 2023
## Relevant issue(s)

Resolves #1958

## Description

Removes CollectionDescription.Schema.

Also splits the storage of schema out from within collection.

The storage of schema has been broken out to a new sub-package of db, at
the moment it is a very simple file, but collection will be moved there
in #1964. I was planning
on doing that in this PR (in part, to provide context for reviewers, as
atm it is basically a single-file package), but it proved to be
non-trivial due to some existing messiness in that space and was broken
out to two more tasks.

I also wish for stuff in that directory to eventually follow a
repository-like pattern, where stuff is cached (within a context/txn's
context) instead of fetching from store on each call.

Moving this stuff out to a new directory instead of preserving it in the
(already very large) db directory should make both db and the new
sub-package a fair bit more cohesive and easier to read.
nasdf pushed a commit to nasdf/defradb that referenced this issue Oct 18, 2023
## Relevant issue(s)

Resolves sourcenetwork#1958

## Description

Removes CollectionDescription.Schema.

Also splits the storage of schema out from within collection.

The storage of schema has been broken out to a new sub-package of db, at
the moment it is a very simple file, but collection will be moved there
in sourcenetwork#1964. I was planning
on doing that in this PR (in part, to provide context for reviewers, as
atm it is basically a single-file package), but it proved to be
non-trivial due to some existing messiness in that space and was broken
out to two more tasks.

I also wish for stuff in that directory to eventually follow a
repository-like pattern, where stuff is cached (within a context/txn's
context) instead of fetching from store on each call.

Moving this stuff out to a new directory instead of preserving it in the
(already very large) db directory should make both db and the new
sub-package a fair bit more cohesive and easier to read.
@AndrewSisley AndrewSisley self-assigned this Oct 19, 2023
@AndrewSisley AndrewSisley added this to the DefraDB v0.8 milestone Oct 19, 2023
AndrewSisley added a commit that referenced this issue Oct 19, 2023
## Relevant issue(s)

Resolves #1963

## Description

Changes GetCollectionBySchemaFoo function signatures to return many
Collections, instead of single values.

Conceptually these functions were broken and relied on the fact that we
cannot yet link multiple collections to the same schema(version)(s).
This changes their signatures and the code that calls them to reflect
what they should do long term.

The code that calls these functions now hosts the 'broken' code. These
will still need to change once we add global IDs.

The implementation of these functions will change later in
#1964.
AndrewSisley added a commit that referenced this issue Oct 24, 2023
## Relevant issue(s)

Resolves #1964 #1913 

## Description

Reorganises collection description storage so that it actually makes
sense.

The existing key/value setup has been replaced with the following keys:
```
[Collection.ID] => json // Json should be stored against an immutable index, i.e. not against `Name`
[Collection.Name] => [Collection.ID]  // This means collection names must be unique
```
We also need the following for the short-term, before
Collection.GlobalID gets added, as otherwise if a collection is updated
to a new schema version, we lose track of what collection it could be.
The locations that use this have been linked to
#1085
```
[Collection.SchemaVersionId]/[Collection.ID] => nil
```

With this change the storage setup should fully support multiple
collections from a single schema, however this is blocked off by
`setDefaultSchemaVersion` (it doesn't allow users to explicitly specify
a collection, so atm there is no way to try and do this). Later when we
introduce `patchCollection` we will need to be mindful of this and block
off user's ability to do this (if done before #1085).
shahzadlone pushed a commit to shahzadlone/defradb that referenced this issue Feb 23, 2024
## Relevant issue(s)

Resolves sourcenetwork#1958

## Description

Removes CollectionDescription.Schema.

Also splits the storage of schema out from within collection.

The storage of schema has been broken out to a new sub-package of db, at
the moment it is a very simple file, but collection will be moved there
in sourcenetwork#1964. I was planning
on doing that in this PR (in part, to provide context for reviewers, as
atm it is basically a single-file package), but it proved to be
non-trivial due to some existing messiness in that space and was broken
out to two more tasks.

I also wish for stuff in that directory to eventually follow a
repository-like pattern, where stuff is cached (within a context/txn's
context) instead of fetching from store on each call.

Moving this stuff out to a new directory instead of preserving it in the
(already very large) db directory should make both db and the new
sub-package a fair bit more cohesive and easier to read.
shahzadlone pushed a commit to shahzadlone/defradb that referenced this issue Feb 23, 2024
…work#1984)

## Relevant issue(s)

Resolves sourcenetwork#1963

## Description

Changes GetCollectionBySchemaFoo function signatures to return many
Collections, instead of single values.

Conceptually these functions were broken and relied on the fact that we
cannot yet link multiple collections to the same schema(version)(s).
This changes their signatures and the code that calls them to reflect
what they should do long term.

The code that calls these functions now hosts the 'broken' code. These
will still need to change once we add global IDs.

The implementation of these functions will change later in
sourcenetwork#1964.
shahzadlone pushed a commit to shahzadlone/defradb that referenced this issue Feb 23, 2024
## Relevant issue(s)

Resolves sourcenetwork#1964 sourcenetwork#1913 

## Description

Reorganises collection description storage so that it actually makes
sense.

The existing key/value setup has been replaced with the following keys:
```
[Collection.ID] => json // Json should be stored against an immutable index, i.e. not against `Name`
[Collection.Name] => [Collection.ID]  // This means collection names must be unique
```
We also need the following for the short-term, before
Collection.GlobalID gets added, as otherwise if a collection is updated
to a new schema version, we lose track of what collection it could be.
The locations that use this have been linked to
sourcenetwork#1085
```
[Collection.SchemaVersionId]/[Collection.ID] => nil
```

With this change the storage setup should fully support multiple
collections from a single schema, however this is blocked off by
`setDefaultSchemaVersion` (it doesn't allow users to explicitly specify
a collection, so atm there is no way to try and do this). Later when we
introduce `patchCollection` we will need to be mindful of this and block
off user's ability to do this (if done before sourcenetwork#1085).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/collections Related to the collections system code quality Related to improving code quality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants