*: Add objstorage.shared.Storage interface for storing sstables #2231
Conversation
Reviewable status: 0 of 15 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
compaction.go
line 2824 at r1 (raw file):
```go
	return nil, pendingOutputs, err
}
if err := vfs.CopyAcrossFS(fs, origPath, sharedFS, destPath); err != nil {
```
Would it be more efficient to write out the file directly on shared storage? Maybe leave a TODO
compaction.go
line 3035 at r1 (raw file):
```go
// obsoleteFile holds information about a file that needs to be deleted soon.
type obsoleteFile struct {
```
[nit] This looks like it should just contain a `fileInfo` (either as a field or embedded).
options.go
line 634 at r1 (raw file):
```go
// however Open() can set it to the last-set Unique ID if it finds one in
// the Options file.
UniqueID uint64
```
[nit] "unique id" sounds very generic, maybe DB ID or Store ID
internal/manifest/version.go
line 1112 at r1 (raw file):
```go
for f := iter.First(); f != nil; f = iter.Next() {
	fileFS := fs
	path := base.MakeFilepath(fs, dirname, base.FileTypeTable, f.FileNum)
```
It feels like this logic needs to be encapsulated or we'll end up duplicating this kind of code in many places. Maybe a struct that contains `fs`, `sharedFS`, `sharedDir` and implements a function that returns the path and the fs given a `fileInfo`.
Reviewed 7 of 15 files at r1, all commit messages.
Reviewable status: 7 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
compaction.go
line 2815 at r1 (raw file):
```go
//
// TODO(bilal): Move this outputLevel >= 5 logic for determining whether
// a file should become shared to elsewhere when it's more complex.
```
Should this be hidden in the `sharedFS` implementation?
To tolerate transient unavailability of the shared storage, it would absorb all writes locally and, when the file was closed, would copy it over. And if the copy did not succeed, it would keep track in local persistent storage of what pending copies need to be done.
What I am suggesting may not be the right abstraction, but I do think we need to decompose the sub-systems in a way that lets us manage complexity. We do a decomposition like this for the `encryptedFS`, which has its own "manifest".
options.go
line 626 at r1 (raw file):
```go
// not fully owned by this Pebble instance, necessitating more coordination
// for deletes. It is also expected to have slower read/write performance
// than FS.
```
How about being a bit more explicit in this comment about what it means to "not fully own"? Something like:
SharedFS is for writing files that can later be read by others that are given the name of the file. There is a single writer for a file. Only sstables should be put in SharedFS.
The "necessitating more coordination" part is not clear. From the perspective of this interface, which is for use by Pebble, could we hide the coordination part behind an `ImportFile(filename string)` method in a `SharedFS` interface that extends `vfs.FS`? So the contract would be that to share a file, the sender (the original writer, or someone who has imported) would send over the filename and wait for the receiver to tell the sender that it has called `ImportFile`. Once that has happened, the sender can safely delete the file since the receiver has taken a reference. We will of course need an out-of-band mechanism to do cleanup when a store that has imported (or wrote originally) permanently disappears, but that cleanup mechanism would be a component of the implementation and not part of this interface.
options.go
line 630 at r1 (raw file):
```go
// UniqueID is a unique ID that's generated for new Pebble instances and
// serialized into the Options file. Used to disambiguate this instance's
```
the files written by this instance, yes?
table_cache.go
line 274 at r1 (raw file):
```go
type tableCacheKey struct {
	cacheID uint64
	fileNum FileNum
```
Doesn't this need to change, since `fileNum` is no longer sufficient as the key? And for the block cache too?
internal/manifest/version.go
line 1112 at r1 (raw file):
Previously, RaduBerinde wrote…
It feels like this logic needs to be encapsulated or we'll end up duplicating this kind of code in many places. Maybe a struct that contains `fs`, `sharedFS`, `sharedDir` and implements a function that returns the path and the fs given a `fileInfo`.
+1
internal/manifest/version_edit.go
line 308 at r1 (raw file):
```go
}
onSharedFS = (field[0] == 1)
uniqueID, err = d.readUvarint()
```
Is it fair to say that the sstable identifier is now (fileNum: uint64, onSharedFS: bool, creatorUniqueID: uint64)? If yes, it would be worth highlighting with a code comment in the `FileMetadata` declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 7 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
internal/manifest/version_edit.go
line 308 at r1 (raw file):
Previously, sumeerbhola wrote…
Is it fair to say that the sstable identifier is now (fileNum: uint64, onSharedFS: bool, creatorUniqueID: uint64)? If yes, it would be worth highlighting with a code comment in the `FileMetadata` declaration.
Another option (in the same vein of more abstraction boundaries).
SharedFS does a mapping from local filenum to shared filename:
- Is configured with which levels to share, or other more complicated policies
- Writer client tells sharedFS, when opening an sstable for writing, which level it corresponds to and a lower bound on the first key in the sstable. The latter can be used to exclude CockroachDB's local key space from sharing (in the future).
- SharedFS manages all the staging first to local filesystem and then to shared storage.
- Reader client (for sharing with other nodes) is aware of policies in SharedFS since it needs to know that for certain parts of the keyspace it should only read up to L4.
- Asks SharedFS for the (local filenum, shared filename) pairs for L5 and L6 in the relevant keyspace. If some of these files have not yet been written to shared storage (because of an outage), the SharedFS will block.
- Places a "hold" on deletion of these local filenums
- sends over the shared filenames
- Receiver client gives the shared filename to its local SharedFS which generates a local filenum (in a manner consistent with the filenum generation Pebble currently does, so as to not have collisions)
TFTRs! Addressed some of the comments, added to the discussions in others
Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @RaduBerinde, and @sumeerbhola)
compaction.go
line 2815 at r1 (raw file):
Previously, sumeerbhola wrote…
Should this be hidden in the `sharedFS` implementation?
To tolerate transient unavailability of the shared storage it would absorb all writes locally and when the file was closed would copy it over. And if the copy did not succeed it would keep track in local persistent storage of what pending copies need to be done.
What I am suggesting may not be the right abstraction, but I do think we need to decompose the sub-systems in a way that we can manage complexity. We do a decomposition like this for the `encryptedFS`, which has its own "manifest".
I like this idea, where `sharedFS` could take care of falling back to the local FS if shared storage is unavailable. There could be a `ReadyForShare()` method that confirms a file is in blob storage. I'll update the TODO for this.
compaction.go
line 2824 at r1 (raw file):
Previously, RaduBerinde wrote…
Would it be more efficient to write out the file directly on shared storage? Maybe leave a TODO
Yeah, it probably would be, especially behind a `bufio.Writer`. Added a TODO.
compaction.go
line 3035 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] This looks like it should just contain a `fileInfo` (either as a field or embedded).
Done.
options.go
line 626 at r1 (raw file):
Previously, sumeerbhola wrote…
How about being a bit more explicit in this comment about what it means to "not fully own". Something like:
SharedFS is for writing files that can later be read by others that are given the name of the file. There is a single writer for a file. Only sstables should be put in SharedFS.
The "necessitating more coordination" part is not clear. From the perspective of this interface, which is for use by Pebble, could we hide the coordination part behind an `ImportFile(filename string)` method in a `SharedFS` interface that extends `vfs.FS`? So the contract would be that to share a file, the sender (original writer, or someone who has imported) would send over the filename and wait for the receiver to tell the sender that it has called `ImportFile`. Once that has happened, the sender can safely delete the file since the receiver has taken a reference. We of course will need an out-of-band mechanism to do cleanup when a store that has imported (or wrote originally) permanently disappears, but that cleanup mechanism would be a component of the implementation and not part of this interface.
The sender would send over the s/filename/`SharedSSTMeta`/ (or some other struct that contains all the fields to resolve a path, plus additional fields for bounds), but otherwise I agree. The cleanup mechanism is something @RaduBerinde is putting more thought into. I'm happy to add an `ImportFile(filename string)` to an interface extending `vfs.FS` if he's good with the idea of keeping this coordination in the `sharedFS`. The other way would be to implement it in some other coordinator struct in Pebble that operates on a `sharedFS` and just uses plain file handles to put down (and scan) markers.
I've made the comment change otherwise.
options.go
line 630 at r1 (raw file):
Previously, sumeerbhola wrote…
the files written by this instance, yes?
Done.
options.go
line 634 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] "unique id" sounds very generic, maybe DB ID or Store ID
Done.
table_cache.go
line 274 at r1 (raw file):
Previously, sumeerbhola wrote…
Doesn't this need to change, since `fileNum` is no longer sufficient as the key? And for the block cache too?
I'm going to let the virtual sstable changes handle any cache changes as it's going to be more relevant there, but this still works because each shared sstable still gets a unique `fileNum` for its own instance.
It'll probably be a good idea to have the `fileNum`s here map to physical files if we have a locally-owned physical sstable backing the current virtual sstable; that way we can use the same `Reader` for both the pre-excised and post-excised sstable. But that can come later.
internal/manifest/version.go
line 1112 at r1 (raw file):
Previously, sumeerbhola wrote…
+1
Elsewhere in the code it's a little more common to pass around iterators or `fileMetadata` instead of file path parameters. I'll also be introducing a `SharedSSTMeta` or `SharedTableMeta` which holds many of these fields and will be exported by iterators. Does that ease your concern, or should I do one just for the few path variables?
internal/manifest/version_edit.go
line 308 at r1 (raw file):
Previously, sumeerbhola wrote…
Another option (in the same vein of more abstraction boundaries).
SharedFS does a mapping from local filenum to shared filename:
- Is configured with which levels to share, or other more complicated policies
- Writer client tells sharedFS, when opening an sstable for writing, which level it corresponds to and a lower bound on the first key in the sstable. The latter can be used to exclude CockroachDB's local key space from sharing (in the future).
- SharedFS manages all the staging first to local filesystem and then to shared storage.
- Reader client (for sharing with other nodes) is aware of policies in SharedFS since it needs to know that for certain parts of the keyspace it should only read up to L4.
- Asks SharedFS for the (local filenum, shared filename) pairs for L5 and L6 in the relevant keyspace. If some of these files have not yet been written to shared storage (because of an outage), the SharedFS will block.
- Places a "hold" on deletion of these local filenums
- sends over the shared filenames
- Receiver client gives the shared filename to its local SharedFS which generates a local filenum (in a manner consistent with the filenum generation Pebble currently does, so as to not have collisions)
I think `fileNum` is still a unique identifier for an SST for each Pebble instance, and that's unlikely to change. I don't want to over-legislate much here yet because Arjun is going to make many similar `FileMetadata` changes when he lands virtual SSTs imminently, so I'd rather build off of whatever comes in that PR. It's why I kept the pieces here to the bare minimum necessary for having a sharedFS.
We will also inevitably have bounds logic tied to non-shared files (e.g. an L0 file that gets excised away), and we'll need it for the ingestion case with virtual SSTs too, which doesn't concern disaggregated storage at all and where `sharedFS` is nil.
I do like the idea of pushing down the level to the sharedFS; that way the seqnum substitution could happen in an sst iterator created by the sharedFS instead of one created by the table cache.
Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)
internal/manifest/version_edit.go
line 308 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I think `fileNum` is still a unique identifier for an SST for each Pebble instance, and that's unlikely to change. I don't want to over-legislate much here yet because Arjun is going to make many similar `FileMetadata` changes when he lands virtual SSTs imminently, so I'd rather build off of whatever comes in that PR. It's why I kept the pieces here to the bare minimum necessary for having a sharedFS.
We will also inevitably have bounds logic tied to non-shared files (e.g. an L0 file that gets excised away), and we'll need it for the ingestion case with virtual SSTs too, which doesn't concern disaggregated storage at all and where `sharedFS` is nil.
I do like the idea of pushing down the level to the sharedFS; that way the seqnum substitution could happen in an sst iterator created by the sharedFS instead of one created by the table cache.
I think we can decouple the virtual sstable logic that will come from @bananabrick's PR from the filename mapping.
Pebble needs to know the virtual keyspans and also needs to know if the sstable is foreign (for the key seqnum stuff). The virtual sstable change will give us the former (which will map a virtual sst filenum to the base filenum), and we will need to add the latter.
But Pebble doesn't need to know how a locally assigned filenum that it has picked to be unique (either when creating a new file for writing or when calling import) maps to some actual blobname in S3 etc. It would be nice to define an interface for this `SharedFS` in this PR.
Reviewed all commit messages.
Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
internal/manifest/version_edit.go
line 308 at r1 (raw file):
Writer client tells sharedFS, when opening an sstable for writing, which level it corresponds to and a lower bound on the first key in the sstable. The latter can be used to exclude CockroachDB's local key space from sharing (in the future).
This seems a little tricky, given sstables can be moved across levels. I'm a little wary of sharedFS directly understanding levels, because it muddles the VFS abstraction. LSM versioning means that a file doesn't have one level.
I +1 the notion of a thicker SharedFS which handles the mapping from local filenum to shared filename. Thinking through more abstraction boundaries, another option is accepting an `Options.SecondaryStorage` option but keeping a single `vfs.FS` field:
```go
type SecondaryStorage interface {
	// Offload records that the named file should be moved to secondary storage.
	Offload(path string)
	// Sync ensures that the record of all previous calls to `Offload` is
	// persisted. After Sync, a process restart must not forget to offload any
	// files that were previously marked for offloading. Sync does NOT guarantee
	// that the files themselves have already been offloaded to secondary
	// storage.
	Sync() error
	// IsOffloaded returns true if the named file is persisted on secondary
	// storage. If true, the named file may also be available locally. If false,
	// the named file MUST be available locally.
	IsOffloaded(path string) bool
}
```
All file manipulation continues to go through `Options.FS`. If `SecondaryStorage` is provided, then `Options.FS` must hold a shared FS (maybe implemented by the same type as the one implementing `SecondaryStorage`). When `Open`-ing a file, the sharedFS uses a local file handle if available or a blob-storage one if not. I don't think we're losing anything by forcing file creations to go to the local filesystem first, but curious your thoughts.
Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)
internal/manifest/version_edit.go
line 308 at r1 (raw file):
I don't think we're losing anything by forcing file creations to go to the local filesystem first, but curious your thoughts.
I don't think we want to impose that. I am keen to get rid of disk write bandwidth as a resource bottleneck when running with a "good" sharedFS implementation -- bandwidth provisioning is a major headache for our customers.
Thanks for the reviews! Added a `SharedFileHandle` and a beefier `SharedFS` interface. PTAL.
Reviewable status: 0 of 16 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
internal/manifest/version_edit.go
line 308 at r1 (raw file):
Previously, sumeerbhola wrote…
I don't think we're losing anything by forcing file creations to go to the local filesystem first, but curious your thoughts.
I don't think we want to impose that. I am keen to get rid of disk write bandwidth as a resource bottleneck when running with a "good" sharedFS implementation -- bandwidth provisioning is a major headache for our customers.
I've added a `SharedFileHandle` struct that `SharedFS` uses to construct shared file paths, and that contains the two fields Pebble needs to be aware of (instance ID and file num). I didn't want to legislate a shared FS manifest right now because it sounds unnecessary to me when we really only need to hold two additional fields on top of what we already store in `FileMetadata`. If a shared FS manifest ends up being necessary, we can do that on our own behind the interface (and we might need to if we want to write through to sharedFS while having a fallback to localFS within `sharedFS`), but with the way the interface is right now, a stateless `sharedFS` would also be an acceptable implementation.
The good thing is `MakeSharedSSTFilePath` is gone now.
Ah, looks like #2186 + this change may have introduced a circular dependency in the changes to …
We need to pull back for a bit and think carefully about the entire abstraction. I don't think a filesystem-style interface is the right thing here. We should come up with a new "object storage" interface that contains the high level operations that we need (many of the FS-type operations aren't necessary). The implementation should be able to use cloud storage, or some future distributed storage/caching layer, without any changes in Pebble. As an example, the logic for the naming of shared files should live inside the implementation of that interface, not in Pebble. Likewise, if we want to use local FS to "buffer" data around cloud storage unavailability, all that should also be inside this implementation and not in Pebble.
We need to pull back for a bit and think carefully about the entire abstraction. I don't think a filesystem-style interface is the right thing here. We should come up with a new "object storage" interface that contains the high level operations that we need (many of the FS-type operations aren't necessary).
My interpretation of this is that `SharedFS` should not extend `FS`. We should include only the methods we need for sstable reading/writing, since those are the only things that live on the `SharedFS`. In that sense I completely agree.
Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)
options.go
line 634 at r4 (raw file):
```go
// uint64, however Open() can set it to the last-set Instance ID if it finds
// one in the Options file.
InstanceID uint64
```
CockroachDB's `StoreID` type is an int32. Using a uint64 here seems fine, but I am curious if you think there are any performance implications of the choice being made here.
internal/manifest/version_edit.go
line 308 at r1 (raw file):
I didn't want to legislate a shared FS manifest right now because it sounds unnecessary to me when we really only need to hold two additional fields on top of what we already store in FileMetadata. ... but with the way the interface is right now, a stateless sharedFS would also be an acceptable implementation.
I can appreciate that viewpoint, but I'd prefer to get the interfaces to be what we expect them to be in the future. Storing the mapping is a quick implementation.
vfs/vfs.go
line 150 at r4 (raw file):
```go
// SetInstanceID sets the instance ID of the current Pebble instance. Called
// during Open(). Can be stored and/or used by the implementation to
```
Does this need to be called during `Open`, or just before any other operations on the `SharedFS`?
If it needs to be `Open`, then I suppose we can't use the CockroachDB storeId when this Pebble instance is embedded in CockroachDB. If it is a different ID, we will need to store the mapping somewhere. Any thoughts on these details?
vfs/vfs.go
line 155 at r4 (raw file):
```go
// ResolvePath converts a SharedFileHandle to a file path that can
// be used by FS methods for file manipulation/reading (eg. Create(), Open()).
```
There are 2 cases here:
- Creating a new file: We want to provide a local `FileNum` in the parameter, so that the `SharedFS` can remember the mapping. The `CreatorInstanceID` is the one we have already set via `SetInstanceID`.
- Opening an existing file: This could be a file created by this instance or by another instance. In either case, we have assigned it a local `FileNum` (for the latter case, that local `FileNum` assignment should happen in `Reference`). Again only the local `FileNum` should be needed as a parameter.
But I think we should separate these two cases into different methods, such as `ResolveNewPath(fileNum uint64) string` and `ResolveExistingPath(fileNum uint64) string`, so that the implementation can do proper error checking, and since only the former involves a write and fsync (to record the mapping).
vfs/vfs.go
line 164 at r4 (raw file):
```go
// instance can call ResolvePath() and then Create() directly without
// calling this method.
Reference(fileHandle SharedFileHandle) error
```
I was expecting something like `ReferenceForeign(resolvedPath string, localFileNum uint64) error`, where we are telling the `SharedFS` to create a mapping from `localFileNum` to `resolvedPath`. The invariant is that the mapping cannot be changed, and asking to create the same mapping multiple times will succeed but there is still only 1 reference.
Then `MarkObsolete` would become `Dereference(fileNum uint64)`.
I think with these changes, Pebble only needs to remember (a) the local `fileNum`, (b) whether an sstable is in the `SharedFS` or a regular `FS` (because in the future not all L5 and L6 files will be in the sharedFS, and even L0-L4 files are likely to be in the sharedFS), (c) if in the `SharedFS`, whether it is a foreign sst (for the seqnum remapping).
One question is whether even (b) should be something that Pebble should not bother remembering. It would remove demux code from Pebble, which is better. In this world we would just have a new narrower `FS` interface for ssts, which could be either all local, or a mix of local and shared. The local impl would not support `ReferenceForeign`, and `Dereference` would be deletion.
Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)
vfs/vfs.go
line 164 at r4 (raw file):
Previously, sumeerbhola wrote…
I was expecting something like `ReferenceForeign(resolvedPath string, localFileNum uint64) error`, where we are telling the `SharedFS` to create a mapping from `localFileNum` to `resolvedPath`. The invariant is that the mapping cannot be changed, and asking to create the same mapping multiple times will succeed but there is still only 1 reference.
Then `MarkObsolete` would become `Dereference(fileNum uint64)`.
I think with these changes, Pebble only needs to remember (a) the local `fileNum`, (b) whether an sstable is in the `SharedFS` or a regular `FS` (because in the future not all L5 and L6 files will be in the sharedFS, and even L0-L4 files are likely to be in the sharedFS), (c) if in the `SharedFS`, whether it is a foreign sst (for the seqnum remapping).
One question is whether even (b) should be something that Pebble should not bother remembering. It would remove demux code from Pebble, which is better. In this world we would just have a new narrower `FS` interface for ssts, which could be either all local, or a mix of local and shared. The local impl would not support `ReferenceForeign`, and `Dereference` would be deletion.
I am becoming more keen on having an `sstFS` which includes APIs that are needed when backed by a shared FS implementation, and not having two `FS`s for dealing with ssts.
What I am missing is a very clear understanding of the APIs needed for "data export", which will be used for sending range snapshots. It is relatively easy to add a `supportsExportViaSharedFile() bool` method to this `sstFS`, for which the non-shared implementation will return false. But when it returns true, we need to do something to get the L5 and L6 ssts that we are sending over. Pebble already knows the filenums corresponding to those L5 and L6 ssts and it can manage their lifecycle by ensuring that `Dereference` is not called. I suppose all that is needed is a `ResolvePathShared(fileNum uint64) (string, error)` that will return an error if it cannot return a string name that can be sent to a remote node (say, because the file is local).
My interpretation of this is that SharedFS should not extend FS. We should include only the methods we need for sstable reading/writing, since those are the only things that live on the SharedFS. In that sense I completely agree.
Yes. I've been looking through the code, and it looks like extracting a narrower interface just for SST storage would help even with just the current functionality: there is a lot of complication in the sstable code around write buffering, read-ahead, etc., and these could all live behind the interface (inside the implementation). I will continue to work on that and hopefully have something out within a few days.
Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)
This is ready for another look. I've distilled a lot of the comments above and simplified the … I've kept the instanceID but made it a …
My thinking was that (with the introduction of the `objstorage` layer) this interface would no longer need to implement the entire `vfs.File`. There are two high-level choices here:
1. Make the interface close to the Provider interface (i.e. have `Readable`/`Writable` implementations); all the "middleware" (object obsoletion, readahead, caching) is implemented under the interface.
2. Make the interface lower-level and very simple (close to `ExternalStorage` in `cockroach/cloud`) and implement all the middleware on top of it.
The shared object manifest conceptually belongs together with that middleware (e.g. object cleanup is at the end of the day making sure the object store state matches the manifest); there are also practical advantages to that - the manifest can help with the middleware, e.g. we can keep track of cleanup-related stuff in there.
For now I would very much prefer we go with 2, for the sake of ease of development - it will be a lot more clunky if all the middleware and the manifest has to be developed in the crdb repository. But we should make it possible to later pull the middleware and manifest out, switching to option 1 (if we choose to).
Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)
Reviewable status: 0 of 16 files reviewed, 12 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
options.go
line 630 at r5 (raw file):
```go
SharedStorage vfs.SharedStorage

// InstanceID is a unique ID that's generated for each Pebble instance and
```
In the demo, this was a random 32-bit integer, which I think isn't enough randomness to prevent collisions. E.g., with ~3,000 nodes the odds of a collision are about 1:1000. I think we'll want to ensure we have uniqueness confirmed by consensus. Can we lazily initialize the instance ID once we have the store ID available?
My original understanding was this would be under the interface: …
And that this would be above the interface: …
I'm happy to just import the cloud.ExternalStorage-like interface and build this all in Pebble for now, but I think we should leave a comment about pushing some of these features down to Cockroach. I can see logic about, say, "other node is in a different bucket in a different region so its path looks like this" living better in Cockroach than Pebble.
Made the change, this PR is now just the blob storage driver interface. PTAL.
Reviewable status: 0 of 16 files reviewed, 12 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
options.go
line 634 at r4 (raw file):
Previously, sumeerbhola wrote…
CockroachDB's StoreID type is an int32. Using a uint64 here seems fine, but I am curious if you think there are any performance implications of the choice being made here.
Done.
options.go
line 630 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
In the demo, this was a random a 32-bit integer, which I think isn't enough randomness to prevent collisions. Eg, with ~3,000 nodes the odds of a collision is 1:1000. I think we'll want to ensure we have uniqueness confirmed by consensus. Can we lazily initialize the instance ID once we have the store ID available?
We probably could, but we'd have to go back and rename a bunch of files or so, and that seems like a lot of logistical work. I was hoping we could just lay down a marker for a shared ID, and as part of SetInstanceID we could check if there's a marker in the shared storage for that ID and, if there is one, we regenerate it. That should help a ton while still conserving some bits.
Either way, this is out of this PR as the sharedstorage interface is now just a blob storage driver.
vfs/vfs.go
line 150 at r4 (raw file):
Previously, sumeerbhola wrote…
Does this need to be called during Open, or just before any other operations on the SharedFS?
If it needs to be during Open, then I suppose we can't use the CockroachDB storeID when this Pebble instance is embedded in CockroachDB. If it is a different ID, we will need to store the mapping somewhere. Any thoughts on these details?
Maybe the comment isn't clear, but this is Options.Experimental.InstanceID. It won't be called by Cockroach into Pebble; rather, it'll be called by Pebble itself during Open into this method.
The way the instance ID determination is written right now, Pebble can decide on it before Cockroach has arrived at a store ID in the first place. That's a critically important detail as sharedFS might need this ID before it can write any files itself.
vfs/vfs.go
line 155 at r4 (raw file):
Previously, sumeerbhola wrote…
There are 2 cases here:
- Creating a new file: We want to provide a local FileNum in the parameter, so that the SharedFS can remember the mapping. The CreatorInstanceID is the one we have already set via SetInstanceID.
- Opening an existing file: This could be a file created by this instance or by another instance. In either case, we have assigned it a local FileNum (for the latter case, that local FileNum assignment should happen in Reference). Again, only the local FileNum should be needed as a parameter.

But I think we should separate these two cases into different methods, such as

ResolveNewPath(fileNum uint64) string
ResolveExistingPath(fileNum uint64) string

so that the implementation can do proper error checking, and since only the former involves a write and fsync (to record the mapping).
Path resolution is now buried inside the SharedStorage.
vfs/vfs.go
line 164 at r4 (raw file):
Previously, sumeerbhola wrote…
I am becoming more keen on having an sstFS which includes APIs that are needed when backed by a shared FS implementation, and not having two FSs for dealing with ssts.

What I am missing is a very clear understanding of the APIs needed for "data export", which will be used for sending range snapshots. It is relatively easy to add a supportsExportViaSharedFile() bool method to this sstFS, for which the non-shared implementation will return false. But when it returns true, we need to do something to get the L5 and L6 ssts that we are sending over. Pebble already knows the filenums corresponding to those L5 and L6 ssts and it can manage their lifecycle by ensuring that Dereference is not called. I suppose all that is needed is a ResolvePathShared(fileNum uint64) (string, error) that will return an error if it cannot return a string name that can be sent to a remote node (say because the file is local).
Done (unnecessary as Provider will handle this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 16 files reviewed, 12 unresolved discussions (waiting on @bananabrick, @jbowens, and @sumeerbhola)
options.go
line 630 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
We probably could, but we'd have to go back and rename a bunch of files or so, and that seems like a lot of logistical work. I was hoping we could just lay down a marker for a shared ID, and as part of SetInstanceID we could check if there's a marker in the shared storage for that ID and, if there is one, we regenerate it. That should help a ton while still conserving some bits.

Either way, this is out of this PR as the sharedstorage interface is now just a blob storage driver.
If we just want to generate this as a random 64-bit integer, it would be easy enough to do as part of the shared object manifest (aka catalog). But it would be nice if it matched the store ID so there isn't another inscrutable ID that we'd need to map when debugging things. Realistically, we'd need to delay creation of any L5/L6 object until the store ID is set (which I'm not sure is feasible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 13 files at r6.
Reviewable status: 1 of 16 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)
vfs/vfs.go
line 184 at r6 (raw file):
io.Closer

// ReadFile is shorthand for ReadFileAt with offset 0.
How about removing this shorthand. There is no special case of reading the start of a sstable.
vfs/vfs.go
line 200 at r6 (raw file):
// List enumerates files within the supplied prefix, returning a list of
// objects within that prefix. If delimiter is non-empty names which have the
nit: ... non-empty, names ...
vfs/vfs.go
line 202 at r6 (raw file):
// objects within that prefix. If delimiter is non-empty names which have the
// same prefix, prior to the delimiter, are grouped into a single result which
// is that prefix. The order that results are returned is undefined.
So the prefix is used for filtering regardless of delimiter, but the delimiter is for grouping but not ordering? I am a bit confused. Can you give an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 16 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)
vfs/vfs.go
line 178 at r6 (raw file):
// top of these methods. // // TODO(bilal): Consider reworking this interface to return File-like objects,
I would move this to objstorage (maybe in its own subpackage, e.g. objstorage/sharedobj) and rename File to Object in the methods.

I'd also remove this TODO - if we move the middleware behind an interface, it would be a new higher-level interface (fairly similar to the current Provider public "interface"), and we would still use this interface inside that implementation.
vfs/vfs.go
line 190 at r6 (raw file):
ReadFileAt(basename string, offset int64) (io.ReadCloser, int64, error)

// Writer returns a writer for the requested name.
This should explain the semantics when the object exists (I believe it will be overwritten?). Maybe CreateObject is a better name than Writer.

Also, I believe this is an all-or-nothing operation (at least on S3 - either there is a single PutObject request, or a multi-part upload which only becomes effective with CompleteMultipartUpload). Is this something that would be useful to promise here? I don't think we care about that for SSTs, maybe we would for the blob feature @sumeerbhola is working on?
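One way to picture the all-or-nothing promise being discussed: a writer that buffers everything and only publishes the object atomically on Close. This is an illustrative in-memory sketch - the type and method names (store, CreateObject, exists) are hypothetical, not the PR's actual interface:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// store is a toy in-memory object store. CreateObject is all-or-nothing:
// bytes are buffered and only become visible atomically on Close,
// mimicking S3's PutObject / CompleteMultipartUpload behavior.
type store struct {
	mu      sync.Mutex
	objects map[string][]byte
}

type objWriter struct {
	s    *store
	name string
	buf  bytes.Buffer
}

func (w *objWriter) Write(p []byte) (int, error) { return w.buf.Write(p) }

// Close publishes the buffered bytes in a single step, overwriting any
// existing object with the same name.
func (w *objWriter) Close() error {
	w.s.mu.Lock()
	defer w.s.mu.Unlock()
	w.s.objects[w.name] = w.buf.Bytes()
	return nil
}

func (s *store) CreateObject(name string) io.WriteCloser {
	return &objWriter{s: s, name: name}
}

func (s *store) exists(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.objects[name]
	return ok
}

func main() {
	s := &store{objects: map[string][]byte{}}
	w := s.CreateObject("000001.sst")
	w.Write([]byte("partial data"))
	fmt.Println(s.exists("000001.sst")) // false: nothing visible yet
	w.Close()
	fmt.Println(s.exists("000001.sst")) // true: published atomically
}
```

A reader concurrent with the write never observes a half-written object - it either sees nothing or the complete object, which is the property the multi-part upload discussion is about.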
vfs/vfs.go
line 201 at r6 (raw file):
// List enumerates files within the supplied prefix, returning a list of
// objects within that prefix. If delimiter is non-empty names which have the
// same prefix, prior to the delimiter, are grouped into a single result which
Prior to the delimiter but after the given prefix. Maybe a short example would help.
vfs/vfs.go
line 202 at r6 (raw file):
// objects within that prefix. If delimiter is non-empty names which have the
// same prefix, prior to the delimiter, are grouped into a single result which
// is that prefix. The order that results are returned is undefined.
How can the caller differentiate between an object and a group ("subdirectory")?
vfs/vfs.go
line 202 at r6 (raw file):
Previously, sumeerbhola wrote…
So the prefix is used for filtering regardless of delimiter, but the delimiter is for grouping but not ordering? I am a bit confused. Can you give an example?
This is my understanding - say we have files
a
x/y/1
x/y/2
x/z/1
x/z/2
Say we use no prefix (prefix = ""). If we don't use a delimiter, we get all of the above. If we use delimiter / we get a and x. A set delimiter means we are listing "directory style".

Say we use prefix x/. If we don't use a delimiter, we get all four that start with x/. If we use delimiter / we get x/y and x/z (or maybe y and z - I'm not sure if the result would contain the x/ prefix we provided or not).
AFAICT, there is nothing special about the delimiter used. From what I've read, S3 treats all names as opaque strings but uses a prefix to break things up into shards internally (the first X chars, where X is chosen depending on how many distinct values those chars take). Which explains why in some cases you want to include a hashed shard early in the filenames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 16 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)
vfs/vfs.go
line 190 at r6 (raw file):
Previously, RaduBerinde wrote…
This should explain the semantics when the object exists (I believe it will be overwritten?). Maybe CreateObject is a better name than Writer.

Also, I believe this is an all-or-nothing operation (at least on S3 - either there is a single PutObject request, or a multi-part upload which only becomes effective with CompleteMultipartUpload). Is this something that would be useful to promise here? I don't think we care about that for SSTs, maybe we would for the blob feature @sumeerbhola is working on?
Blob files will be written the same way as SSTs.
vfs/vfs.go
line 202 at r6 (raw file):
Previously, RaduBerinde wrote…
This is my understanding - say we have files

a
x/y/1
x/y/2
x/z/1
x/z/2

Say we use no prefix (prefix = ""). If we don't use a delimiter, we get all of the above. If we use delimiter / we get a and x. A set delimiter means we are listing "directory style".

Say we use prefix x/. If we don't use a delimiter, we get all four that start with x/. If we use delimiter / we get x/y and x/z (or maybe y and z - I'm not sure if the result would contain the x/ prefix we provided or not).

AFAICT, there is nothing special about the delimiter used. From what I've read, S3 treats all names as opaque strings but uses a prefix to break things up into shards internally (the first X chars, where X is chosen depending on how many distinct values those chars take). Which explains why in some cases you want to include a hashed shard early in the filenames.
Is this behavior specific to S3?
It sounds like specifying the delimiter means don't do a recursive search into the directory for files. The phrasing sounds confusing to me.
Do we have a use case for specifying the delimiter? If not, can we drop it from the interface until we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 16 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)
vfs/vfs.go
line 202 at r6 (raw file):
Previously, sumeerbhola wrote…
Is this behavior specific to S3?
It sounds like specifying the delimiter means don't do a recursive search into the directory for files. The phrasing sounds confusing to me.

Do we have a use case for specifying the delimiter? If not, can we drop it from the interface until we need it?
Google and Azure support the same delimiter semantics (as does the CRDB cloud.ExternalStorage interface).

I don't immediately see a use case for us. We could efficiently list all "creator-id"s that have shared objects, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @sumeerbhola from 6 discussions.
Reviewable status: 1 of 16 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
vfs/vfs.go
line 178 at r6 (raw file):
Previously, RaduBerinde wrote…
I would move this to objstorage (maybe in its own subpackage, e.g. objstorage/sharedobj) and rename File to Object in the methods.

I'd also remove this TODO - if we move the middleware behind an interface, it would be a new higher-level interface (fairly similar to the current Provider public "interface"), and we would still use this interface inside that implementation.
Moved. Also updated the TODO to just be about the latter part - pushing obsoletion behind it. It's just a TODO so we can always just drop it.
vfs/vfs.go
line 184 at r6 (raw file):
Previously, sumeerbhola wrote…
How about removing this shorthand. There is no special case of reading the start of a sstable.
Done.
vfs/vfs.go
line 201 at r6 (raw file):
Previously, RaduBerinde wrote…
Prior to the delimiter but after the given prefix. Maybe a short example would help.
Done.
vfs/vfs.go
line 202 at r6 (raw file):
Previously, RaduBerinde wrote…
Google and Azure support the same delimiter semantics (as does the CRDB cloud.ExternalStorage interface).

I don't immediately see a use case for us. We could efficiently list all "creator-id"s that have shared objects, I guess.
Good point, added some examples. PTAL.
vfs/vfs.go
line 202 at r6 (raw file):
Previously, RaduBerinde wrote…
How can the caller differentiate between an object and a group ("subdirectory")?
Not in an obvious way. The ReaderAt and Size calls would fail if you called them on just a directory. Mimics how we use ExternalStorage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 17 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
objstorage/shared/storage.go
line 44 at r7 (raw file):
// List("", "/") -> ["a", "b"]
// List("b", "/") -> ["4", "5", "6"]
// List("b", "") -> ["4", "5", "6"]
Shouldn't this be "/4", "/5", "/6"?
?
This change adds an objstorage.shared.Storage interface that can be implemented by blob storage drivers. The objstorage.Provider coming in cockroachdb#2267 will largely be responsible for storing the state of files' locations (i.e. shared or local), and calling into Storage as necessary.
Previously, RaduBerinde wrote…
Oops, yes. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 17 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
This change adds a Storage interface that can be implemented by blob storage drivers. The objstorage.Provider coming in #2267 will largely be responsible for storing the state of files' locations (i.e. shared or local), and calling into Storage as necessary.