
*: Add objstorage.shared.Storage interface for storing sstables #2231

Merged
merged 1 commit into cockroachdb:master from vfs-shared
Feb 7, 2023

Conversation

Member

@itsbilal itsbilal commented Jan 10, 2023

This change adds a Storage interface that can be implemented by blob storage drivers. The objstorage.Provider coming in #2267 will largely be responsible for storing state of files' locations (i.e. shared or local), and calling into Storage as necessary.
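
For readers skimming the thread, here is a minimal, illustrative sketch of the kind of driver-level interface being described (package, method names, and signatures are placeholders, not the merged API):

package sharedsketch

import "io"

// Storage is an illustrative blob-storage driver interface; a concrete
// implementation would wrap S3, GCS, etc. The objstorage.Provider sits above
// it and decides which sstables live on shared storage.
type Storage interface {
    io.Closer

    // ReadObjectAt returns a reader positioned at offset within the named
    // object, along with the object's total size.
    ReadObjectAt(basename string, offset int64) (io.ReadCloser, int64, error)

    // CreateObject returns a writer for a new object; Close makes it visible.
    CreateObject(basename string) (io.WriteCloser, error)

    // List enumerates objects under prefix, optionally grouped by delimiter.
    List(prefix, delimiter string) ([]string, error)

    // Delete removes the named object.
    Delete(basename string) error

    // Size reports the named object's size in bytes.
    Size(basename string) (int64, error)
}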

@itsbilal itsbilal requested review from sumeerbhola and a team January 10, 2023 16:20
@itsbilal itsbilal self-assigned this Jan 10, 2023

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: 0 of 15 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


compaction.go line 2824 at r1 (raw file):

				return nil, pendingOutputs, err
			}
			if err := vfs.CopyAcrossFS(fs, origPath, sharedFS, destPath); err != nil {

Would it be more efficient to write out the file directly on shared storage? Maybe leave a TODO


compaction.go line 3035 at r1 (raw file):

// obsoleteFile holds information about a file that needs to be deleted soon.
type obsoleteFile struct {

[nit] This looks like it should just contain a fileInfo (either as a field or embedded)


options.go line 634 at r1 (raw file):

		// however Open() can set it to the last-set Unique ID if it finds one in
		// the Options file.
		UniqueID uint64

[nit] "unique id" sounds very generic, maybe DB ID or Store ID


internal/manifest/version.go line 1112 at r1 (raw file):

		for f := iter.First(); f != nil; f = iter.Next() {
			fileFS := fs
			path := base.MakeFilepath(fs, dirname, base.FileTypeTable, f.FileNum)

It feels that this logic needs to be encapsulated or we'll end up duplicating this kind of code in many places. Maybe a struct that contains fs, sharedFs, sharedDir and implements a function that returns the path and the fs given a fileInfo.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewed 7 of 15 files at r1, all commit messages.
Reviewable status: 7 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


compaction.go line 2815 at r1 (raw file):

		//
		// TODO(bilal): Move this outputLevel >= 5 logic for determining whether
		// a file should become shared to elsewhere when it's more complex.

Should this be hidden in the sharedFS implementation?
To tolerate transient unavailability of the shared storage it would absorb all writes locally and when the file was closed would copy it over. And if the copy did not succeed it would keep track in local persistent storage of what pending copies need to be done.
What I am suggesting may not be the right abstraction, but I do think we need to decompose the sub-systems in a way that we can manage complexity. We do a decomposition like this for the encryptedFS, which has its own "manifest".
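
As a rough illustration of the decomposition being suggested (entirely hypothetical; the pendingCopies record and closeAndUpload helper are invented for this sketch):

package sharedsketch

import "os"

// pendingCopies is a hypothetical local, persistent record of files that still
// need to be copied to shared storage after a transient outage.
type pendingCopies struct {
    f *os.File
}

func (p *pendingCopies) add(name string) error {
    if _, err := p.f.WriteString(name + "\n"); err != nil {
        return err
    }
    // Persist the record before acknowledging the file close.
    return p.f.Sync()
}

// closeAndUpload finalizes a locally buffered file and tries to copy it to
// shared storage; if the copy fails, the debt is recorded durably so a retry
// loop can pick it up later.
func closeAndUpload(local *os.File, upload func(path string) error, pending *pendingCopies) error {
    name := local.Name()
    if err := local.Close(); err != nil {
        return err
    }
    if err := upload(name); err != nil {
        return pending.add(name)
    }
    return nil
}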


options.go line 626 at r1 (raw file):

		// not fully owned by this Pebble instance, necessitating more coordination
		// for deletes. It is also expected to have slower read/write performance
		// than FS.

How about being a bit more explicit in this comment about what it means to "not fully own". Something like:
SharedFS is for writing files that can later be read by others that are given the name of the file. There is a single writer for a file. Only sstables should be put in SharedFS.

The "necessitating more coordination" part is not clear. From the perspective of this interface, which is for use by Pebble, could we hide the coordination part behind an ImportFile(filename string) method in a SharedFS interface that extends vfs.FS? So the contract would be that to share a file, the sender (original writer, or someone who has imported) would send over the filename and wait for the receiver to tell the sender that it has called ImportFile. Once that has happened the sender can safely delete the file since the receiver has taken a reference. We of course will need an out-of-band mechanism to do cleanup when a store that has imported (or wrote originally) permanently disappears, but that cleanup mechanism would be a component of the implementation and not part of this interface.


options.go line 630 at r1 (raw file):

		// UniqueID is a unique ID that's generated for new Pebble instances and
		// serialized into the Options file. Used to disambiguate this instance's

the files written by this instance, yes?


table_cache.go line 274 at r1 (raw file):

type tableCacheKey struct {
	cacheID uint64
	fileNum FileNum

Doesn't this need to change since fileNum is no longer sufficient as the key?
And for the block cache too?


internal/manifest/version.go line 1112 at r1 (raw file):

Previously, RaduBerinde wrote…

It feels that this logic needs to be encapsulated or we'll end up duplicating this kind of code in many places. Maybe a struct that contains fs, sharedFs, sharedDir and implements a function that returns the path and the fs given a fileInfo.

+1


internal/manifest/version_edit.go line 308 at r1 (raw file):

						}
						onSharedFS = (field[0] == 1)
						uniqueID, err = d.readUvarint()

is it fair to say that the sstable identifier is now (fileNum: uint64, onSharedFS: bool, creatorUniqueID: uint64)?
If yes, would be worth highlighting with a code comment in the FileMetadata declaration.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 7 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


internal/manifest/version_edit.go line 308 at r1 (raw file):

Previously, sumeerbhola wrote…

is it fair to say that the sstable identifier is now (fileNum: uint64, onSharedFS: bool, creatorUniqueID: uint64)?
If yes, would be worth highlighting with a code comment in the FileMetadata declaration.

Another option (in the same vein of more abstraction boundaries).

SharedFS does a mapping from local filenum to shared filename:

  • Is configured with which levels to share, or other more complicated policies
  • Writer client tells sharedFS, when opening a sstable for writing, which level it corresponds to and a lower bound on the first key in the sstable. The latter can be used to exclude CockroachDB's local key space from sharing (in the future).
    • SharedFS manages all the staging first to local filesystem and then to shared storage.
  • Reader client (for sharing with other nodes) is aware of policies in SharedFS since it needs to know that for certain parts of the keyspace it should only read up to L4.
    • Asks SharedFS for the (local filenum, shared filename) pairs for L5 and L6 in the relevant keyspace. If some of these files have not yet been written to shared storage (because of an outage), the SharedFS will block.
    • Places a "hold" on deletion of these local filenums
    • sends over the shared filenames
  • Receiver client gives the shared filename to its local SharedFS which generates a local filenum (in a manner consistent with the filenum generation Pebble currently does, so as to not have collisions)
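
Read purely as a sketch, the proposal above might translate into an interface like the following (all names are hypothetical):

package sharedsketch

import "io"

// fileNum stands in for Pebble's local file number type.
type fileNum uint64

// mappingFS is a hypothetical SharedFS that owns the mapping from local file
// numbers to shared filenames described in the list above.
type mappingFS interface {
    // CreateShared is told the output level and a lower bound on the first
    // key so the implementation can apply its own sharing policy.
    CreateShared(num fileNum, level int, firstKeyLowerBound []byte) (io.WriteCloser, error)

    // SharedNames returns (local filenum, shared filename) pairs for shared
    // files overlapping [start, end), blocking on any pending uploads, and
    // places a hold on deletion of those files.
    SharedNames(start, end []byte) (map[fileNum]string, error)

    // Import assigns a fresh local filenum to a shared filename received
    // from another node, taking a reference on it.
    Import(sharedName string) (fileNum, error)
}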

Member Author

@itsbilal itsbilal left a comment

TFTRs! Addressed some of the comments, added to the discussions in others

Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @RaduBerinde, and @sumeerbhola)


compaction.go line 2815 at r1 (raw file):

Previously, sumeerbhola wrote…

Should this be hidden in the sharedFS implementation?
To tolerate transient unavailability of the shared storage it would absorb all writes locally and when the file was closed would copy it over. And if the copy did not succeed it would keep track in local persistent storage of what pending copies need to be done.
What I am suggesting may not be the right abstraction, but I do think we need to decompose the sub-systems in a way that we can manage complexity. We do a decomposition like this for the encryptedFS, which has its own "manifest".

I like this idea, where sharedFS could take care of falling back to local FS if shared FS is unavailable. There could be a "ReadyForShare()" method that confirms a file is in blob storage. I'll update the TODO for this.


compaction.go line 2824 at r1 (raw file):

Previously, RaduBerinde wrote…

Would it be more efficient to write out the file directly on shared storage? Maybe leave a TODO

Yeah, it probably would be - especially behind a bufio.Writer. Added a todo.


compaction.go line 3035 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] This looks like it should just contain a fileInfo (either as a field or embedded)

Done.


options.go line 626 at r1 (raw file):

Previously, sumeerbhola wrote…

How about being a bit more explicit in this comment about what it means to "not fully own". Something like:
SharedFS is for writing files that can later be read by others that are given the name of the file. There is a single writer for a file. Only sstables should be put in SharedFS.

The "necessitating more coordination" part is not clear. From the perspective of this interface, which is for use by Pebble, could we hide the coordination part behind an ImportFile(filename string) method in a SharedFS interface that extends vfs.FS? So the contract would be that to share a file, the sender (original writer, or someone who has imported) would send over the filename and wait for the receiver to tell the sender that it has called ImportFile. Once that has happened the sender can safely delete the file since the receiver has taken a reference. We of course will need an out-of-band mechanism to do cleanup when a store that has imported (or wrote originally) permanently disappears, but that cleanup mechanism would be a component of the implementation and not part of this interface.

the sender would send over the s/filename/SharedSSTMeta/ or some other struct that contains all the fields to resolve a path (and additional fields for bounds), but otherwise I agree. The cleanup mechanism is something @RaduBerinde is putting more thought into. I'm happy to add an ImportFile(filename string) to an interface extending vfs.FS if he's good with the idea of keeping this coordination in the sharedFS. The other way would be to implement it in some other coordinator struct in Pebble that operates on a sharedFS and just uses plain file handles to put down (and scan) markers.

I've made the comment change otherwise.


options.go line 630 at r1 (raw file):

Previously, sumeerbhola wrote…

the files written by this instance, yes?

Done.


options.go line 634 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] "unique id" sounds very generic, maybe DB ID or Store ID

Done.


table_cache.go line 274 at r1 (raw file):

Previously, sumeerbhola wrote…

Doesn't this need to change since fileNum is no longer sufficient as the key?
And for the block cache too?

I'm going to let the virtual sstable changes handle any cache changes as it's going to be more relevant there, but this still works because each shared sstable still gets a unique fileNum for its own instance.

It'll probably be a good idea to have the fileNums here map to physical files if we have a locally-owned physical sstable backing the current virtual sstable, that way we can use the same Reader for both the pre-excised and post-excised sstable. But that can come later.


internal/manifest/version.go line 1112 at r1 (raw file):

Previously, sumeerbhola wrote…

+1

Elsewhere in the code it's a little more common to pass around iterators or fileMetadata instead of file path parameters? I'll also be introducing a SharedSSTMeta or SharedTableMeta which implements many of these fields and will be exported by iterators. Does that ease your concern or should I do one just for the few path variables?


internal/manifest/version_edit.go line 308 at r1 (raw file):

Previously, sumeerbhola wrote…

Another option (in the same vein of more abstraction boundaries).

SharedFS does a mapping from local filenum to shared filename:

  • Is configured with which levels to share, or other more complicated policies
  • Writer client tells sharedFS, when opening a sstable for writing, which level it corresponds to and a lower bound on the first key in the sstable. The latter can be used to exclude CockroachDB's local key space from sharing (in the future).
    • SharedFS manages all the staging first to local filesystem and then to shared storage.
  • Reader client (for sharing with other nodes) is aware of policies in SharedFS since it needs to know that for certain parts of the keyspace it should only read up to L4.
    • Asks SharedFS for the (local filenum, shared filename) pairs for L5 and L6 in the relevant keyspace. If some of these files have not yet been written to shared storage (because of an outage), the SharedFS will block.
    • Places a "hold" on deletion of these local filenums
    • sends over the shared filenames
  • Receiver client gives the shared filename to its local SharedFS which generates a local filenum (in a manner consistent with the filenum generation Pebble currently does, so as to not have collisions)

I think fileNum is still a unique identifier for an SST for each pebble and that's unlikely to change. I don't want to over-legislate much here yet because Arjun is going to make many similar FileMetadata changes when he lands virtual SSTs imminently so I'd rather build off of whatever comes in that PR. It's why I kept the pieces here to be the bare minimum necessary for having a sharedFS.

We will also inevitably have bounds logic tied to non-shared files (eg. an L0 file that gets excised away), and we'll need it for the ingestion case with virtual SSTs too that doesn't concern disaggregated storage at all, and where sharedFS is nil.

I do like the idea of pushing down the level to the sharedFS, and that way the seqnum substitution could happen in an sst iterator that is created by sharedFS instead of one that's created by the table cache.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)


internal/manifest/version_edit.go line 308 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I think fileNum is still a unique identifier for an SST for each pebble and that's unlikely to change. I don't want to over-legislate much here yet because Arjun is going to make many similar FileMetadata changes when he lands virtual SSTs imminently so I'd rather build off of whatever comes in that PR. It's why I kept the pieces here to be the bare minimum necessary for having a sharedFS.

We will also inevitably have bounds logic tied to non-shared files (eg. an L0 file that gets excised away), and we'll need it for the ingestion case with virtual SSTs too that doesn't concern disaggregated storage at all, and where sharedFS is nil.

I do like the idea of pushing down the level to the sharedFS, and that way the seqnum substitution could happen in an sst iterator that is created by sharedFS instead of one that's created by the table cache.

I think we can decouple the virtual sstable logic that will come from @bananabrick 's PR, from the filename mapping.

Pebble needs to know the virtual keyspans and also needs to know if the sstable is foreign (for the key seqnum stuff). The virtual sstable change will give us the former (which will map a virtual sst filenum to the base filenum), and we will need to add the latter.

But Pebble doesn't need to know how a locally assigned filenum that it has picked to be unique, either when creating a new file for writing or when calling import, maps to some actual blobname in S3 etc. It would be nice to define an interface for this SharedFS in this PR.

Collaborator

@jbowens jbowens left a comment

Reviewed all commit messages.
Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


internal/manifest/version_edit.go line 308 at r1 (raw file):

Writer client tells sharedFS, when opening a sstable for writing, which level it corresponds to and a lower bound on the first key in the sstable. The latter can be used to exclude CockroachDB's local key space from sharing (in the future).

This seems a little tricky, given sstables can be moved across levels. I'm a little wary of sharedFS directly understanding levels, because it muddles the VFS abstraction. LSM versioning means that a file doesn't have one level.

I +1 the notion of a thicker SharedFS which handles the mapping from local filenum to shared filename. Thinking through more abstraction boundaries, another option is accepting an Options.SecondaryStorage option but keeping a single vfs.FS field:

type SecondaryStorage interface {
    // Offload records that the named file should be moved to secondary storage.
    Offload(path string)
    // Sync ensures that the record of all previous calls to `Offload` is persisted. After
    // Sync, a process restart must not forget to offload any files that were previously
    // marked for offloading. Sync does NOT guarantee that the files themselves have
    // already been offloaded to secondary storage.
    Sync() error
    // IsOffloaded returns true if the named file is persisted on secondary storage. If
    // true, the named file may also be available locally. If false, the named file MUST
    // be available locally.
    IsOffloaded(path string) bool
}

All file manipulation continues to go through Options.FS. If SecondaryStorage is provided, then Options.FS must hold a shared FS (maybe implemented by the same type as the one implementing SecondaryStorage). When Open-ing a file, the sharedFS uses a local file handle if available or a blob-storage one if not. I don't think we're losing anything by forcing file creations to go to the local filesystem first, but curious your thoughts.
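
To make the shape concrete, a hypothetical caller on the Pebble side could use the proposed interface roughly like this (the offloadOutputs helper is invented for illustration):

package sharedsketch

// secondaryStorage mirrors the interface proposed in the comment above.
type secondaryStorage interface {
    Offload(path string)
    Sync() error
    IsOffloaded(path string) bool
}

// offloadOutputs marks a batch of finished compaction outputs for offload and
// then syncs the record, so a crash cannot lose the intent to offload them.
func offloadOutputs(s secondaryStorage, paths []string) error {
    for _, p := range paths {
        s.Offload(p)
    }
    return s.Sync()
}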

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 15 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)


internal/manifest/version_edit.go line 308 at r1 (raw file):

I don't think we're losing anything by forcing file creations to go to the local filesystem first, but curious your thoughts.

I don't think we want to impose that. I am keen to get rid of disk write bandwidth as a resource bottleneck when running with a "good" sharedFS implementation -- bandwidth provisioning is a major headache for our customers.

Member Author

@itsbilal itsbilal left a comment

Thanks for the reviews! Added a SharedFileHandle and a beefier SharedFS interface. PTAL.

Reviewable status: 0 of 16 files reviewed, 7 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


internal/manifest/version_edit.go line 308 at r1 (raw file):

Previously, sumeerbhola wrote…

I don't think we're losing anything by forcing file creations to go to the local filesystem first, but curious your thoughts.

I don't think we want to impose that. I am keen to get rid of disk write bandwidth as a resource bottleneck when running with a "good" sharedFS implementation -- bandwidth provisioning is a major headache for our customers.

I've added a SharedFileHandle struct that SharedFS uses to construct shared file paths, and that contains the two fields Pebble needs to be aware of (instance ID and file num). I didn't want to legislate a shared FS manifest right now because it sounds unnecessary to me when we really only need to hold two additional fields on top of what we already store in FileMetadata. If a shared FS manifest ends up being necessary, we can do that on our own behind the interface (and we might need to if we want to write-through to sharedFS while having a fallback to localFS within sharedFS), but with the way the interface is right now, a stateless sharedFS would also be an acceptable implementation.

The good thing is MakeSharedSSTFilePath is gone now.
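
As a sketch of the two fields being described (the struct name and fields follow the comment above; the field types and the path format are purely illustrative):

package sharedsketch

import "fmt"

// SharedFileHandle carries the two pieces of identity Pebble tracks for a
// shared sstable: the creator instance and the creator-local file number.
type SharedFileHandle struct {
    CreatorInstanceID uint64
    FileNum           uint64
}

// sharedPath shows one possible way an implementation could derive an object
// name from the handle; the real naming scheme lives behind the interface.
func (h SharedFileHandle) sharedPath() string {
    return fmt.Sprintf("%d-%06d.sst", h.CreatorInstanceID, h.FileNum)
}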

@itsbilal
Member Author

Ah, looks like #2186 + this change may have introduced a circular dependency in the changes to Open() here. Let me take a look.

@RaduBerinde
Member

What I am suggesting may not be the right abstraction, but I do think we need to decompose the sub-systems in a way that we can manage complexity.

We need to pull back for a bit and think carefully about the entire abstraction. I don't think a filesystem-style interface is the right thing here. We should come up with a new "object storage" interface that contains the high level operations that we need (many of the FS-type operations aren't necessary). The implementation should be able to use cloud storage, or some future distributed storage/caching layer, without any changes in Pebble. As an example, the logic for the naming of shared files should live inside the implementation of that interface, not in Pebble. Likewise, if we want to use local FS to "buffer" data around cloud storage unavailability, all that should also be inside this implementation and not in Pebble.

Collaborator

@sumeerbhola sumeerbhola left a comment

We need to pull back for a bit and think carefully about the entire abstraction. I don't think a filesystem-style interface is the right thing here. We should come up with a new "object storage" interface that contains the high level operations that we need (many of the FS-type operations aren't necessary).

My interpretation of this is that SharedFS should not extend FS. We should include only the methods we need for sstable reading/writing, since those are the only things that live on the SharedFS. In that sense I completely agree.

Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)


options.go line 634 at r4 (raw file):

		// uint64, however Open() can set it to the last-set Instance ID if it finds
		// one in the Options file.
		InstanceID uint64

CockroachDB's StoreID type is an int32. Using a uint64 here seems fine, but I am curious if you think there are any performance implications of the choice being made here.


internal/manifest/version_edit.go line 308 at r1 (raw file):

I didn't want to legislate a shared FS manifest right now because it sounds unnecessary to me when we really only need to hold two additional fields on top of what we already store in FileMetadata. ... but with the way the interface is right now, a stateless sharedFS would also be an acceptable implementation.

I can appreciate that viewpoint, but I'd prefer to get the interfaces to be what we expect them to be in the future. Storing the mapping is a quick implementation.


vfs/vfs.go line 150 at r4 (raw file):

	// SetInstanceID sets the instance ID of the current Pebble instance. Called
	// during Open(). Can be stored and/or used by the implementation to

Does this need to be called during Open, or just before any other operations on the SharedFS?
If it needs to be Open, then I suppose we can't use the CockroachDB storeId when this Pebble instance is embedded in CockroachDB. If it is a different ID, we will need to store the mapping somewhere. Any thoughts on these details?


vfs/vfs.go line 155 at r4 (raw file):

	// ResolvePath converts a SharedFileHandle to a file path that can
	// be used by FS methods for file manipulation/reading (eg. Create(), Open()).

There are 2 cases here:

  • Creating a new file: We want to provide a local FileNum in the parameter, so that the SharedFS can remember the mapping. The CreatorInstanceID is the one we have already set via SetInstanceID.
  • Opening an existing file: This could be a file created by this instance or by another instance. In either case, we have assigned it a local FileNum (for the latter case, that local FileNum assignment should happen in Reference). Again only the local FileNum should be needed as a parameter.

But I think we should separate these two cases into different methods, such as

ResolveNewPath(fileNum uint64) string
ResolveExistingPath(fileNum uint64) string

so that the implementation can do proper error checking, and since only the former involves a write and fsync (to record the mapping).


vfs/vfs.go line 164 at r4 (raw file):

	// instance can call ResolvePath() and then Create() directly without
	// calling this method.
	Reference(fileHandle SharedFileHandle) error

I was expecting something like ReferenceForeign(resolvedPath string, localFileNum uint64) error, where we are telling the SharedFS to create a mapping from localFileNum to resolvedPath. The invariant is that the mapping cannot be changed, and asking to create the same mapping multiple times will succeed but there is still only 1 reference.

Then MarkObsolete would become Dereference(fileNum uint64).

I think with these changes, Pebble only needs to remember (a) the local fileNum, (b) whether a sstable is in the SharedFS or a regular FS (because in the future not all L5 and L6 files will be in the sharedFS, and even L0-L4 are likely to be in the sharedFS), (c) if in the SharedFS, whether it is a foreign sst (for the seqnum remapping).

One question is whether even (b) should be something that Pebble should not bother remembering. It will remove demux code from Pebble which is better. In this world we would just have a new narrower FS interface for ssts, which could be either all local, or mix of local and shared. The local impl would not support the ReferenceForeign, and Dereference would be deletion.
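
Spelled out as Go, the suggestion is roughly the following (a sketch only, not what this PR merges):

package sharedsketch

// refCountingFS sketches the reference-keeping sstable FS described above.
// The local (non-shared) implementation would reject ReferenceForeign and
// treat Dereference as plain deletion.
type refCountingFS interface {
    // ReferenceForeign records an immutable mapping from localFileNum to an
    // object created by another instance; creating the same mapping twice
    // succeeds but still holds only a single reference.
    ReferenceForeign(resolvedPath string, localFileNum uint64) error

    // Dereference drops this instance's reference; the object is removed
    // from shared storage only once no instance references it.
    Dereference(localFileNum uint64) error
}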

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)


vfs/vfs.go line 164 at r4 (raw file):

Previously, sumeerbhola wrote…

I was expecting something like ReferenceForeign(resolvedPath string, localFileNum uint64) error, where we are telling the SharedFS to create a mapping from localFileNum to resolvedPath. The invariant is that the mapping cannot be changed, and asking to create the same mapping multiple times will succeed but there is still only 1 reference.

Then MarkObsolete would become Dereference(fileNum uint64).

I think with these changes, Pebble only needs to remember (a) the local fileNum, (b) whether a sstable is in the SharedFS or a regular FS (because in the future not all L5 and L6 files will be in the sharedFS, and even L0-L4 are likely to be in the sharedFS), (c) if in the SharedFS, whether it is a foreign sst (for the seqnum remapping).

One question is whether even (b) should be something that Pebble should not bother remembering. It will remove demux code from Pebble which is better. In this world we would just have a new narrower FS interface for ssts, which could be either all local, or mix of local and shared. The local impl would not support the ReferenceForeign, and Dereference would be deletion.

I am becoming more keen on having an sstFS which includes APIs that are needed when backed by a shared FS implementation, and not having two FSs for dealing with ssts.

What I am missing is a very clear understanding of the APIs needed for "data export", which will be used for sending range snapshots. It is relatively easy to add a supportsExportViaSharedFile() bool method to this sstFS, for which the non-shared implementation will return false. But when it returns true, we need to do something to get the L5 and L6 ssts that we are sending over. Pebble already knows the filenums corresponding to those L5 and L6 ssts and it can manage their lifecycle by ensuring that Dereference is not called. I suppose all that is needed is a ResolvePathShared(fileNum uint64) (string, error) that will return an error if it cannot return a string name that can be sent to a remote node (say because the file is local).

Member

@RaduBerinde RaduBerinde left a comment

My interpretation of this is that SharedFS should not extend FS. We should include only the methods we need for sstable reading/writing, since those are the only things that live on the SharedFS. In that sense I completely agree.

Yes. I've been looking through the code and it looks like extracting a more narrow interface just for SST storage would help even with just the current functionality - there is a lot of complication in the sstable code around write buffering, read-ahead, etc and these could all live behind the interface (inside the implementation). I will continue to work on that and hopefully have something out within a few days.

Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)

@itsbilal itsbilal changed the title *: Add support for a shared FS/dir that stores some sstables *: Add vfs.SharedStorage interface for storing sstables Feb 1, 2023
@itsbilal
Member Author

itsbilal commented Feb 1, 2023

This is ready for another look. I've distilled a lot of the comments above and simplified the SharedStorage interface into one that's just FS-like and is only responsible for interacting directly with blob storage and handling obsoletion of files (open to thoughts on pulling that out of the interface and letting objstorage.Provider do it). With #2267 implementing a better middleware for handling state on where each file lives, I can see SharedFilePath being something that the objstorage.Provider can store in its manifest / generate itself.

I've kept the instanceID but made it a uint32 for performance / size reasons. I think it's still a good idea to have it at this level and it seems to be in line with Radu's RFC.

Member

@RaduBerinde RaduBerinde left a comment

My thinking was that (with the introduction of the objstorage layer) this interface would no longer need to implement the entire vfs.File. There are two high-level choices here:

  1. Make the interface close to the Provider interface (i.e. have Readable/Writable implementations); all the "middleware" (object obsoletion, readahead, caching) is implemented under the interface.

  2. Make the interface lower-level and very simple (close to ExternalStorage in cockroach/cloud) and implement all the middleware on top of it.

The shared object manifest conceptually belongs together with that middleware (e.g. object cleanup is at the end of the day making sure the object store state matches the manifest); there are also practical advantages to that - the manifest can help with the middleware, e.g. we can keep track of cleanup-related stuff in there.

For now I would very much prefer we go with 2, for the sake of ease of development - it will be a lot more clunky if all the middleware and the manifest has to be developed in the crdb repository. But we should make it possible to later pull the middleware and manifest out, switching to option 1 (if we choose to).

Reviewable status: 0 of 16 files reviewed, 11 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)

Collaborator

@jbowens jbowens left a comment

Reviewable status: 0 of 16 files reviewed, 12 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


options.go line 630 at r5 (raw file):

		SharedStorage vfs.SharedStorage

		// InstanceID is a unique ID that's generated for each Pebble instance and

In the demo, this was a random a 32-bit integer, which I think isn't enough randomness to prevent collisions. Eg, with ~3,000 nodes the odds of a collision is 1:1000. I think we'll want to ensure we have uniqueness confirmed by consensus. Can we lazily initialize the instance ID once we have the store ID available?
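
(For reference, that estimate follows from the birthday bound: with n = 3,000 stores drawing 32-bit IDs uniformly at random, P(collision) ≈ n(n−1)/(2·2^32) ≈ 3000^2/2^33 ≈ 9×10^6 / 8.6×10^9 ≈ 0.001, i.e. roughly 1 in 1000.)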

@itsbilal
Member Author

itsbilal commented Feb 1, 2023

My original understanding was this would be under the interface:

  1. Path generation
  2. Obsoletion and marker-placement
  3. The blob storage driver
  4. Fake FS-level readahead

And that this would be above the interface:

  1. Demultiplexing of local vs shared storage and storing state of which file is where (or in both)
  2. Shared file manifest (that stores above state)
  3. Pebble-level readahead (might only apply if Fd is valid)

I'm happy to just import the cloud.ExternalStorage-like interface and build this all in Pebble for now, but I think we should leave a comment about pushing some of these features down to Cockroach. I can see logic about, say, "other node is in a different bucket in a different region so its path looks like this" living better in Cockroach than Pebble.
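
A rough sketch of the demultiplexing piece from the first list above (all names hypothetical; the eventual objstorage.Provider in #2267 may look quite different):

package sharedsketch

import "io"

// provider is a sketch of the demultiplexing layer described above: it
// remembers where each file lives and routes opens to local or shared storage.
type provider struct {
    isShared map[uint64]bool                             // fileNum -> lives on shared storage
    local    func(fileNum uint64) (io.ReadCloser, error) // open from the local FS
    shared   func(fileNum uint64) (io.ReadCloser, error) // open via the shared storage driver
}

func (p *provider) open(fileNum uint64) (io.ReadCloser, error) {
    if p.isShared[fileNum] {
        return p.shared(fileNum)
    }
    return p.local(fileNum)
}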

Member Author

@itsbilal itsbilal left a comment

Made the change, this PR is now just the blob storage driver interface. PTAL.

Reviewable status: 0 of 16 files reviewed, 12 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


options.go line 634 at r4 (raw file):

Previously, sumeerbhola wrote…

CockroachDB's StoreID type is an int32. Using a uint64 here seems fine, but I am curious if you think there are any performance implications of the choice being made here.

Done.


options.go line 630 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

In the demo, this was a random 32-bit integer, which I think isn't enough randomness to prevent collisions. E.g., with ~3,000 nodes the odds of a collision are about 1:1000. I think we'll want to ensure we have uniqueness confirmed by consensus. Can we lazily initialize the instance ID once we have the store ID available?

We probably could, but we'd have to go back and rename a bunch of files or so, and that seems like a lot of logistical work. I was hoping we could just lay down a marker for a shared ID, and as part of SetInstanceID we could check if there's a marker in the shared storage for that ID and if there is one, we regenerate it. That should help a ton while still conserving some bits.

Either way, this is out of this PR as the sharedstorage interface is now just a blob storage driver.


vfs/vfs.go line 150 at r4 (raw file):

Previously, sumeerbhola wrote…

Does this need to be called during Open, or just before any other operations on the SharedFS?
If it needs to be Open, then I suppose we can't use the CockroachDB storeId when this Pebble instance is embedded in CockroachDB. If it is a different ID, we will need to store the mapping somewhere. Any thoughts on these details?

Maybe the comment isn't clear, but this is Options.Experimental.InstanceID. It won't be called by Cockroach into Pebble, rather it'll be called by Pebble itself during Open into this method.

The way the instance ID determination is written right now, Pebble can decide on it before Cockroach has arrived at a store ID in the first place. That's a critically important detail as sharedFS might need this ID before it can write any files itself.


vfs/vfs.go line 155 at r4 (raw file):

Previously, sumeerbhola wrote…

There are 2 cases here:

  • Creating a new file: We want to provide a local FileNum in the parameter, so that the SharedFS can remember the mapping. The CreatorInstanceID is the one we have already set via SetInstanceID.
  • Opening an existing file: This could be a file created by this instance or by another instance. In either case, we have assigned it a local FileNum (for the latter case, that local FileNum assignment should happen in Reference). Again only the local FileNum should be needed as a parameter.

But I think we should separate these two cases into different methods, such as

ResolveNewPath(fileNum uint64) string
ResolveExistingPath(fileNum uint64) string

so that the implementation can do proper error checking, and since only the former involves a write and fsync (to record the mapping).

Path resolution is now buried inside the SharedStorage.


vfs/vfs.go line 164 at r4 (raw file):

Previously, sumeerbhola wrote…

I am becoming more keen on having an sstFS which includes APIs that are needed when backed by a shared FS implementation, and not having two FSs for dealing with ssts.

What I am missing is a very clear understanding of the APIs needed for "data export", which will be used for sending range snapshots. It is relatively easy to add a supportsExportViaSharedFile() bool method to this sstFS, for which the non-shared implementation will return false. But when it returns true, we need to do something to get the L5 and L6 ssts that we are sending over. Pebble already knows the filenums corresponding to those L5 and L6 ssts and it can manage their lifecycle by ensuring that Dereference is not called. I suppose all that is needed is a ResolvePathShared(fileNum uint64) (string, error) that will return an error if it cannot return a string name that can be sent to a remote node (say because the file is local).

Done (unnecessary as Provider will handle this).

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: 0 of 16 files reviewed, 12 unresolved discussions (waiting on @bananabrick, @jbowens, and @sumeerbhola)


options.go line 630 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

We probably could, but we'd have to go back and rename a bunch of files or so, and that seems like a lot of logistical work. I was hoping we could just lay down a marker for a shared ID, and as part of SetInstanceID we could check if there's a marker in the shared storage for that ID and if there is one, we regenerate it. That should help a ton while still conserving some bits.

Either way, this is out of this PR as the sharedstorage interface is now just a blob storage driver.

If we just want to generate this as a random 64-bit integer, it would be easy enough to do as part of the shared object manifest (aka catalog). But it would be nice if it matched the store ID so there isn't another inscrutable ID that we'd need to map when debugging things. Realistically we'd need to delay creation of any L5/L6 object until the store ID is set (which I'm not sure is feasible).

Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 1 of 13 files at r6.
Reviewable status: 1 of 16 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


vfs/vfs.go line 184 at r6 (raw file):

	io.Closer

	// ReadFile is shorthand for ReadFileAt with offset 0.

How about removing this shorthand. There is no special case of reading the start of a sstable.


vfs/vfs.go line 200 at r6 (raw file):

	// List enumerates files within the supplied prefix, returning a list of
	// objects within that prefix. If delimiter is non-empty names which have the

nit: ... non-empty, names ...


vfs/vfs.go line 202 at r6 (raw file):

	// objects within that prefix. If delimiter is non-empty names which have the
	// same prefix, prior to the delimiter, are grouped into a single result which
	// is that prefix. The order that results are returned is undefined.

So the prefix is used for filtering regardless of delimiter, but the delimiter is for grouping but not ordering? I am a bit confused. Can you give an example?

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: 1 of 16 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)


vfs/vfs.go line 178 at r6 (raw file):

// top of these methods.
//
// TODO(bilal): Consider reworking this interface to return File-like objects,

I would move this to objstorage (maybe in its own subpackage, e.g. objstorage/sharedobj) and rename File to Object in the methods.

I'd also remove this TODO - if we move the middleware behind an interface, it would be a new higher-level interface (fairly similar to the current Provider public "interface"), and we would still use this interface inside that implementation.


vfs/vfs.go line 190 at r6 (raw file):

	ReadFileAt(basename string, offset int64) (io.ReadCloser, int64, error)

	// Writer returns a writer for the requested name.

This should explain the semantics when the object exists (I believe it will be overwritten?). Maybe CreateObject is a better name than Writer.

Also, I believe this is an all-or-nothing operation (at least on S3 - either there is a single PutObject request, or a multi-part upload which only becomes effective with CompleteMultiPartUpload). Is this something that would be useful to promise here? I don't think we care about that for SSTs, maybe we would for the blob feature @sumeerbhola is working on?


vfs/vfs.go line 201 at r6 (raw file):

	// List enumerates files within the supplied prefix, returning a list of
	// objects within that prefix. If delimiter is non-empty names which have the
	// same prefix, prior to the delimiter, are grouped into a single result which

Prior to the delimiter but after the given prefix. Maybe a short example would help.


vfs/vfs.go line 202 at r6 (raw file):

	// objects within that prefix. If delimiter is non-empty names which have the
	// same prefix, prior to the delimiter, are grouped into a single result which
	// is that prefix. The order that results are returned is undefined.

How can the caller differentiate between an object and a group ("subdirectory")?


vfs/vfs.go line 202 at r6 (raw file):

Previously, sumeerbhola wrote…

So the prefix is used for filtering regardless of delimiter, but the delimiter is for grouping but not ordering? I am a bit confused. Can you give an example?

This is my understanding - say we have files

a
x/y/1
x/y/2
x/z/1
x/z/2

Say we use no prefix (prefix = ""). If we don't use a delimiter, we get all of the above. If we use delimiter / we get a and x. A set delimiter means we are listing "directory style"

Say we use prefix x/. If we don't use a delimiter, we get all four that start with x/. If we use delimiter / we get x/y and x/z (or maybe y and z - I'm not sure if the result would contain the x/ prefix we provided or not).

AFAICT, there is nothing special about the delimiter used. From what I've read, S3 will treat all names as opaque strings but will use a prefix to break things up into shards internally (the first X chars, where X is chosen depending on how many different values there are for X). Which explains why in some cases you want to include hashed shard early in the filenames.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 1 of 16 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


vfs/vfs.go line 190 at r6 (raw file):

Previously, RaduBerinde wrote…

This should explain the semantics when the object exists (I believe it will be overwritten?). Maybe CreateObject is a better name than Writer.

Also, I believe this is an all-or-nothing operation (at least on S3 - either there is a single PutObject request, or a multi-part upload which only becomes effective with CompleteMultiPartUpload). Is this something that would be useful to promise here? I don't think we care about that for SSTs, maybe we would for the blob feature @sumeerbhola is working on?

Blob files will be written the same way as SSTs.


vfs/vfs.go line 202 at r6 (raw file):

Previously, RaduBerinde wrote…

This is my understanding - say we have files

a
x/y/1
x/y/2
x/z/1
x/z/2

Say we use no prefix (prefix = ""). If we don't use a delimiter, we get all of the above. If we use delimiter / we get a and x. A set delimiter means we are listing "directory style"

Say we use prefix x/. If we don't use a delimiter, we get all four that start with x/. If we use delimiter / we get x/y and x/z (or maybe y and z - I'm not sure if the result would contain the x/ prefix we provided or not).

AFAICT, there is nothing special about the delimiter used. From what I've read, S3 will treat all names as opaque strings but will use a prefix to break things up into shards internally (the first X chars, where X is chosen depending on how many different values there are for X). Which explains why in some cases you want to include hashed shard early in the filenames.

Is this behavior specific to S3?
It sounds like specifying the delimiter means don't do a recursive search into the directory for files. The phrasing sounds confusing to me.

Do we have a use case for specifying the delimiter? If not, can we drop it from the interface until we need it.

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: 1 of 16 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)


vfs/vfs.go line 202 at r6 (raw file):

Previously, sumeerbhola wrote…

Is this behavior specific to S3?
It sounds like specifying the delimiter means don't do a recursive search into the directory for files. The phrasing sounds confusing to me.

Do we have a use case for specifying the delimiter? If not, can we drop it from the interface until we need it.

Google and Azure support the same delimiter semantics (as does the CRDB cloud.ExternalStorage interface).

I don't immediately see a use case for us. We could efficiently list all "creator-id"s that have shared objects, I guess.

Member Author

@itsbilal itsbilal left a comment

Dismissed @sumeerbhola from 6 discussions.
Reviewable status: 1 of 16 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


vfs/vfs.go line 178 at r6 (raw file):

Previously, RaduBerinde wrote…

I would move this to objstorage (maybe in its own subpackage, e.g. objstorage/sharedobj) and rename File to Object in the methods.

I'd also remove this TODO - if we move the middleware behind an interface, it would be a new higher-level interface (fairly similar to the current Provider public "interface"), and we would still use this interface inside that implementation.

Moved. Also updated the TODO to just be about the latter part - pushing obsoletion behind it. It's just a TODO so we can always just drop it.


vfs/vfs.go line 184 at r6 (raw file):

Previously, sumeerbhola wrote…

How about removing this shorthand. There is no special case of reading the start of a sstable.

Done.


vfs/vfs.go line 201 at r6 (raw file):

Previously, RaduBerinde wrote…

Prior to the delimiter but after the given prefix. Maybe a short example would help.

Done.


vfs/vfs.go line 202 at r6 (raw file):

Previously, RaduBerinde wrote…

Google and Azure support the same delimiter semantics (as does the CRDB cloud.ExternalStorage interface).

I don't immediately see a use case for us. We could efficiently list all "creator-id"s that have shared objects, I guess.

Good point, added some examples. PTAL.


vfs/vfs.go line 202 at r6 (raw file):

Previously, RaduBerinde wrote…

How can the caller differentiate between an object or a group ("subdirectory")

Not in an obvious way. The ReaderAt and Size call would fail if you called it at just a directory. Mimics how we use ExternalStorage.

@itsbilal itsbilal changed the title *: Add vfs.SharedStorage interface for storing sstables *: Add objstorage.shared.Storage interface for storing sstables Feb 6, 2023
Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: 0 of 17 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


objstorage/shared/storage.go line 44 at r7 (raw file):

	//   List("", "/") -> ["a", "b"]
	//   List("b", "/") -> ["4", "5", "6"]
	//   List("b", "") -> ["4", "5", "6"]

Shouldn't this be "/4", "/5", "/6"?

This change adds a objstorage.shared.Storage interface that can be
implemented by blob storage drivers. The objstorage.Provider
coming in cockroachdb#2267 will largely be responsible for storing
state of files' locations (i.e. shared or local), and calling into
Storage as necessary.
@itsbilal
Member Author

itsbilal commented Feb 6, 2023

objstorage/shared/storage.go line 44 at r7 (raw file):

Previously, RaduBerinde wrote…

Shouldn't this be "/4", "/5", "/6"?

Oops, yes. Fixed

Member

@RaduBerinde RaduBerinde left a comment

:lgtm:

Reviewable status: 0 of 17 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)

@itsbilal itsbilal merged commit e35d9f5 into cockroachdb:master Feb 7, 2023
@itsbilal itsbilal deleted the vfs-shared branch February 7, 2023 14:38
@itsbilal itsbilal restored the vfs-shared branch February 7, 2023 14:38
@itsbilal itsbilal deleted the vfs-shared branch February 7, 2023 14:38
@itsbilal itsbilal restored the vfs-shared branch February 7, 2023 14:38