Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage/engine: add FS interface and sanitize handling of aux-dir #42034

Closed
petermattis opened this issue Oct 30, 2019 · 9 comments · Fixed by #49717
Closed

storage/engine: add FS interface and sanitize handling of aux-dir #42034

petermattis opened this issue Oct 30, 2019 · 9 comments · Fixed by #49717
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.

Comments

@petermattis
Copy link
Collaborator

The handling of Engine.GetAuxiliaryDir is a bit of a mess. diskSideloadStorage, and SSTSnapshotStorage both use the aux dir for storage, but assume that the filesystem routines in os will work on them. This is not true for in-memory engines and it tests only accidentally work because the RocksDB MemEnv does not implement full filesystem semantics (a parent directory is auto-created when a file is created).

In addition to cleaning up the handling of aux-dir, we should separate out the Engine.*File routines to their own interface. The pebble/vfs.FS interface is a good one to model on, though I think we only need a subset of that interface. For example, FS.Lock and FS.List do not appear to be needed.

@petermattis
Copy link
Collaborator Author

Btw, my rationale for wanting to separate out the FS interface (besides a nod to cleanliness) is because that will force examination of all of the code which uses that functionality.

@petermattis petermattis self-assigned this Oct 30, 2019
@petermattis
Copy link
Collaborator Author

The SQL execution folks would like File.ReadAt which pebble/vfs.File exposes.

@sumeerbhola
Copy link
Collaborator

Some clarifying questions:

  • Current callers that are using the os package to interact with strings representing some path in the auxiliary dir: There are calls to os.MkdirAll(), os.RemoveAll(), os.Stat(), os.Rename() (in newDiskSideloadStorage()).

    • I am assuming we can migrate them sometime after we add the new interface. Correct?
    • Is the auxiliary dir not holding encrypted files since doing a rename on a directory containing encrypted files without involving the rocksdb::Env or the pebble vfs.FS would cause those files to become unreadable? Or does the code that uses os.Rename() know that the directory will be empty?
  • If we add a new engine.FS interface, should it be implemented by pebble's vfs.FS even if the engine is configured to be RocksDB? I was initially assuming yes, to avoid implementing it in C++ too, but it doesn't seem possible if encryption is enabled -- we can't use Pebble's encrypted FS and the RocksDB encrypted Env at the same time since they use the same file paths for the file and key registry files.

  • Should the engine.FS interface be closer to vfs.FS, so we will need engine.File (or just reuse vfs.File), or do whole file operations like the current engine interface? If we need File.ReadAt() I am assuming the former and the latter can just be helper wrappers.

    • To implement these with the rocksdb::Env we will need to map these to the WritableFile, RandomAccessFile, SequentialFile, RandomRWFile interfaces: this should be easy. I think we would map to a WritableFile or a RandomAccessFile.
    • Since we need a RocksDB implementation do we need to avoid the cgo crossings by keeping the whole file operations in the interface -- so we will have both the current whole file operations and vfs.FS?

@petermattis
Copy link
Collaborator Author

I am assuming we can migrate them sometime after we add the new interface. Correct?

Yes. I think all of these direct uses of os should go away, but it doesn't need to be done in lock-step.

Is the auxiliary dir not holding encrypted files since doing a rename on a directory containing encrypted files without involving the rocksdb::Env or the pebble vfs.FS would cause those files to become unreadable? Or does the code that uses os.Rename() know that the directory will be empty?

I'm not sure this is all being done correctly. The combined usage of os and the engine.*File methods smells like lurking bugs to me.

If we add a new engine.FS interface, should it be implemented by pebble's vfs.FS even if the engine is configured to be RocksDB? I was initially assuming yes, to avoid implementing it in C++ too, but it doesn't seem possible if encryption is enabled -- we can't use Pebble's encrypted FS and the RocksDB encrypted Env at the same time since they use the same file paths for the file and key registry files.

I was assuming no. As you note, we need engine.FS to plumb through into the rocksdb::Env methods. We have a smattering of the needed plumbing already done. I'm proposing we flesh out the remaining bits.

Should the engine.FS interface be closer to vfs.FS, so we will need engine.File (or just reuse vfs.File), or do whole file operations like the current engine interface? If we need File.ReadAt() I am assuming the former and the latter can just be helper wrappers.

I think engine.FS should be closer to vfs.FS, probably a subset. I think engine.File should be a subset of vfs.File (eliding the methods we don't use).

Since we need a RocksDB implementation do we need to avoid the cgo crossings by keeping the whole file operations in the interface -- so we will have both the current whole file operations and vfs.FS?

I think the whole file operations could be emulated by writing the entire file in one chunk and then performing a close. So 1 cgo call becomes 3: Open/Write/Close. I'm pretty sure the cost of the Write will overwhelm the extra cgo calls.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Dec 5, 2019
The interface is not identical to Pebble's vfs.FS and missing
some of the functionality due to the need to implement this also
via the RocksDB Env. But it does include File.ReadAt that is
mentioned in cockroachdb#42034.

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Dec 10, 2019
The interface is not identical to Pebble's vfs.FS and missing
some of the functionality due to the need to implement this also
via the RocksDB Env. But it does include File.ReadAt that is
mentioned in cockroachdb#42034.
And added some tests for both existing and new functionality.

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Dec 10, 2019
The interface is not identical to Pebble's vfs.FS and missing
some of the functionality due to the need to implement this also
via the RocksDB Env. But it does include File.ReadAt that is
mentioned in cockroachdb#42034.
And added some tests for both existing and new functionality.

Release note: None
craig bot pushed a commit that referenced this issue Dec 10, 2019
42995: storage/engine: add basic FS interface r=sumeerbhola a=sumeerbhola

The interface is not identical to Pebble's vfs.FS and missing
some of the functionality due to the need to implement this also
via the RocksDB Env. But it does include File.ReadAt that is
mentioned in #42034.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@petermattis petermattis assigned jbowens and unassigned sumeerbhola Apr 29, 2020
@petermattis
Copy link
Collaborator Author

@jbowens We've added the storage/fs.FS interface to storage.Engine, but there is a bunch more cleanup we can plumb through here. In particular, various bits of kv/kvserver do stuff directly on the filesystem using os even when using an "in-memory" engine. I think anything in kv/kvserver which uses os file operations should be cleaned up. There are also some TODOs in the code about hacks that have been added such as https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble.go#L439-L444.

Another cleanup: storage/fs.FS is similar to pebble/vfs.FS, but uses slightly different names. There is no strong reason for this difference. I'd like to move everything to use the pebble/vfs.FS names which are a closer match to what exists in os.

@jbowens
Copy link
Collaborator

jbowens commented May 26, 2020

Just noticed this discrepancy between the storage.fs.FS documentation and the pebble implementation removing nonexistent directories. It seems like the implementation is the correct one, for as long as we're maintaining parity with the RocksDB Env, right?

// RemoveDirAndFiles deletes the directory and any files it contains but
// not subdirectories. If dir does not exist, RemoveDirAndFiles returns nil
// (no error).

// NB RemoveAll does not return an error if dir does not exist, but we want
// RemoveDirAndFiles to return an error in that case to match the RocksDB
// behavior.

@petermattis
Copy link
Collaborator Author

Just noticed this discrepancy between the storage.fs.FS documentation and the pebble implementation removing nonexistent directories. It seems like the implementation is the correct one, for as long as we're maintaining parity with the RocksDB Env, right?

Yes. I think we could invert this and have RocksDB special case not return an error. It would just require elbow grease to plumb that behavior change (I think it affects tests, but I'm not sure of that).

jbowens added a commit to jbowens/cockroach that referenced this issue May 29, 2020
Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the storage/fs.FS interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat
of the filesystem. For in-memory RocksDB engines, it provides a mocked
os.FileInfo implementation.

Fixes cockroachdb#42034.
Related to cockroachdb#31913.

Release note: None
@jbowens
Copy link
Collaborator

jbowens commented May 29, 2020

#31913 looks related, although #49717 doesn't exactly fix the mentioned fact that file size reads aren't threaded all the way through to the RocksDB Env. IIRC, Env does have a getFileAttributes method that includes file size. I'm not sure if we care about that bit of cleanup since it's RocksDB-only and the encrypted file sizes are equal?

jbowens added a commit to jbowens/cockroach that referenced this issue May 29, 2020
Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the storage/fs.FS interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat
of the filesystem. For in-memory RocksDB engines, it provides a mocked
os.FileInfo implementation that panics when any of its methods are
invoked. This is sufficient for existing call sites that use Stat to
check for existence.

Fixes cockroachdb#42034.
Related to cockroachdb#31913.

Release note: None
@petermattis
Copy link
Collaborator Author

#31913 looks related, although #49717 doesn't exactly fix the mentioned fact that file size reads aren't threaded all the way through to the RocksDB Env. IIRC, Env does have a getFileAttributes method that includes file size. I'm not sure if we care about that bit of cleanup since it's RocksDB-only and the encrypted file sizes are equal?

I don't particular mind that the file size retrieval isn't thread all the way through to RocksDB, but we should definitely be using a method on engine.FS for retrieving the file size.

@knz knz added A-storage Relating to our storage engine (Pebble) on-disk storage. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. labels Jun 4, 2020
craig bot pushed a commit that referenced this issue Jun 8, 2020
49717: kvserver: use engine's filesystem r=jbowens a=jbowens

Update filesystem access to always go through the storage engine's
filesystem interface, which ensures correctness for in-mem and encrypted
filesystems

Also, add a Stat function to the storage/fs.FS interface. The RocksDB
implementation is still a hack, because the RocksDB Env doesn't expose
sufficient information for implementing. For on-disk RocksDB engines,
this implementation circumvents the Env, performing a direct os.Stat
of the filesystem. For in-memory RocksDB engines, it provides a mocked
os.FileInfo implementation.

Fixes #42034.
Related to #31913.

Release note: None

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
@craig craig bot closed this as completed in 028ca9e Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants