storage: sideloading bypasses RocksDB env for some operations #31913
Comments
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf We were accounting for sideloaded payloads (SSTs) when adding them to the log, but were omitting them during truncations. As a result, the tracked raft log size would permanently skyrocket which in turn would lead to extremely aggressive truncations and resulted in pathological amounts of Raft snapshots. I'm still concerned about this logic as we're now relying on numbers obtained from the file system to match exactly a prior in-mem computation, and there may be other bugs that cause a divergence. But this fixes the blatant obvious one, so it's a step in the right direction. The added tests highlight a likely omission in the sideloaded storage code which doesn't access the file system through the RocksDB env as it seems like it should, filed as #31913. At this point it's unclear whether it fixes the below issues, but at the very least it seems to address a major problem they encountered: Touches #31732. Touches #31740. Touches #30261. Touches #31768. Touches #31745. Release note (bug fix): avoid a performance degradation related to overly aggressive Raft log truncations that could occur during RESTORE or IMPORT operations. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
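To make the accounting problem concrete: sideloaded bytes were added to the tracked raft log size on append but never subtracted on truncation, so the estimate could only grow. A minimal Go sketch of the corrected bookkeeping, using simplified, hypothetical names (`raftLogSize`, `sideloadStorage`, `truncateTo`) rather than the actual CockroachDB code:

```go
package main

import "fmt"

// sideloadStorage is a simplified stand-in for the sideloaded SST storage.
type sideloadStorage struct {
	// bytes of each sideloaded payload, keyed by raft log index.
	sizes map[uint64]int64
}

// truncateTo drops payloads below firstIndex and reports how many bytes were
// freed, so the caller can keep its raft log size estimate in sync.
func (s *sideloadStorage) truncateTo(firstIndex uint64) (freed int64) {
	for idx, sz := range s.sizes {
		if idx < firstIndex {
			freed += sz
			delete(s.sizes, idx)
		}
	}
	return freed
}

func main() {
	var raftLogSize int64

	s := &sideloadStorage{sizes: map[uint64]int64{}}

	// Appending a sideloaded entry bumps the tracked log size...
	s.sizes[10] = 32 << 20
	raftLogSize += 32 << 20

	// ...and the fix is to decrement it again when the entry is truncated.
	// Before the fix this subtraction was missing, so raftLogSize only grew.
	raftLogSize -= s.truncateTo(11)

	fmt.Println("tracked raft log size after truncation:", raftLogSize)
}
```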
@mberhault Also related to encryption at rest. It is believed that this leaks non-encrypted data to disk.
Hmm, not sure about that. The files are written through the env: see cockroach/pkg/storage/syncing_write.go, lines 90 to 114 at 2aba151.
@mberhault also tells me in #19783 that the encrypted files have the same size as the unencrypted version, so in my eyes this becomes a cleanup rather than a technical necessity.
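For context on the syncing_write.go helper referenced above, here is a rough sketch, assuming it writes the payload in fixed-size chunks with a sync after each chunk so a large SST write doesn't build up a big burst of dirty data; the real code obtains the file through the engine's env, while `*os.File` is used here only to keep the example self-contained:

```go
package main

import "os"

const chunkSize = 512 << 10 // 512 KiB between syncs; an arbitrary choice for this sketch

// writeFileSyncing writes data to f in chunks, calling Sync after each chunk.
// In the real code f would come from the engine's env, which is the property
// the comment above is checking for.
func writeFileSyncing(f *os.File, data []byte) error {
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := f.Write(data[:n]); err != nil {
			return err
		}
		if err := f.Sync(); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "sideload-*.sst")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err := writeFileSyncing(f, make([]byte, 2<<20)); err != nil {
		panic(err)
	}
}
```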
Update filesystem access to always go through the storage engine's filesystem interface, which ensures correctness for in-mem and encrypted filesystems Also, add a Stat function to the `storage/fs.FS` interface. The RocksDB implementation is still a hack, because the RocksDB Env doesn't expose sufficient information for implementing. For on-disk RocksDB engines, this implementation circumvents the Env, performing a direct os.Stat of the filesystem. For in-memory RocksDB engines, it provides a mocked `os.FileInfo` implementation that panics when any of its methods are invoked. This is sufficient for existing call sites that use Stat to check for existence. Fixes cockroachdb#42034. Related to cockroachdb#31913. Release note: None
49717: kvserver: use engine's filesystem r=jbowens a=jbowens Update filesystem access to always go through the storage engine's filesystem interface, which ensures correctness for in-mem and encrypted filesystems Also, add a Stat function to the storage/fs.FS interface. The RocksDB implementation is still a hack, because the RocksDB Env doesn't expose sufficient information for implementing. For on-disk RocksDB engines, this implementation circumvents the Env, performing a direct os.Stat of the filesystem. For in-memory RocksDB engines, it provides a mocked os.FileInfo implementation. Fixes #42034. Related to #31913. Release note: None Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
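Roughly, the Stat addition described above can be pictured like this; the interface and type names are simplified stand-ins rather than the real storage/fs.FS API. The on-disk implementation can fall back to os.Stat, while the in-memory one returns a stub os.FileInfo that panics on use, which is enough for callers that only check for existence:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// FS is a trimmed-down stand-in for a filesystem interface with a Stat method.
type FS interface {
	Stat(name string) (os.FileInfo, error)
}

// diskFS defers to the real filesystem, roughly what the on-disk RocksDB
// implementation does by circumventing the Env.
type diskFS struct{}

func (diskFS) Stat(name string) (os.FileInfo, error) { return os.Stat(name) }

// memFS tracks files purely in memory and returns a stub FileInfo; callers
// that only check for existence never touch the stub's methods.
type memFS struct{ files map[string]bool }

func (m memFS) Stat(name string) (os.FileInfo, error) {
	if !m.files[name] {
		return nil, os.ErrNotExist
	}
	return stubFileInfo{}, nil
}

// stubFileInfo panics on use, mirroring the "mocked os.FileInfo" described in
// the PR: it exists only so Stat can signal that the file is present.
type stubFileInfo struct{}

func (stubFileInfo) Name() string       { panic("not implemented") }
func (stubFileInfo) Size() int64        { panic("not implemented") }
func (stubFileInfo) Mode() os.FileMode  { panic("not implemented") }
func (stubFileInfo) ModTime() time.Time { panic("not implemented") }
func (stubFileInfo) IsDir() bool        { panic("not implemented") }
func (stubFileInfo) Sys() interface{}   { panic("not implemented") }

func main() {
	var fs FS = memFS{files: map[string]bool{"auxiliary/sideloading/file.sst": true}}
	_, err := fs.Stat("auxiliary/sideloading/file.sst")
	fmt.Println("exists:", err == nil)
}
```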
@petermattis could storage take a look at this? Concretely, whether we're leaking unencrypted data to disk and, if not, whether we feel good about the status quo.
I think we're all good. Closing.
While investigating cockroachdb#59126, I realized we were still going through the engine for listing the files. I rectified this and as a result, completely removed the in-mem implementation of the sideloaded storage that we had in place: instead of it, we simply use an in-mem engine, which now works like a charm thanks to not bypassing the engine for listing the file contents (`filepath.Glob` --> `eng.List`). While I was here, I added some defense in depth: it was unreasonable that `TruncateTo` was returning an error as a result of failing to perform best-effort cleanup on its directory. I am no longer seeing the spurious errors observed in cockroachdb#59126 with this change. This makes sense, as in this test the sideloaded storage is set up with an in-memory engine. The old code, which read from actual disk, would try to glob `auxiliary/...` (i.e. some path relative to the cwd - oops), which would return no results. As a result, the sideloaded storage would try to invoke `eng.Remove("auxiliary/...")` which the engine would refuse, since there would actually be (in-mem) files stored in the engine. This all resolves now that everyone is looking at the same files. It's possible that this fixes the issue with BenchmarkUserFile, as previously errors from TruncateTo could block the raft log queue, which in turn could at least slow down the test, but it's hard to say for sure. Touches cockroachdb#59126. Touches cockroachdb#31913. Release note: None
59103: tracing: link child into parent, even if not verbose r=irfansharif a=tbg Prior to this change, when a child was derived from a local parent, we would not register the child with the parent. In effect, this meant that any payloads in the child would not be collected. Release note: None 59134: kvserver: avoid engine bypass in sideload storage r=jbowens a=tbg While investigating #59126, I realized we were still going through the engine for listing the files. I rectified this and as a result, completely removed the in-mem implementation of the sideloaded storage that we had in place: instead of it, we simply use an in-mem engine, which now works like a charm thanks to not bypassing the engine for listing the file contents (`filepath.Glob` --> `eng.List`). While I was here, I added some defense in depth: it was unreasonable that `TruncateTo` was returning an error as a result of failing to perform best-effort cleanup on its directory. I am no longer seeing the spurious errors observed in #59126 with this change. Touches #59126. Touches #31913. Fixes #58948. Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
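A loose sketch of the shape of that change, with a made-up `Engine` interface standing in for the real storage engine API: listing goes through the engine (so an in-memory engine sees the same files it wrote), and directory cleanup failures during truncation are logged instead of returned:

```go
package main

import (
	"fmt"
	"log"
)

// Engine is a minimal stand-in for the storage engine's filesystem surface.
// An in-memory engine satisfies it just as well as an on-disk one, which is
// what lets the dedicated in-mem sideload implementation go away.
type Engine interface {
	List(dir string) ([]string, error)
	Remove(path string) error
}

// memEngine keeps "files" in a map, standing in for an in-memory engine.
type memEngine struct{ files map[string]bool }

func (m *memEngine) List(dir string) ([]string, error) {
	var out []string
	for f := range m.files {
		out = append(out, f)
	}
	return out, nil
}

func (m *memEngine) Remove(path string) error {
	delete(m.files, path)
	return nil
}

// truncateSideloaded removes sideloaded payloads for entries below firstIndex,
// listing them through the engine (the old code used filepath.Glob against the
// real filesystem, which an in-mem engine never sees). Failing to clean up the
// directory itself is treated as best-effort and only logged. Filenames are
// assumed to follow an i<index>.t<term> pattern for this sketch.
func truncateSideloaded(eng Engine, dir string, firstIndex uint64) error {
	files, err := eng.List(dir)
	if err != nil {
		return err
	}
	for _, f := range files {
		var index, term uint64
		if _, err := fmt.Sscanf(f, "i%d.t%d", &index, &term); err != nil {
			continue // not a sideloaded payload
		}
		if index < firstIndex {
			if err := eng.Remove(f); err != nil {
				return err
			}
		}
	}
	if rest, _ := eng.List(dir); len(rest) == 0 {
		if err := eng.Remove(dir); err != nil {
			log.Printf("best-effort removal of %s failed: %v", dir, err)
		}
	}
	return nil
}

func main() {
	eng := &memEngine{files: map[string]bool{"i5.t1": true, "i9.t2": true}}
	if err := truncateSideloaded(eng, "auxiliary/sideloading", 8); err != nil {
		log.Fatal(err)
	}
	fmt.Println("remaining files:", len(eng.files)) // i9.t2 survives
}
```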
I noticed in #31732 (comment) that the sideloaded storage does not perform the `TruncateTo` operation through the RocksDB env (cockroach/pkg/storage/replica_sideload_disk.go, lines 135 to 149 in fa7fb35).

This may or may not be a problem, but it's definitely suspicious. Note how `Clear`, just above the snippet, does go through extra trouble to use the env. Perhaps this all works as intended, depending on how the RocksDB env is implemented. It does seem to work when encryption is on, probably because the env really just does the same thing writing directly to the FS would do. With encryption, is that still true, i.e. are files still passed through to the filesystem into the same location (mod scrambling the contents)? I suspect it might be, but then why does `Clear` go the extra mile through the env? This really seems like something that needs to be investigated and cleaned up.

Note that there's one case where things are definitely broken, namely the one in which you use an in-memory RocksDB instance. In that case, `TruncateTo` will snoop around on disk, but the env will keep everything in memory (i.e. `TruncateTo` becomes a no-op and garbage will pile up in-mem).
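To spell out why the in-memory case is broken: files the env keeps in memory are invisible to direct filesystem calls, so a `TruncateTo` built on `filepath.Glob` and `os.Remove` quietly does nothing while payloads accumulate. A small self-contained illustration, where the in-mem env is just a map and purely hypothetical:

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Pretend the in-memory env "wrote" two sideloaded SSTs. They exist only
	// inside the env's map, never on the real filesystem.
	inMemEnv := map[string][]byte{
		"auxiliary/sideloading/r17/i5.t1": make([]byte, 1024),
		"auxiliary/sideloading/r17/i9.t2": make([]byte, 2048),
	}

	// A TruncateTo that bypasses the env and globs the real filesystem finds
	// nothing to delete, so it silently becomes a no-op...
	matches, err := filepath.Glob("auxiliary/sideloading/r17/i*")
	fmt.Printf("on-disk matches: %d (err=%v)\n", len(matches), err)

	// ...while the payloads keep piling up in memory.
	fmt.Printf("files still held by the in-mem env: %d\n", len(inMemEnv))
}
```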