storage: sideloading bypasses RocksDB env for some operations #31913
Comments
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf We were accounting for sideloaded payloads (SSTs) when adding them to the log, but were omitting them during truncations. As a result, the tracked raft log size would permanently skyrocket which in turn would lead to extremely aggressive truncations and resulted in pathological amounts of Raft snapshots. I'm still concerned about this logic as we're now relying on numbers obtained from the file system to match exactly a prior in-mem computation, and there may be other bugs that cause a divergence. But this fixes the blatant obvious one, so it's a step in the right direction. The added tests highlight a likely omission in the sideloaded storage code which doesn't access the file system through the RocksDB env as it seems like it should, filed as #31913. At this point it's unclear whether it fixes the below issues, but at the very least it seems to address a major problem they encountered: Touches #31732. Touches #31740. Touches #30261. Touches #31768. Touches #31745. Release note (bug fix): avoid a performance degradation related to overly aggressive Raft log truncations that could occur during RESTORE or IMPORT operations. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
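To make the accounting problem concrete: sideloaded bytes were added to the tracked raft log size on append but never subtracted on truncation, so the estimate could only grow. A minimal Go sketch of the corrected bookkeeping, using simplified, hypothetical names (`raftLogSize`, `sideloadStorage`, `truncateTo`) rather than the actual CockroachDB code:

```go
package main

import "fmt"

// sideloadStorage is a simplified stand-in for the sideloaded SST storage.
type sideloadStorage struct {
	// bytes of each sideloaded payload, keyed by raft log index.
	sizes map[uint64]int64
}

// truncateTo drops payloads below firstIndex and reports how many bytes were
// freed, so the caller can keep its raft log size estimate in sync.
func (s *sideloadStorage) truncateTo(firstIndex uint64) (freed int64) {
	for idx, sz := range s.sizes {
		if idx < firstIndex {
			freed += sz
			delete(s.sizes, idx)
		}
	}
	return freed
}

func main() {
	var raftLogSize int64

	s := &sideloadStorage{sizes: map[uint64]int64{}}

	// Appending a sideloaded entry bumps the tracked log size...
	s.sizes[10] = 32 << 20
	raftLogSize += 32 << 20

	// ...and the fix is to decrement it again when the entry is truncated.
	// Before the fix this subtraction was missing, so raftLogSize only grew.
	raftLogSize -= s.truncateTo(11)

	fmt.Println("tracked raft log size after truncation:", raftLogSize)
}
```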
@mberhault Also related to encryption at rest. It is believed that this leaks non-encrypted data to disk.
Hmm, not sure about that. The files are written through the env: see cockroach/pkg/storage/syncing_write.go, lines 90 to 114 at 2aba151.
@mberhault also tells me in #19783 that the encrypted files have the same size as the unencrypted version, so in my eyes this becomes a cleanup rather than a technical necessity.
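For context on the syncing_write.go helper referenced above, here is a rough sketch, assuming it writes the payload in fixed-size chunks with a sync after each chunk so a large SST write doesn't build up a big burst of dirty data; the real code obtains the file through the engine's env, while `*os.File` is used here only to keep the example self-contained:

```go
package main

import "os"

const chunkSize = 512 << 10 // 512 KiB between syncs; an arbitrary choice for this sketch

// writeFileSyncing writes data to f in chunks, calling Sync after each chunk.
// In the real code f would come from the engine's env, which is the property
// the comment above is checking for.
func writeFileSyncing(f *os.File, data []byte) error {
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := f.Write(data[:n]); err != nil {
			return err
		}
		if err := f.Sync(); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "sideload-*.sst")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err := writeFileSyncing(f, make([]byte, 2<<20)); err != nil {
		panic(err)
	}
}
```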
Update filesystem access to always go through the storage engine's filesystem interface, which ensures correctness for in-mem and encrypted filesystems Also, add a Stat function to the `storage/fs.FS` interface. The RocksDB implementation is still a hack, because the RocksDB Env doesn't expose sufficient information for implementing. For on-disk RocksDB engines, this implementation circumvents the Env, performing a direct os.Stat of the filesystem. For in-memory RocksDB engines, it provides a mocked `os.FileInfo` implementation that panics when any of its methods are invoked. This is sufficient for existing call sites that use Stat to check for existence. Fixes cockroachdb#42034. Related to cockroachdb#31913. Release note: None
49717: kvserver: use engine's filesystem r=jbowens a=jbowens Update filesystem access to always go through the storage engine's filesystem interface, which ensures correctness for in-mem and encrypted filesystems Also, add a Stat function to the storage/fs.FS interface. The RocksDB implementation is still a hack, because the RocksDB Env doesn't expose sufficient information for implementing. For on-disk RocksDB engines, this implementation circumvents the Env, performing a direct os.Stat of the filesystem. For in-memory RocksDB engines, it provides a mocked os.FileInfo implementation. Fixes #42034. Related to #31913. Release note: None Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
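Roughly, the Stat addition described above can be pictured like this; the interface and type names are simplified stand-ins rather than the real storage/fs.FS API. The on-disk implementation can fall back to os.Stat, while the in-memory one returns a stub os.FileInfo that panics on use, which is enough for callers that only check for existence:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// FS is a trimmed-down stand-in for a filesystem interface with a Stat method.
type FS interface {
	Stat(name string) (os.FileInfo, error)
}

// diskFS defers to the real filesystem, roughly what the on-disk RocksDB
// implementation does by circumventing the Env.
type diskFS struct{}

func (diskFS) Stat(name string) (os.FileInfo, error) { return os.Stat(name) }

// memFS tracks files purely in memory and returns a stub FileInfo; callers
// that only check for existence never touch the stub's methods.
type memFS struct{ files map[string]bool }

func (m memFS) Stat(name string) (os.FileInfo, error) {
	if !m.files[name] {
		return nil, os.ErrNotExist
	}
	return stubFileInfo{}, nil
}

// stubFileInfo panics on use, mirroring the "mocked os.FileInfo" described in
// the PR: it exists only so Stat can signal that the file is present.
type stubFileInfo struct{}

func (stubFileInfo) Name() string       { panic("not implemented") }
func (stubFileInfo) Size() int64        { panic("not implemented") }
func (stubFileInfo) Mode() os.FileMode  { panic("not implemented") }
func (stubFileInfo) ModTime() time.Time { panic("not implemented") }
func (stubFileInfo) IsDir() bool        { panic("not implemented") }
func (stubFileInfo) Sys() interface{}   { panic("not implemented") }

func main() {
	var fs FS = memFS{files: map[string]bool{"auxiliary/sideloading/file.sst": true}}
	_, err := fs.Stat("auxiliary/sideloading/file.sst")
	fmt.Println("exists:", err == nil)
}
```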
@petermattis could storage take a look at this? Concretely, whether we're leaking unencrypted data to disk and, if not, whether we feel good about the status quo.
I think we're all good. Closing.
While investigating cockroachdb#59126, I realized we were still going through the engine for listing the files. I rectified this and as a result, completely removed the in-mem implementation of the sideloaded storage that we had in place: instead of it, we simply use an in-mem engine, which now works like a charm thanks to not bypassing the engine for listing the file contents (`filepath.Glob` --> `eng.List`). While I was here, I added some defense in depth: it was unreasonable that `TruncateTo` was returning an error as a result of failing to perform best-effort cleanup on its directory. I am no longer seeing the spurious errors observed in cockroachdb#59126 with this change. This makes sense, as in this test the sideloaded storage is set up with an in-memory engine. The old code, which read from actual disk, would try to glob `auxiliary/...` (i.e. some path relative to the cwd - oops), which would return no results. As a result, the sideloaded storage would try to invoke `eng.Remove("auxiliary/...")` which the engine would refuse, since there would actually be (in-mem) files stored in the engine. This all resolves now that everyone is looking at the same files. It's possible that this fixes the issue with BenchmarkUserFile, as previously errors from TruncateTo could block the raft log queue, which in turn could at least slow down the test, but it's hard to say for sure. Touches cockroachdb#59126. Touches cockroachdb#31913. Release note: None
59103: tracing: link child into parent, even if not verbose r=irfansharif a=tbg Prior to this change, when a child was derived from a local parent, we would not register the child with the parent. In effect, this meant that any payloads in the child would not be collected. Release note: None 59134: kvserver: avoid engine bypass in sideload storage r=jbowens a=tbg While investigating #59126, I realized we were still going through the engine for listing the files. I rectified this and as a result, completely removed the in-mem implementation of the sideloaded storage that we had in place: instead of it, we simply use an in-mem engine, which now works like a charm thanks to not bypassing the engine for listing the file contents (`filepath.Glob` --> `eng.List`). While I was here, I added some defense in depth: it was unreasonable that `TruncateTo` was returning an error as a result of failing to perform best-effort cleanup on its directory. I am no longer seeing the spurious errors observed in #59126 with this change. Touches #59126. Touches #31913. Fixes #58948. Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
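A loose sketch of the shape of that change, with a made-up `Engine` interface standing in for the real storage engine API: listing goes through the engine (so an in-memory engine sees the same files it wrote), and directory cleanup failures during truncation are logged instead of returned:

```go
package main

import (
	"fmt"
	"log"
)

// Engine is a minimal stand-in for the storage engine's filesystem surface.
// An in-memory engine satisfies it just as well as an on-disk one, which is
// what lets the dedicated in-mem sideload implementation go away.
type Engine interface {
	List(dir string) ([]string, error)
	Remove(path string) error
}

// memEngine keeps "files" in a map, standing in for an in-memory engine.
type memEngine struct{ files map[string]bool }

func (m *memEngine) List(dir string) ([]string, error) {
	var out []string
	for f := range m.files {
		out = append(out, f)
	}
	return out, nil
}

func (m *memEngine) Remove(path string) error {
	delete(m.files, path)
	return nil
}

// truncateSideloaded removes sideloaded payloads for entries below firstIndex,
// listing them through the engine (the old code used filepath.Glob against the
// real filesystem, which an in-mem engine never sees). Failing to clean up the
// directory itself is treated as best-effort and only logged. Filenames are
// assumed to follow an i<index>.t<term> pattern for this sketch.
func truncateSideloaded(eng Engine, dir string, firstIndex uint64) error {
	files, err := eng.List(dir)
	if err != nil {
		return err
	}
	for _, f := range files {
		var index, term uint64
		if _, err := fmt.Sscanf(f, "i%d.t%d", &index, &term); err != nil {
			continue // not a sideloaded payload
		}
		if index < firstIndex {
			if err := eng.Remove(f); err != nil {
				return err
			}
		}
	}
	if rest, _ := eng.List(dir); len(rest) == 0 {
		if err := eng.Remove(dir); err != nil {
			log.Printf("best-effort removal of %s failed: %v", dir, err)
		}
	}
	return nil
}

func main() {
	eng := &memEngine{files: map[string]bool{"i5.t1": true, "i9.t2": true}}
	if err := truncateSideloaded(eng, "auxiliary/sideloading", 8); err != nil {
		log.Fatal(err)
	}
	fmt.Println("remaining files:", len(eng.files)) // i9.t2 survives
}
```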
I noticed in #31732 (comment) that the sideloaded storage does not perform the `TruncateTo` operation through the RocksDB env (cockroach/pkg/storage/replica_sideload_disk.go, lines 135 to 149 in fa7fb35).

This may or may not be a problem, but it's definitely suspicious. Note how `Clear`, just above the snippet, does go through extra trouble to use the env. Perhaps this all works as intended, depending on how the RocksDB env is implemented. It does seem to work when encryption is on, probably because the env really just does the same thing writing directly to the FS would do. With encryption, is that still true, i.e. are files still passed through to the filesystem into the same location (mod scrambling the contents)? I suspect it might be, but then why does `Clear` go the extra mile through the env? This really seems like something that needs to be investigated and cleaned up.

Note that there's one case where things are definitely broken, namely the one in which you use an in-memory RocksDB instance. In that case, `TruncateTo` will snoop around on disk, but the env will keep everything in memory (i.e. `TruncateTo` becomes a no-op and garbage will pile up in-mem).
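To spell out why the in-memory case is broken: files the env keeps in memory are invisible to direct filesystem calls, so a `TruncateTo` built on `filepath.Glob` and `os.Remove` quietly does nothing while payloads accumulate. A small self-contained illustration, where the in-mem env is just a map and purely hypothetical:

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Pretend the in-memory env "wrote" two sideloaded SSTs. They exist only
	// inside the env's map, never on the real filesystem.
	inMemEnv := map[string][]byte{
		"auxiliary/sideloading/r17/i5.t1": make([]byte, 1024),
		"auxiliary/sideloading/r17/i9.t2": make([]byte, 2048),
	}

	// A TruncateTo that bypasses the env and globs the real filesystem finds
	// nothing to delete, so it silently becomes a no-op...
	matches, err := filepath.Glob("auxiliary/sideloading/r17/i*")
	fmt.Printf("on-disk matches: %d (err=%v)\n", len(matches), err)

	// ...while the payloads keep piling up in memory.
	fmt.Printf("files still held by the in-mem env: %d\n", len(inMemEnv))
}
```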