tests use lots of disk space after recent CockroachDB upgrade #1004

Closed
davepacheco opened this issue May 3, 2022 · 1 comment · Fixed by #1005

@davepacheco

tl;dr: Under #988 we updated CockroachDB to v21.2.9, which includes automatic creation of ballast files. Under some conditions (including our CI runners and tests run with a normal tmpfs), this uses disk space in tmpfs on the order of 1 GiB per concurrent test. People are seeing the test suite fail with ENOSPC both locally and in CI. @jgallagher is looking at disabling ballast files for the test suite, since they serve no purpose there.

@davepacheco

@bnaecker had reported under #988 that he was seeing test runs fail with ENOSPC in tmpfs. In chat he reported:

---- db::pagination::test::test_paginated_multicolumn_descending stdout ----
log file: "/tmp/omicron_nexus-1855d6cf5935ccef-test_paginated_multicolumn_descending.6410.21.log"
note: configured to log to "/tmp/omicron_nexus-1855d6cf5935ccef-test_paginated_multicolumn_descending.6410.21.log"
thread 'db::pagination::test::test_paginated_multicolumn_descending' panicked at 'Cannot copy storage from seed directory: Failed to copy subdirectory /home/bnaecker/omicron/target/debug/build/nexus-test-utils-42930e5e644c453a/out/crdb-base/auxiliary to /tmp/.tmpoPpDcl/data/auxiliary

Caused by:
    0: Failed to copy file at /home/bnaecker/omicron/target/debug/build/nexus-test-utils-42930e5e644c453a/out/crdb-base/auxiliary/EMERGENCY_BALLAST to /tmp/.tmpoPpDcl/data/auxiliary/EMERGENCY_BALLAST
    1: No space left on device (os error 28)', /home/bnaecker/omicron/test-utils/src/dev/mod.rs:137:14

@jgallagher reported several CI failures due to running out of disk space: one, two.

I'd noticed the ballast file stuff in the new release but hadn't connected it to this problem. These files are supposed to be used to recover from out-of-disk scenarios. The idea is that by reserving space up front that can be freed later, one can avoid using every last byte of disk space with no way to recover.

Ironically, they don't work on ZFS. We discovered this because Ben said du showed the files as tiny, even though ls showed them in the seed directory as 1 GiB:

bnaecker@feldspar : ~/omicron $ ll -h /home/bnaecker/omicron/target/debug/build/nexus-test-utils-42930e5e644c453a/out/crdb-base/auxiliary/
total 1
-rw-r-----   1 bnaecker other         1G May  3 18:07 EMERGENCY_BALLAST

I suspected this might be related to sparse files, so I truss'd cockroach debug ballast, but I found it was actually writing out blocks of zeros (this complicated invocation just filters out a bunch of noise...though there's still a lot of noise):

$ truss '-t!portfs,!lwp_park,!mmap' cockroach debug ballast the_file --size=1GiB
...
/1:	stat("the_file", 0xC00065FCB8)			Err#2 ENOENT
/1:	open("the_file", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 7
/1:	fcntl(7, F_GETFL)				= 8193
/1:	fcntl(7, F_SETFL, FREAD|FOFFMAX|FNONBLOCK)	= 0
...
/1:	write(7, "\0\0\0\0\0\0\0\0\0\0\0\0".., 67108864) = 67108864
/1:	write(7, "\0\0\0\0\0\0\0\0\0\0\0\0".., 67108864) = 67108864
...
/1:	fdsync(7, FSYNC)				= 0
/1:	close(7)					= 0

That's not sparse at all. Nor do I see any use of posix_fallocate (which is fcntl with F_ALLOCSP, and would return EINVAL on ZFS anyway). It's just writing out blocks of zeros, which ZFS compresses to nothing. You can see it like this. First, with compression off, the file really gets created with 1 GiB of physical usage:

dap@ivanova tmp $ ls -l
total 0
dap@ivanova tmp $ zfs list -oname,compression $PWD
NAME                COMPRESS
rpool/home/dap/tmp       off
dap@ivanova tmp $ cockroach debug ballast the_ballast_file --size=1GiB
dap@ivanova tmp $ ls -l the_ballast_file 
-rw-r--r--   1 dap      staff    1073741824 May  3 12:36 the_ballast_file
dap@ivanova tmp $ du -sh the_ballast_file 
1.00G	the_ballast_file

With compression on (of any kind), ZFS compresses zero-filled blocks down to almost nothing:

dap@ivanova tmp $ rm -f the_ballast_file 
dap@ivanova tmp $ pfexec zfs set compression=on rpool/home/dap/tmp
dap@ivanova tmp $ ls -l 
total 0
dap@ivanova tmp $ cockroach debug ballast the_ballast_file --size=1GiB
dap@ivanova tmp $ ls -l the_ballast_file 
-rw-r--r--   1 dap      staff    1073741824 May  3 12:36 the_ballast_file
dap@ivanova tmp $ du -sh the_ballast_file 
512	the_ballast_file
dap@ivanova tmp $ 
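
The same comparison can be made programmatically. Below is a minimal Rust sketch, assuming a Unix-like system: `write_zero_file` imitates the write-zeros-then-sync pattern seen in the truss output (it is not CockroachDB's actual code), and `logical_and_allocated` returns the size that `ls -l` would report alongside the allocated size that `du` would report. Both function names and the chunk size are hypothetical, chosen for illustration.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Write `size` bytes of zeros to `path` in chunks of at most `chunk` bytes,
/// then flush to stable storage. This imitates the write()/fdsync() pattern
/// in the truss output above; it is an illustration, not CockroachDB's code.
fn write_zero_file(path: &Path, size: u64, chunk: usize) -> io::Result<()> {
    assert!(chunk > 0, "chunk size must be non-zero");
    // File::create opens write-only, creating or truncating the file.
    let mut file = File::create(path)?;
    let zeros = vec![0u8; chunk];
    let mut remaining = size;
    while remaining > 0 {
        let n = remaining.min(chunk as u64) as usize;
        file.write_all(&zeros[..n])?;
        remaining -= n as u64;
    }
    file.sync_all()
}

/// Return (logical size in bytes, allocated size in bytes).
/// The logical size is what `ls -l` shows; the allocated size is derived
/// from the count of 512-byte blocks, which is what `du` is based on.
fn logical_and_allocated(path: &Path) -> io::Result<(u64, u64)> {
    let meta = fs::metadata(path)?;
    Ok((meta.len(), meta.blocks() * 512))
}
```

Calling `write_zero_file` and then `logical_and_allocated` on a compressed ZFS dataset should show a large logical size with a tiny allocated size, while on tmpfs the two should be roughly equal. The allocated figure depends on the filesystem, so only the logical size is predictable in advance.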

This explains why I didn't notice the problem while digging into Ben's report on #988: I was checking the file's size on ZFS, and I run my tests with TMPDIR set to a ZFS directory.

We got distracted a bit looking at std::fs::copy, which unfortunately does not preserve sparseness of files. We thought maybe the file had been created sparse in the seed directory, then copied in a way that didn't preserve sparseness when running the actual tests. But this isn't a real sparse file, so I think that's unrelated.

So the current thinking is:

  • The build process creates the ballast file in the seed directory. Due to the CockroachDB issue, on ZFS, this will be tiny. In CI, on local Mac, or local Linux without ZFS, this could be 1 GiB or more.
  • The test process copies this file to TMPDIR once for each test and removes it after each test completes. So you'll wind up with space used in TMPDIR at about 1 GiB times the number of concurrent tests. Ben's machine has 24 cores and 20 GiB of tmpfs, so it makes sense he might sometimes run out of space. The GitHub CI runners have only 2 cores. I'm not sure this fully explains the CI failures.
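
To make the arithmetic in the second bullet concrete, here is a small Rust sketch. It assumes one concurrent test per core (the Rust test harness defaults to roughly that, but treat it as an assumption) and counts only the ballast copies, ignoring everything else the tests write to TMPDIR. The function names are hypothetical.

```rust
/// One gibibyte in bytes.
const GIB: u64 = 1024 * 1024 * 1024;

/// Estimated peak TMPDIR usage from ballast copies alone:
/// one copy of the ballast file per concurrently running test.
fn peak_ballast_bytes(concurrent_tests: u64, ballast_bytes: u64) -> u64 {
    concurrent_tests * ballast_bytes
}

/// True when the ballast copies by themselves would not fit in the tmpfs.
fn ballast_exceeds_tmpfs(concurrent_tests: u64, ballast_bytes: u64, tmpfs_bytes: u64) -> bool {
    peak_ballast_bytes(concurrent_tests, ballast_bytes) > tmpfs_bytes
}
```

With Ben's numbers, `ballast_exceeds_tmpfs(24, GIB, 20 * GIB)` is true: 24 concurrent copies need 24 GiB against 20 GiB of tmpfs. With only 2 concurrent tests the ballast copies need about 2 GiB, which is why the CI failures may need a further explanation.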
