Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vfs, table_cache: fadvise FADV_RANDOM on sstable files #215

Merged
merged 1 commit into from
Aug 12, 2019

Conversation

itsbilal
Copy link
Member

@itsbilal itsbilal commented Aug 8, 2019

This change calls fadvise with FADV_RANDOM on sstable file descriptors,
to ensure that readahead is disabled. This reduces wasted I/Os when
reading from sstables, since sstable reads especially for short to
medium range scans do not read large enough contiguous blocks to be
able to take advantage of readahead. Instead, readahead ends up reducing
user-visible I/O performance.

See #198 .

@petermattis
Copy link
Collaborator

This change is Reviewable

@itsbilal
Copy link
Member Author

itsbilal commented Aug 8, 2019

Some comparative benchmarks, command run (on a c5d.4xlarge) with 26gb malloc'd away by another program (see issue #198 for repro steps):

./pebble ycsb /mnt/data1/bench_16G --workload scan=100 --concurrency 16 --prepopulated-keys $[16*1024*1024*1024/64] --initial-keys 0 --keys uniform --duration 2m --cache $[1*1024*1024*1024] --scans 100

Pebble (before change):

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.0s         303419         2527.6      2527558.9      6.3      6.3     10.0     21.0     65.0

Pebble (after change):

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.0s         791940         6599.5      6599460.5      2.4      2.4      3.4      5.8    192.9

RocksDB (run 1):

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.0s         785964         6549.6      6549625.3      2.4      2.5      3.3      4.2    184.5

RocksDB (run 2):

____optype__elapsed_____ops(total)___ops/sec(cum)__keys/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  scan_100   120.3s         766692         6374.0      6374020.5      2.5      2.5      3.4      6.6     88.1

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 9 files reviewed, 2 unresolved discussions (waiting on @ajkr, @itsbilal, and @petermattis)


table_cache.go, line 348 at r1 (raw file):

	}
	if fd, err := c.fs.Fd(f); err == nil {
		_ = vfs.FadviseRandom(fd)

I'm not sure about this structure as it exposes Fd which is in an implementation detail. An alternative would be to add vfs.File.AdviseRandom or add an option to vfs.VFS.Open.


vfs/fadvise_linux.go, line 14 at r1 (raw file):

func FadviseRandom(f uintptr) error {
	return unix.Fadvise(int(f), 0, 0, unix.FADV_RANDOM)
}

Nit: reviewable is complaining that this file is missing a newline at the end of the file.

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this for a bit, and ended up adding a bool argument to vfs.FS.Open. Options I considered:

  • Borrow fileType from pebble.filename and passing that as an argument to Open - but that would mean exporting the type across parent/child packages and it wasn't very clear whether vfs should own that type (or if it deserved to have its own iota type defining file types)

  • Making a new vfs.FileType - see above

  • Adding a method to vfs.File - but this would take away the efficiency of just using *os.File as a vfs.File.

Reviewable status: 0 of 23 files reviewed, 1 unresolved discussion (waiting on @ajkr and @petermattis)


table_cache.go, line 348 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not sure about this structure as it exposes Fd which is in an implementation detail. An alternative would be to add vfs.File.AdviseRandom or add an option to vfs.VFS.Open.

Done.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the extra parameter more than the other options you mention. There is a nuance to the parameter approach which could shrink down the size of this PR, though with slightly higher overhead to the Open call. That approach adds an ...OpenOption parameter. Something like:

type OpenOpen interface {
  apply(f File)
}

type randomReadsOption struct{}

func (randomReadsOption) apply(f File) {
  // call posix_fadvise
}

var RandomReads randomReadsOption

type FS interface {
  Open(name string, opts ...OpenOption)
}

This pattern comes from https://commandcenter.blogspot.com/2014/01/self-referential-functions-and-design.html and we use it in a handful of places in CockroachDB. The downside is that it may involve an extra allocation at the caller, but for opening a file that isn't going to be a problem at all. The upside is that you don't have to specify the option at most callers. That is, fs.Open("foo") still works.

Reviewable status: 0 of 23 files reviewed, 1 unresolved discussion (waiting on @ajkr and @petermattis)

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way that patterns lets us preserve the simpler Open(filename) call. Made the change - PTAL!

Reviewable status: 0 of 10 files reviewed, 1 unresolved discussion (waiting on @ajkr and @petermattis)

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Is there a way to test that this is working? Ideally we'd want to know that tableCache is passing this option and that specifying this option on linux is causing an fadvise system call. The reason for wanting a test is to make sure this doesn't rot. I think making sure tableCache is passing the option can be done in tableCacheTestFS.Open. Not sure about ensuring the option results in an fadvise system call on linux.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @ajkr, @itsbilal, and @petermattis)


vfs/vfs.go, line 148 at r3 (raw file):

// random reads, by calling  fadvise() with POSIX_FADV_RANDOM on Linux systems
// to disable readahead. Only works when specified to defaultFS.
type RandomReadsOption struct{}

Rather than exporting the RandomReadsOption type, you can make this private and export a RandomReadsOption variable:

type randomReadsOption struct{}
var RandomReadsOption OpenOption = randomReadsOption{}

This makes the usage of the option every so slightly prettier.

This change calls fadvise with FADV_RANDOM on sstable file descriptors,
to ensure that readahead is disabled. This reduces wasted I/Os when
reading from sstables, since sstable reads especially for short to
medium range scans do not read large enough contiguous blocks to be
able to take advantage of readahead. Instead, readahead ends up reducing
user-visible I/O performance.

See cockroachdb#198 .
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I've made the tableCacheTest change as well as the type/var export one.

Can't think of a way to be able to unit test the posix_fadvise call itself, but the hope is that a change with a performance impact this significant will hopefully show up in our benchmarks/testing elsewhere.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @ajkr, @itsbilal, and @petermattis)

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't think of a way to be able to unit test the posix_fadvise call itself, but the hope is that a change with a performance impact this significant will hopefully show up in our benchmarks/testing elsewhere.

Ack.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @ajkr, @itsbilal, and @petermattis)

@itsbilal itsbilal merged commit 2b6844f into cockroachdb:master Aug 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants