
Optimize WAL storage in safekeeper #1318


Merged
merged 1 commit on Feb 25, 2022

Conversation

petuhovskiy
Member

When several AppendRequests can be read from the socket without blocking,
they are processed together and fsync() on the segment file is called only
once. The segment file is no longer opened for every write request; the
last opened file is now cached inside PhysicalStorage. A new metric for WAL
flushes, FLUSH_WAL_SECONDS, was added to the storage. More error checks were
added to the storage for non-sequential WAL writes; write_lsn can now be
moved only by calling truncate_lsn(new_lsn).
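
Roughly, the write path follows the sketch below. This is a standalone illustration of the batching idea, not the safekeeper's actual code; the `SegmentWriter` type, its methods, and the file path are made up for the example.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;

// Made-up stand-in for PhysicalStorage: the last opened segment file is cached
// instead of being reopened for every write request.
struct SegmentWriter {
    file: Option<File>,
}

impl SegmentWriter {
    fn open_if_needed(&mut self, path: &str) -> std::io::Result<&mut File> {
        if self.file.is_none() {
            self.file = Some(OpenOptions::new().create(true).append(true).open(path)?);
        }
        Ok(self.file.as_mut().unwrap())
    }

    // Handle one AppendRequest's WAL bytes; note there is no fsync here.
    fn append(&mut self, path: &str, buf: &[u8]) -> std::io::Result<()> {
        self.open_if_needed(path)?.write_all(buf)
    }

    // Called once per batch: a single fdatasync covers all buffered appends.
    fn flush_wal(&mut self) -> std::io::Result<()> {
        if let Some(file) = &self.file {
            file.sync_data()?;
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut storage = SegmentWriter { file: None };
    // Pretend three AppendRequests were read from the socket without blocking.
    for rec in [b"rec1", b"rec2", b"rec3"] {
        storage.append("wal.segment", rec)?;
    }
    // One fdatasync for the whole batch instead of one per request.
    storage.flush_wal()
}
```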

New messages have been added to the ProposerAcceptorMessage enum. They
can't be deserialized directly and are currently used only for optimizing
flushes. The existing protocol wasn't changed, and a flush is still issued
for every AppendRequest, as it was before.
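
A hedged sketch of what such internal-only variants could look like; the variant names below are illustrative, not necessarily the ones this PR adds:

```rust
// The idea: the receive loop can tag all but the last request in a batch as
// "don't flush yet" and then issue one explicit flush. These extra variants
// are internal only and are never produced by deserialization, so the wire
// protocol stays unchanged.
#[allow(dead_code)]
struct AppendRequest {
    // WAL bytes and start/end LSNs would live here.
}

#[allow(dead_code)]
enum ProposerAcceptorMessage {
    AppendRequest(AppendRequest),        // parsed from the socket, flush after handling
    NoFlushAppendRequest(AppendRequest), // internal: same handling, but defer the flush
    FlushWal,                            // internal: flush everything written so far
}
```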

This PR replaces #1266, as a cleaner version of the same optimization. Closes #1144.
I'll post test results here when they're ready.

@petuhovskiy force-pushed the sk-opt-fsync branch 2 times, most recently from fd866e6 to 10dbac2 on February 24, 2022 11:22
@petuhovskiy
Member Author

Here are the results of the test from #1144, to compare with #1266.
Code to test the results of this PR: https://github.com/zenithdb/zenith/tree/test-perf-pr-1318
Same test backported to main: https://github.com/zenithdb/zenith/tree/test-perf-backport-1318

These results are for a local run on my machine, which has relatively slow fsync (2364 ops/sec, 423 usecs/op) and a not very powerful 8-core CPU. I've added "per fsync" metrics, which are useful for comparing disk usage.

| Test info | pgbench -i -s 200 (time) | pgbench -N -T 60 (TPS) | init wal_bytes, per fsync | bench wal_bytes, per fsync | bench txes, per fsync |
|---|---|---|---|---|---|
| NO PATCH safekeeper + wp (fsync=on, no checkpoints) | 108 s | 1177 TPS | 85kb | 436 b | 1 tx / fsync |
| PATCHED safekeeper + wp (fsync=on, no checkpoints) | 87 s | 2574 TPS | 417kb | 6314 b | 15.4 tx / fsync |
| vanilla repl (fsync=on, no checkpoints) | 146 s | 4215 TPS | 256kb | 5552 b | 14 tx / fsync |

These results look the same as in #1266.

@petuhovskiy
Member Author

In the more usual 1wp+3sk EC2 test (#799), the results are also similar to the previous PR #1266: safekeepers are faster than synchronous replication:

| Code branch | Test info | pgbench -i -s 400 (time) | pgbench TPS | init wal_bytes, per fsync | bench wal_bytes, per fsync | bench txes, per fsync | link to gist |
|---|---|---|---|---|---|---|---|
| main | vanilla 1primary(fsync=off)+3replica(fsync=on) | 130 s | 14237 TPS | 800 kb | 1948 b | 6.7 tx / fsync | https://gist.github.com/petuhovskiy/35134419ccb0f287046a6b1c9a6150b8 |
| this PR | 1wp(fsync=off)+3sk(fsync=on) | 92 s | 20400 TPS | 209kb | 1309 b | 4.5 tx / fsync | https://gist.github.com/petuhovskiy/c16668bdfea05acf39a8154b81ac756b |

Interestingly, the fsync call counts differ a lot between the Postgres synchronous replicas during pgbench -i; this is visible in the gist reports.

}

if let Some(mut unflushed_file) = self.file.take() {
    self.fdatasync_file(&mut unflushed_file)?;
Contributor

The Option can be matched with the ref keyword to borrow its contents, to avoid taking it out and putting it back.

Member Author

I need to call self.fdatasync_file inside this if, so with a ref borrow I get an error:

cannot borrow `*self` as immutable because it is also borrowed as mutable
immutable borrow occurs here (rustc [E0502](https://doc.rust-lang.org/error-index.html#E0502))
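
A minimal standalone reproduction of why `take()` is used instead of a `ref mut` pattern; the `Storage` type here is a made-up stand-in for the real storage struct:

```rust
use std::fs::File;

struct Storage {
    file: Option<File>,
}

impl Storage {
    fn fdatasync_file(&self, file: &mut File) -> std::io::Result<()> {
        file.sync_data()
    }

    fn flush(&mut self) -> std::io::Result<()> {
        // Writing this as `if let Some(ref mut f) = self.file { self.fdatasync_file(f)?; }`
        // fails with E0502: the mutable borrow of self.file is still alive while
        // fdatasync_file needs &self. Taking the file out of the Option ends that
        // borrow before the call:
        if let Some(mut unflushed_file) = self.file.take() {
            self.fdatasync_file(&mut unflushed_file)?;
            self.file = Some(unflushed_file); // put it back; the real code may differ here
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut storage = Storage {
        file: Some(File::create("wal.segment")?),
    };
    storage.flush()
}
```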

let mut partial;
let mut start_pos = startpos;
const ZERO_BLOCK: &[u8] = &[0u8; XLOG_BLCKSZ];
if self.write_lsn != pos {
Contributor

How can this be true?

Member Author (@petuhovskiy, Feb 25, 2022)

It shouldn't be; that's possible only if someone uses the private API directly, e.g. functions like write_exact.
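
For context, here is a rough sketch of the kind of sequential-write guard the PR description refers to ("more errors for non-sequential WAL writes"); the type, the LSN representation, and the error message are simplified stand-ins, not the actual implementation:

```rust
// Simplified illustration: LSNs are plain u64 here. write_exact() refuses any
// position other than the current write_lsn, so write_lsn can only advance
// through sequential writes (or be repositioned explicitly via truncate_lsn).
struct PhysicalStorageSketch {
    write_lsn: u64,
}

impl PhysicalStorageSketch {
    fn write_exact(&mut self, pos: u64, buf: &[u8]) -> Result<(), String> {
        if self.write_lsn != pos {
            return Err(format!(
                "non-sequential write at {}, expected write at {}",
                pos, self.write_lsn
            ));
        }
        // ... actually write buf to the segment file here ...
        self.write_lsn += buf.len() as u64;
        Ok(())
    }
}

fn main() {
    let mut storage = PhysicalStorageSketch { write_lsn: 0 };
    assert!(storage.write_exact(0, b"first").is_ok());
    assert!(storage.write_exact(100, b"skipped ahead").is_err()); // non-sequential
}
```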

@petuhovskiy
Member Author

> These results are for a local run on my machine, which has relatively slow fsync (2364 ops/sec, 423 usecs/op) and a not very powerful 8-core CPU. I've added "per fsync" metrics, which are useful for comparing disk usage.

I've updated the PR to use fdatasync in most places, and ran this test again. It seems my machine also has much slower fsync compared to fdatasync, and now the results are almost the same:

| Test info | pgbench -i -s 200 (time) | pgbench -N -T 60 (TPS) | init wal_bytes, per fsync | bench wal_bytes, per fsync | bench txes, per fsync |
|---|---|---|---|---|---|
| PATCHED safekeeper + wp (fsync=on, no checkpoints), using fdatasync | 99 s | 4041 TPS | 301kb | 5439 b | 13.8 tx / fsync |
| vanilla repl (fsync=on, no checkpoints) | 149 s | 4170 TPS | 257kb | 5674 b | 14.5 tx / fsync |
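
For reference, the fsync/fdatasync distinction in Rust std terms (the file name is just an example): `sync_all()` maps to fsync(2) and flushes file data plus metadata, while `sync_data()` maps to fdatasync(2) and may skip metadata not needed to read the data back, which is why it can be noticeably cheaper:

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .create(true)
        .append(true)
        .open("wal.segment")?;
    f.write_all(b"some WAL bytes")?;

    f.sync_all()?;  // fsync(2): flush data and all metadata
    f.sync_data()?; // fdatasync(2): flush data, skipping metadata not needed to read it back
    Ok(())
}
```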

Successfully merging this pull request may close these issues:

Optimize flushes and AppendResponses in safekeeper