
Optimize WAL storage in safekeeper #1318


Merged
merged 1 commit on Feb 25, 2022

Conversation

petuhovskiy
Member

When several AppendRequests can be read from the socket without blocking,
they are processed together and fsync() on the segment file is called only
once. The segment file is no longer opened for every write request; the
last opened file is now cached inside PhysicalStorage. A new metric for WAL
flushes, FLUSH_WAL_SECONDS, was added to the storage. More error checks were
added to the storage for non-sequential WAL writes; write_lsn can now be
moved only by calling truncate_lsn(new_lsn).
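
Roughly, the write path follows the sketch below. This is a standalone illustration of the batching idea, not the safekeeper's actual code; the `SegmentWriter` type, its methods, and the file path are made up for the example.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;

// Made-up stand-in for PhysicalStorage: the last opened segment file is cached
// instead of being reopened for every write request.
struct SegmentWriter {
    file: Option<File>,
}

impl SegmentWriter {
    fn open_if_needed(&mut self, path: &str) -> std::io::Result<&mut File> {
        if self.file.is_none() {
            self.file = Some(OpenOptions::new().create(true).append(true).open(path)?);
        }
        Ok(self.file.as_mut().unwrap())
    }

    // Handle one AppendRequest's WAL bytes; note there is no fsync here.
    fn append(&mut self, path: &str, buf: &[u8]) -> std::io::Result<()> {
        self.open_if_needed(path)?.write_all(buf)
    }

    // Called once per batch: a single fdatasync covers all buffered appends.
    fn flush_wal(&mut self) -> std::io::Result<()> {
        if let Some(file) = &self.file {
            file.sync_data()?;
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut storage = SegmentWriter { file: None };
    // Pretend three AppendRequests were read from the socket without blocking.
    for rec in [b"rec1", b"rec2", b"rec3"] {
        storage.append("wal.segment", rec)?;
    }
    // One fdatasync for the whole batch instead of one per request.
    storage.flush_wal()
}
```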

New messages have been added to the ProposerAcceptorMessage enum. They
can't be deserialized directly and are currently used only for optimizing
flushes. The existing protocol wasn't changed, and a flush is still issued
for every AppendRequest, as it was before.
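
A hedged sketch of what such internal-only variants could look like; the variant names below are illustrative, not necessarily the ones this PR adds:

```rust
// The idea: the receive loop can tag all but the last request in a batch as
// "don't flush yet" and then issue one explicit flush. These extra variants
// are internal only and are never produced by deserialization, so the wire
// protocol stays unchanged.
#[allow(dead_code)]
struct AppendRequest {
    // WAL bytes and start/end LSNs would live here.
}

#[allow(dead_code)]
enum ProposerAcceptorMessage {
    AppendRequest(AppendRequest),        // parsed from the socket, flush after handling
    NoFlushAppendRequest(AppendRequest), // internal: same handling, but defer the flush
    FlushWal,                            // internal: flush everything written so far
}
```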

This PR replaces #1266, as a cleaner version of the same optimization. Closes #1144.
I'll post test results here when they're ready.

@petuhovskiy force-pushed the sk-opt-fsync branch 2 times, most recently from fd866e6 to 10dbac2 on February 24, 2022 11:22
@petuhovskiy
Member Author

Here are the results of the test from #1144, to compare with #1266.
Code to test the results of this PR: https://github.com/zenithdb/zenith/tree/test-perf-pr-1318
Same test backported to main: https://github.com/zenithdb/zenith/tree/test-perf-backport-1318

These results are for a local run on my machine, which has relatively slow fsync (2364 ops/sec, 423 usecs/op) and a not very powerful 8-core CPU. I've added "per fsync" metrics, which are useful for comparing disk usage.

| Test info | pgbench -i -s 200 (time) | pgbench -N -T 60 (TPS) | init wal_bytes, per fsync | bench wal_bytes, per fsync | bench txes, per fsync |
|---|---|---|---|---|---|
| NO PATCH safekeeper + wp (fsync=on, no checkpoints) | 108 s | 1177 TPS | 85kb | 436 b | 1 tx / fsync |
| PATCHED safekeeper + wp (fsync=on, no checkpoints) | 87 s | 2574 TPS | 417kb | 6314 b | 15.4 tx / fsync |
| vanilla repl (fsync=on, no checkpoints) | 146 s | 4215 TPS | 256kb | 5552 b | 14 tx / fsync |

These results look the same as in #1266.

@petuhovskiy
Member Author

In the more usual 1wp+3sk EC2 test (#799), the results are also similar to the previous PR #1266: safekeepers are faster than synchronous replication:

| Code branch | Test info | pgbench -i -s 400 (time) | pgbench TPS | init wal_bytes, per fsync | bench wal_bytes, per fsync | bench txes, per fsync | link to gist |
|---|---|---|---|---|---|---|---|
| main | vanilla 1primary(fsync=off)+3replica(fsync=on) | 130 s | 14237 TPS | 800 kb | 1948 b | 6.7 tx / fsync | https://gist.github.com/petuhovskiy/35134419ccb0f287046a6b1c9a6150b8 |
| this PR | 1wp(fsync=off)+3sk(fsync=on) | 92 s | 20400 TPS | 209kb | 1309 b | 4.5 tx / fsync | https://gist.github.com/petuhovskiy/c16668bdfea05acf39a8154b81ac756b |

Interestingly, the fsync call counts differ a lot between the Postgres synchronous replicas during pgbench -i; this is visible in the gist reports.

}

if let Some(mut unflushed_file) = self.file.take() {
    self.fdatasync_file(&mut unflushed_file)?;
Contributor

The Option can be matched with the ref keyword to borrow its contents, to avoid taking it out and putting it back.

Member Author

I need to call self.fdatasync_file inside this if, so with a ref borrow I get an error:

cannot borrow `*self` as immutable because it is also borrowed as mutable
immutable borrow occurs here (rustc [E0502](https://doc.rust-lang.org/error-index.html#E0502))
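
A minimal standalone reproduction of why `take()` is used instead of a `ref mut` pattern; the `Storage` type here is a made-up stand-in for the real storage struct:

```rust
use std::fs::File;

struct Storage {
    file: Option<File>,
}

impl Storage {
    fn fdatasync_file(&self, file: &mut File) -> std::io::Result<()> {
        file.sync_data()
    }

    fn flush(&mut self) -> std::io::Result<()> {
        // Writing this as `if let Some(ref mut f) = self.file { self.fdatasync_file(f)?; }`
        // fails with E0502: the mutable borrow of self.file is still alive while
        // fdatasync_file needs &self. Taking the file out of the Option ends that
        // borrow before the call:
        if let Some(mut unflushed_file) = self.file.take() {
            self.fdatasync_file(&mut unflushed_file)?;
            self.file = Some(unflushed_file); // put it back; the real code may differ here
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut storage = Storage {
        file: Some(File::create("wal.segment")?),
    };
    storage.flush()
}
```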

let mut partial;
let mut start_pos = startpos;
const ZERO_BLOCK: &[u8] = &[0u8; XLOG_BLCKSZ];
if self.write_lsn != pos {
Contributor

How can this be true?

Member Author (@petuhovskiy, Feb 25, 2022)

It shouldn't be; that's possible only if someone uses the private API directly, e.g. functions like write_exact.
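
For context, here is a rough sketch of the kind of sequential-write guard the PR description refers to ("more errors for non-sequential WAL writes"); the type, the LSN representation, and the error message are simplified stand-ins, not the actual implementation:

```rust
// Simplified illustration: LSNs are plain u64 here. write_exact() refuses any
// position other than the current write_lsn, so write_lsn can only advance
// through sequential writes (or be repositioned explicitly via truncate_lsn).
struct PhysicalStorageSketch {
    write_lsn: u64,
}

impl PhysicalStorageSketch {
    fn write_exact(&mut self, pos: u64, buf: &[u8]) -> Result<(), String> {
        if self.write_lsn != pos {
            return Err(format!(
                "non-sequential write at {}, expected write at {}",
                pos, self.write_lsn
            ));
        }
        // ... actually write buf to the segment file here ...
        self.write_lsn += buf.len() as u64;
        Ok(())
    }
}

fn main() {
    let mut storage = PhysicalStorageSketch { write_lsn: 0 };
    assert!(storage.write_exact(0, b"first").is_ok());
    assert!(storage.write_exact(100, b"skipped ahead").is_err()); // non-sequential
}
```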

@petuhovskiy
Member Author

> These results are for a local run on my machine, which has relatively slow fsync (2364 ops/sec, 423 usecs/op) and a not very powerful 8-core CPU. I've added "per fsync" metrics, which are useful for comparing disk usage.

I've updated the PR to use fdatasync in most places, and ran this test again. It seems my machine also has much slower fsync compared to fdatasync, and now the results are almost the same:

| Test info | pgbench -i -s 200 (time) | pgbench -N -T 60 (TPS) | init wal_bytes, per fsync | bench wal_bytes, per fsync | bench txes, per fsync |
|---|---|---|---|---|---|
| PATCHED safekeeper + wp (fsync=on, no checkpoints), using fdatasync | 99 s | 4041 TPS | 301kb | 5439 b | 13.8 tx / fsync |
| vanilla repl (fsync=on, no checkpoints) | 149 s | 4170 TPS | 257kb | 5674 b | 14.5 tx / fsync |
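
For reference, the fsync/fdatasync distinction in Rust std terms (the file name is just an example): `sync_all()` maps to fsync(2) and flushes file data plus metadata, while `sync_data()` maps to fdatasync(2) and may skip metadata not needed to read the data back, which is why it can be noticeably cheaper:

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .create(true)
        .append(true)
        .open("wal.segment")?;
    f.write_all(b"some WAL bytes")?;

    f.sync_all()?;  // fsync(2): flush data and all metadata
    f.sync_data()?; // fdatasync(2): flush data, skipping metadata not needed to read it back
    Ok(())
}
```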

Successfully merging this pull request may close these issues:

Optimize flushes and AppendResponses in safekeeper