raft: make Message.Snapshot nullable, halve struct size #14592

nvanbenschoten · 2022-10-16T04:29:08Z

This commit makes the rarely used raftpb.Message.Snapshot field nullable. In doing so, it reduces the memory size of a raftpb.Message message from 264 bytes to 128 bytes — a 52% reduction in size.

While this commit does not change the protobuf encoding, it does change how that encoding is used. (gogoproto.nullable) = false instructs the generated proto marshaling logic to always encode a value for the field, even if that value is empty. (gogoproto.nullable) = true instructs the generated proto marshaling logic to omit an encoded value for the field if the field is nil.
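The difference can be sketched with a hand-rolled encoder. This is not the gogoproto-generated code: the types, function names, and field numbers below are hypothetical and chosen only to show that a by-value nested field always costs bytes on the wire, while a nil pointer costs none.

```go
package main

import "fmt"

// inner stands in for a nested message such as a snapshot header.
type inner struct {
	Index uint64
}

// Hypothetical field tags: (field number << 3) | wire type.
const (
	tagInnerIndex = 1<<3 | 0 // field 1, wire type 0 (varint)
	tagOuterInner = 2<<3 | 2 // field 2, wire type 2 (length-delimited)
)

// appendVarint appends v in protobuf base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// marshalInner always writes its scalar, like a non-nullable proto2 field.
func marshalInner(in inner) []byte {
	return appendVarint([]byte{tagInnerIndex}, in.Index)
}

// marshalValue mimics nullable = false: the nested field is always written.
func marshalValue(in inner) []byte {
	payload := marshalInner(in)
	b := appendVarint([]byte{tagOuterInner}, uint64(len(payload)))
	return append(b, payload...)
}

// marshalPointer mimics nullable = true: a nil nested field is omitted.
func marshalPointer(in *inner) []byte {
	if in == nil {
		return nil
	}
	return marshalValue(*in)
}

func main() {
	fmt.Println("empty value:", len(marshalValue(inner{})), "bytes")
	fmt.Println("nil pointer:", len(marshalPointer(nil)), "bytes")
}
```

A non-nil pointer encodes exactly like the by-value field, which is why the wire format itself is unchanged.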

This raises compatibility concerns in both directions. Messages encoded by new binary versions without a Snapshot field will be decoded as an empty field by old binary versions. In other words, old binary versions can't tell the difference. However, messages encoded by old binary versions with an empty Snapshot field will be decoded as a non-nil, empty field by new binary versions. As a result, new binary versions need to be prepared to handle such messages.
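One way a new binary might guard against both shapes is a single helper that treats a nil pointer and an empty value identically. The sketch below uses toy types, and its emptiness test (index equal to zero) is an assumption made for illustration, not a description of the checks in this PR.

```go
package main

import "fmt"

// snap and msg are toy stand-ins, not the generated raftpb types.
type snap struct {
	Index, Term uint64
}

type msg struct {
	Snapshot *snap // nil when the sender omitted the field
}

// hasSnapshot treats a nil pointer and a zero-valued snapshot the same way.
// Using Index == 0 as the "empty" test is an assumption made for this sketch.
func hasSnapshot(m msg) bool {
	return m.Snapshot != nil && m.Snapshot.Index != 0
}

func main() {
	fromNew := msg{}                                   // field omitted by a new sender
	fromOld := msg{Snapshot: &snap{}}                  // empty field encoded by an old sender
	actual := msg{Snapshot: &snap{Index: 10, Term: 2}} // a message that carries a snapshot
	fmt.Println(hasSnapshot(fromNew), hasSnapshot(fromOld), hasSnapshot(actual))
}
```

Routing every check through one helper keeps call sites from dereferencing a nil pointer or mistaking an old sender's empty field for a real snapshot.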

While Message.Snapshot is not intentionally part of the external interface of this library, it was possible for users of the library to access it and manipulate it. As such, this change may be considered a breaking change.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

codecov-commenter · 2022-10-16T05:25:09Z

Codecov Report

Merging #14592 (0f9d7a4) into main (00820f0) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #14592      +/-   ##
==========================================
+ Coverage   75.48%   75.52%   +0.03%     
==========================================
  Files         457      457              
  Lines       37314    37317       +3     
==========================================
+ Hits        28166    28182      +16     
+ Misses       7369     7361       -8     
+ Partials     1779     1774       -5

Flag	Coverage Δ
all	`75.52% <100.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
raft/raft.go	`89.58% <100.00%> (+0.03%)`	⬆️
raft/util.go	`87.59% <100.00%> (ø)`
server/etcdserver/snapshot_merge.go	`75.00% <100.00%> (ø)`
server/storage/mvcc/watchable_store.go	`85.14% <0.00%> (-4.72%)`	⬇️
server/etcdserver/txn/util.go	`75.47% <0.00%> (-3.78%)`	⬇️
server/etcdserver/cluster_util.go	`70.35% <0.00%> (-3.17%)`	⬇️
client/pkg/v3/testutil/leak.go	`60.17% <0.00%> (-2.66%)`	⬇️
api/etcdserverpb/raft_internal_stringer.go	`81.72% <0.00%> (-2.16%)`	⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go	`69.56% <0.00%> (-1.74%)`	⬇️
server/etcdserver/raft.go	`88.63% <0.00%> (-1.14%)`	⬇️
... and 16 more


ahrtr · 2022-10-16T21:42:57Z

This looks like a valid improvement to me.

A couple of generic questions:

Have you verified that an etcd instance with the new raft (enhanced in this PR) can work with an etcd with the old raft?
Have you run any benchmarks, and if so, how much does performance improve?

cc @ptabor @tbg

serathius · 2022-10-17T09:21:07Z

> This commit makes the rarely used raftpb.Message.Snapshot field nullable. In doing so, it reduces the memory size of a raftpb.Message message from 264 bytes to 128 bytes, a 52% reduction in size.

If the message is rarely used, why change it at all? I know a 52% improvement seems great; however, if "rare" means ~1/1000 proto messages are snapshots, then this is only a 0.05% improvement.

This would not be as bad of an improvement if it didn't introduce breaking-change concerns. Now we need to spend time adding specific e2e tests just to ensure this doesn't break anything. I would say that the work we will need to put in is greater than the improvement.

Is there no lower-hanging fruit than this?

nvanbenschoten · 2022-10-17T13:35:49Z

> Have you verified that an etcd instance with the new raft (enhanced in this PR) can work with an etcd with the old raft?

No, not yet. We should perform such verification. Are there cross-version tests in etcd? @serathius mentioned e2e testing, which may include mixed version testing. If no such testing exists, we can piggyback on top of CockroachDB's mixed-version testing to gain more confidence in this change.

> Have you run any benchmarks, and if so, how much does performance improve?

Also not yet. Are there any such benchmarks in etcd which exercise raft that you'd like to see? If not, I can evaluate this change using the raft benchmarking harness (rafttoy) that we developed to measure the performance impact of variations in the use of etcd/raft. The harness should also pick up on changes to etcd/raft itself.

> If the message is rarely used, why change it at all? I know a 52% improvement seems great; however, if "rare" means ~1/1000 proto messages are snapshots, then this is only a 0.05% improvement.

etcd/raft exposes a message-passing interface. Every operation (proposal, log append, vote, heartbeat, snapshot, etc.) is encoded into the same raftpb.Message struct. This is the list of all message types. The structs are passed by value into various functions, packed into slices, and shipped over the network.

This commit makes the snapshot header in the message struct nullable, so that non-snapshot messages (the ~999/1000 messages that aren't snapshots) don't need to pay the cost of this large snapshot header either in-memory or over the wire. That reduces the size of the memory representation of a raftpb.Message from 264 bytes to 128 bytes (on a 64-bit architecture). It also reduces the proto encoded size of an empty raftpb.Message from 30 bytes to 18 bytes.

ahrtr · 2022-10-17T23:48:08Z

We do have a cluster downgrade test, but we don't have mixed-version testing, such as 3.5 mixed with 3.6. We need to make sure an etcd cluster with mixed versions can handle raft messages correctly. So please add at least one e2e test for this.

Currently we only have the benchmark tool (FYI: #14394 (comment)). I think we also need to verify memory and network bandwidth usage, probably via the metrics. Does it also make sense to add a Go benchmark test?
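A Go-level benchmark for the memory side could look roughly like the following. It is a minimal sketch rather than a benchmark from the etcd repository: the two types are opaque stand-ins sized after the figures quoted in this thread, and the function name is hypothetical. It measures heap bytes allocated when a batch of messages is packed into a slice.

```go
package main

import (
	"fmt"
	"testing"
)

// Opaque stand-ins sized like the before/after figures quoted in this thread.
type msgBefore struct{ _ [264]byte }
type msgAfter struct{ _ [128]byte }

// Package-level sinks keep the allocations from being optimized away.
var (
	sinkBefore []msgBefore
	sinkAfter  []msgAfter
)

// bytesPerBatch reports heap bytes allocated per batch of n messages
// for each stand-in layout.
func bytesPerBatch(n int) (before, after int64) {
	// Init is needed when calling testing.Benchmark outside "go test";
	// it has no effect if it was already called.
	testing.Init()
	rb := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkBefore = make([]msgBefore, n)
		}
	})
	ra := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkAfter = make([]msgAfter, n)
		}
	})
	return rb.AllocedBytesPerOp(), ra.AllocedBytesPerOp()
}

func main() {
	before, after := bytesPerBatch(16)
	fmt.Printf("batch of 16: %d B/op before, %d B/op after\n", before, after)
}
```

In a real test file the two closures would be ordinary `BenchmarkXxx` functions run with `go test -bench`, and their output could be compared across branches with a tool such as benchstat.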

nvanbenschoten · 2022-10-21T15:52:54Z

Thanks for the pointers @ahrtr. I ran some system and micro-benchmarks.

All performance testing was run on a 30vCPU, 128GB, Intel(R) Xeon(R) CPU @ 3.10GHz GCP instance.

Results on one-server cluster

$ ./bin/etcd --quota-backend-bytes=4300000000

$ go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=1000000000 --key-size=128 --val-size=1024  --total=1000000 --rate=40000

Results on main

Summary:
  Total:	31.4938 secs.
  Slowest:	0.0483 secs.
  Fastest:	0.0012 secs.
  Average:	0.0052 secs.
  Stddev:	0.0046 secs.
  Requests/sec:	31752.2840

Response time histogram:
  0.0012 [1]	|
  0.0059 [893343]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0106 [36584]	|∎
  0.0153 [12309]	|
  0.0200 [9395]	|
  0.0247 [40703]	|∎
  0.0295 [4590]	|
  0.0342 [2111]	|
  0.0389 [544]	|
  0.0436 [219]	|
  0.0483 [201]	|

Latency distribution:
  10% in 0.0035 secs.
  25% in 0.0037 secs.
  50% in 0.0039 secs.
  75% in 0.0043 secs.
  90% in 0.0062 secs.
  95% in 0.0195 secs.
  99% in 0.0242 secs.
  99.9% in 0.0341 secs.

Results on branch nvanbenschoten/nilSnapMsg

Summary:
  Total:	31.4111 secs.
  Slowest:	0.0459 secs.
  Fastest:	0.0008 secs.
  Average:	0.0052 secs.
  Stddev:	0.0045 secs.
  Requests/sec:	31835.8783

Response time histogram:
  0.0008 [1]	|
  0.0053 [875383]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0098 [49729]	|∎∎
  0.0143 [14657]	|
  0.0188 [9009]	|
  0.0234 [39721]	|∎
  0.0279 [8964]	|
  0.0324 [1364]	|
  0.0369 [746]	|
  0.0414 [359]	|
  0.0459 [67]	|

Latency distribution:
  10% in 0.0034 secs.
  25% in 0.0036 secs.
  50% in 0.0039 secs.
  75% in 0.0042 secs.
  90% in 0.0064 secs.
  95% in 0.0193 secs.
  99% in 0.0236 secs.
  99.9% in 0.0337 secs.

Results on three-server cluster

$ goreman start

$ go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=1000000000 --key-size=128 --val-size=1024  --total=1000000 --rate=40000

Results on main

Summary:
  Total:	63.4617 secs.
  Slowest:	0.0578 secs.
  Fastest:	0.0034 secs.
  Average:	0.0123 secs.
  Stddev:	0.0049 secs.
  Requests/sec:	15757.5301

Response time histogram:
  0.0034 [1]	|
  0.0088 [94433]	|∎∎∎∎∎
  0.0142 [731650]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0197 [73329]	|∎∎∎∎
  0.0251 [66848]	|∎∎∎
  0.0306 [23264]	|∎
  0.0360 [6692]	|
  0.0415 [2576]	|
  0.0469 [892]	|
  0.0524 [121]	|
  0.0578 [194]	|

Latency distribution:
  10% in 0.0089 secs.
  25% in 0.0097 secs.
  50% in 0.0107 secs.
  75% in 0.0125 secs.
  90% in 0.0197 secs.
  95% in 0.0234 secs.
  99% in 0.0308 secs.
  99.9% in 0.0423 secs.

Results on branch nvanbenschoten/nilSnapMsg

Summary:
  Total:	63.4009 secs.
  Slowest:	0.1893 secs.
  Fastest:	0.0033 secs.
  Average:	0.0123 secs.
  Stddev:	0.0055 secs.
  Requests/sec:	15772.6380

Response time histogram:
  0.0033 [1]	|
  0.0219 [926219]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0405 [72671]	|∎∎∎
  0.0591 [909]	|
  0.0777 [0]	|
  0.0963 [0]	|
  0.1149 [0]	|
  0.1335 [0]	|
  0.1521 [0]	|
  0.1707 [0]	|
  0.1893 [200]	|

Latency distribution:
  10% in 0.0088 secs.
  25% in 0.0096 secs.
  50% in 0.0107 secs.
  75% in 0.0123 secs.
  90% in 0.0199 secs.
  95% in 0.0238 secs.
  99% in 0.0307 secs.
  99.9% in 0.0411 secs.

These tests show stable results and very little difference across branches. Maybe a 0.1% increase in throughput, but I did not run enough iterations to determine whether that difference is in the noise. That's mostly expected, given the size of the benchmark, its interaction with disk and network IO, and how targeted this change is.
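For reference, the relative throughput change can be computed directly from the Requests/sec figures in the summaries above. The helper name is hypothetical; the numbers are copied from this comment.

```go
package main

import "fmt"

// percentChange returns the relative change from base to candidate, in percent.
func percentChange(base, candidate float64) float64 {
	return (candidate - base) / base * 100
}

func main() {
	// Requests/sec figures copied from the benchmark summaries above.
	fmt.Printf("one-server:   %+.2f%%\n", percentChange(31752.2840, 31835.8783))
	fmt.Printf("three-server: %+.2f%%\n", percentChange(15757.5301, 15772.6380))
}
```

Both deltas are well under one percent, consistent with the observation that a single run per branch cannot separate them from noise.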

Results on microbenchmarks

Microbenchmarks show a statistically significant and sizable improvement in proposal latency and memory allocations (count and size).

name               old time/op    new time/op    delta
Proposal3Nodes-30    3.86µs ±16%    3.69µs ±23%   -4.37%  (p=0.000 n=100+100)

name               old alloc/op   new alloc/op   delta
Proposal3Nodes-30    4.75kB ±39%    3.82kB ±73%  -19.58%  (p=0.000 n=100+100)

name               old allocs/op  new allocs/op  delta
Proposal3Nodes-30      20.4 ±18%      18.0 ±33%  -11.66%  (p=0.000 n=100+100)

ahrtr · 2022-10-23T05:03:13Z

Thanks @nvanbenschoten for the feedback, which looks good. Messages usually do not include a snapshot, so it doesn't make sense to always include the snapshot header in each message. This PR overall looks good to me. cc @ptabor and @tbg to double check.

We also need to add mixed-version testing (main/3.6 mixed with the previous version, 3.5). Please let me know if you need help with this.

serathius · 2022-10-23T10:06:24Z

I had an idea for how to incorporate mixed-version testing into the common tests by just adding new test scenarios.
PTAL #14610

nvanbenschoten · 2022-10-27T19:39:54Z

> We also need to add mixed-version testing (main/3.6 mixed with the previous version, 3.5). Please let me know if you need help with this.

I would appreciate some help with this, as I'm not up to speed on the state of mixed-version testing in etcd and it also sounds like this is in flux. At a high level, I think a test that 1) sends a Raft snapshot from main/3.6 to previous/3.5 and then 2) sends a Raft snapshot from previous/3.5 to main/3.6 would give us confidence in this change's correctness in a mixed-version setting.

ahrtr · 2022-10-27T21:24:14Z

> I would appreciate some help with this, as I'm not up to speed on the state of mixed-version testing in etcd and it also sounds like this is in flux. At a high level, I think a test that 1) sends a Raft snapshot from main/3.6 to previous/3.5 and then 2) sends a Raft snapshot from previous/3.5 to main/3.6 would give us confidence in this change's correctness in a mixed-version setting.

Sure. @serathius and I will take care of the e2e test.

tbg · 2022-11-07T12:36:36Z

@ahrtr did you also want to review? Going to assign you just in case.

ahrtr · 2022-11-07T19:56:17Z

> @ahrtr did you also want to review? Going to assign you just in case.

We are still in the process of adding the mixed-version tests. We already added some such tests in #14697, but I am planning to add more cases to cover the snapshot scenario as pointed out in #14592 (comment).

Once I finish the snapshot cases, I will ask you to rebase this PR. If all workflows are green, then I am OK to merge this PR.

This commit makes the rarely used `raftpb.Message.Snapshot` field nullable. In doing so, it reduces the memory size of a `raftpb.Message` message from 264 bytes to 128 bytes, a 52% reduction in size.

While this commit does not change the protobuf encoding, it does change how that encoding is used. `(gogoproto.nullable) = false` instructs the generated proto marshaling logic to always encode a value for the field, even if that value is empty. `(gogoproto.nullable) = true` instructs the generated proto marshaling logic to omit an encoded value for the field if the field is nil.

This raises compatibility concerns in both directions. Messages encoded by new binary versions without a `Snapshot` field will be decoded as an empty field by old binary versions. In other words, old binary versions can't tell the difference. However, messages encoded by old binary versions with an empty `Snapshot` field will be decoded as a non-nil, empty field by new binary versions. As a result, new binary versions need to be prepared to handle such messages.

While `Message.Snapshot` is not intentionally part of the external interface of this library, it was possible for users of the library to access it and manipulate it. As such, this change may be considered a breaking change.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

nvanbenschoten · 2022-11-09T18:06:59Z

> Once I finish the snapshot cases, I will ask you to rebase this PR. If all workflows are green, then I am OK to merge this PR.

@ahrtr This is rebased with green CI. Thanks again for adding the mixed-version testing in #14697.

ahrtr

LGTM

Thank you @nvanbenschoten

I will add an additional mixed-version test (FYI: #14707 (comment)), but this PR should be good to go.

nvanbenschoten force-pushed the nvanbenschoten/nilSnapMsg branch 2 times, most recently from 8a6a77a to 2c00b9b Compare October 16, 2022 04:51

nvanbenschoten mentioned this pull request Oct 26, 2022

raft: support asynchronous storage writes #14627

Closed

tbg approved these changes Nov 3, 2022


ahrtr added the area/raft label Nov 7, 2022

tbg requested a review from ahrtr November 7, 2022 12:36

nvanbenschoten force-pushed the nvanbenschoten/nilSnapMsg branch from 2c00b9b to a0b8df4 Compare November 9, 2022 16:17

nvanbenschoten force-pushed the nvanbenschoten/nilSnapMsg branch from a0b8df4 to 0f9d7a4 Compare November 9, 2022 17:36

ahrtr approved these changes Nov 9, 2022


ahrtr merged commit ccec27b into etcd-io:main Nov 9, 2022

nvanbenschoten deleted the nvanbenschoten/nilSnapMsg branch November 9, 2022 23:15

ahrtr mentioned this pull request Nov 18, 2022

Add mix-version testing for sending snaspshot cases #14807

Closed

pav-kv mentioned this pull request Dec 5, 2022

*: upgrade etcd/raft to 8651478c cockroachdb/cockroach#93066

Merged
