raft: make Message.Snapshot nullable, halve struct size #14592

Merged
merged 1 commit into from
Nov 9, 2022


nvanbenschoten
Contributor

This commit makes the rarely used raftpb.Message.Snapshot field nullable. In doing so, it reduces the memory size of a raftpb.Message message from 264 bytes to 128 bytes — a 52% reduction in size.
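
For reference, the size difference can be reproduced with a small standalone program. This is only a sketch, assuming a 64-bit platform: `messageBefore`, `messageAfter`, and the other types are hand-written illustrative mirrors whose field layout approximates the generated `raftpb` structs; they are not the generated code itself.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Illustrative mirrors of the raftpb types (assumed layout, not generated code).
type entry struct {
	Term, Index uint64
	Type        int32
	Data        []byte
}

type confState struct {
	Voters, Learners, VotersOutgoing, LearnersNext []uint64
	AutoLeave                                      bool
}

type snapshotMetadata struct {
	ConfState   confState
	Index, Term uint64
}

type snapshot struct {
	Data     []byte
	Metadata snapshotMetadata
}

// messageBefore embeds the snapshot by value (nullable = false).
type messageBefore struct {
	Type                           int32
	To, From, Term, LogTerm, Index uint64
	Entries                        []entry
	Commit                         uint64
	Snapshot                       snapshot
	Reject                         bool
	RejectHint                     uint64
	Context                        []byte
}

// messageAfter holds the snapshot behind a pointer (nullable = true).
type messageAfter struct {
	Type                           int32
	To, From, Term, LogTerm, Index uint64
	Entries                        []entry
	Commit                         uint64
	Snapshot                       *snapshot
	Reject                         bool
	RejectHint                     uint64
	Context                        []byte
}

// sizes reports the in-memory size of both layouts in bytes.
func sizes() (before, after uintptr) {
	return unsafe.Sizeof(messageBefore{}), unsafe.Sizeof(messageAfter{})
}

func main() {
	before, after := sizes()
	fmt.Printf("before=%d bytes, after=%d bytes\n", before, after)
}
```

On a 64-bit platform these mirror types come out at 264 and 128 bytes: the inline snapshot (144 bytes in this layout) is replaced by an 8-byte pointer.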

While this commit does not change the protobuf encoding, it does change how that encoding is used. (gogoproto.nullable) = false instructs the generated proto marshaling logic to always encode a value for the field, even if that value is empty. (gogoproto.nullable) = true instructs the generated proto marshaling logic to omit an encoded value for the field if the field is nil.

This raises compatibility concerns in both directions. Messages encoded by new binary versions without a Snapshot field will be decoded as an empty field by old binary versions. In other words, old binary versions can't tell the difference. However, messages encoded by old binary versions with an empty Snapshot field will be decoded as a non-nil, empty field by new binary versions. As a result, new binary versions need to be prepared to handle such messages.
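
The receiving side of this can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `message` and `snapshot` types are simplified stand-ins, `hasSnapshot` is a hypothetical helper name, and the sketch assumes that a zero metadata index identifies an empty snapshot.

```go
package main

import "fmt"

// Simplified stand-ins for the raftpb types (assumptions for illustration).
type snapshotMetadata struct {
	Index, Term uint64
}

type snapshot struct {
	Data     []byte
	Metadata snapshotMetadata
}

type message struct {
	Snapshot *snapshot // nullable after this change
}

// isEmptySnap treats a snapshot with a zero index as empty (assumed convention).
func isEmptySnap(s snapshot) bool {
	return s.Metadata.Index == 0
}

// hasSnapshot is a hypothetical helper: it accepts both a nil field (sent by a
// new binary) and a non-nil but empty field (decoded from an old binary).
func hasSnapshot(m message) bool {
	return m.Snapshot != nil && !isEmptySnap(*m.Snapshot)
}

func main() {
	fromNew := message{}                      // field omitted on the wire
	fromOld := message{Snapshot: &snapshot{}} // empty field was encoded
	withSnap := message{Snapshot: &snapshot{
		Metadata: snapshotMetadata{Index: 10, Term: 2},
	}}
	fmt.Println(hasSnapshot(fromNew), hasSnapshot(fromOld), hasSnapshot(withSnap))
}
```

Checking only `m.Snapshot != nil` would misclassify the message from the old binary, which is why the nil check and the emptiness check are combined here.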

While Message.Snapshot is not intentionally part of the external interface of this library, it was possible for users of the library to access it and manipulate it. As such, this change may be considered a breaking change.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nilSnapMsg branch 2 times, most recently from 8a6a77a to 2c00b9b Compare October 16, 2022 04:51
@codecov-commenter

codecov-commenter commented Oct 16, 2022

Codecov Report

Merging #14592 (0f9d7a4) into main (00820f0) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #14592      +/-   ##
==========================================
+ Coverage   75.48%   75.52%   +0.03%     
==========================================
  Files         457      457              
  Lines       37314    37317       +3     
==========================================
+ Hits        28166    28182      +16     
+ Misses       7369     7361       -8     
+ Partials     1779     1774       -5     
Flag Coverage Δ
all 75.52% <100.00%> (+0.03%) ⬆️


Impacted Files Coverage Δ
raft/raft.go 89.58% <100.00%> (+0.03%) ⬆️
raft/util.go 87.59% <100.00%> (ø)
server/etcdserver/snapshot_merge.go 75.00% <100.00%> (ø)
server/storage/mvcc/watchable_store.go 85.14% <0.00%> (-4.72%) ⬇️
server/etcdserver/txn/util.go 75.47% <0.00%> (-3.78%) ⬇️
server/etcdserver/cluster_util.go 70.35% <0.00%> (-3.17%) ⬇️
client/pkg/v3/testutil/leak.go 60.17% <0.00%> (-2.66%) ⬇️
api/etcdserverpb/raft_internal_stringer.go 81.72% <0.00%> (-2.16%) ⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go 69.56% <0.00%> (-1.74%) ⬇️
server/etcdserver/raft.go 88.63% <0.00%> (-1.14%) ⬇️
... and 16 more


@ahrtr
Member

ahrtr commented Oct 16, 2022

This looks like a valid improvement to me.

A couple of generic questions:

  1. Have you verified that an etcd instance with the new raft (enhanced in this PR) can work with an etcd with the old raft?
  2. Have you run any benchmarks, and if so, how much does performance improve?

cc @ptabor @tbg

@serathius
Member

> This commit makes the rarely used raftpb.Message.Snapshot field nullable. In doing so, it reduces the memory size of a raftpb.Message message from 264 bytes to 128 bytes, a 52% reduction in size.

If the message is rarely used, why change it at all? A 52% improvement seems great, but if "rare" means that ~1/1000 proto messages are snapshots, then this is only a 0.05% improvement.

This would not be such a bad trade-off if it didn't introduce breaking-change concerns. Now we need to spend time adding specific e2e tests just to ensure this doesn't break anything. I would say the work we will need to put in outweighs the improvement.

Is there no lower-hanging fruit than this?

@nvanbenschoten
Contributor Author

> Have you verified that an etcd instance with the new raft (enhanced in this PR) can work with an etcd with the old raft?

No, not yet. We should perform such verification. Are there cross-version tests in etcd? @serathius mentioned e2e testing, which may include mixed version testing. If no such testing exists, we can piggyback on top of CockroachDB's mixed-version testing to gain more confidence in this change.

> Have you run any benchmarks, and if so, how much does performance improve?

Also not yet. Are there any such benchmarks in etcd which exercise raft that you'd like to see? If not, I can evaluate this change using the raft benchmarking harness (rafttoy) that we developed to measure the performance impact of variations in the use of etcd/raft. The harness should also pick up on changes to etcd/raft itself.

> If the message is rarely used, why change it at all? A 52% improvement seems great, but if "rare" means that ~1/1000 proto messages are snapshots, then this is only a 0.05% improvement.

etcd/raft exposes a message-passing interface. Every operation (proposal, log append, vote, heartbeat, snapshot, etc.) is encoded into the same raftpb.Message struct. This is the list of all message types. The structs are passed by value into various functions, packed into slices, and shipped over the network.

This commit makes the snapshot header in the message struct nullable, so that non-snapshot messages (the ~999/1000 messages that aren't snapshots) don't need to pay the cost of this large snapshot header either in-memory or over the wire. That reduces the size of the memory representation of a raftpb.Message from 264 bytes to 128 bytes (on a 64-bit architecture). It also reduces the proto encoded size of an empty raftpb.Message from 30 bytes to 18 bytes.
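
To illustrate where the wire bytes for an empty embedded message come from, here is a minimal hand-rolled sketch of the two marshaling behaviours. It is not the gogoproto-generated code: the one-field `snapshot` type, the field numbers, and the helper names are all assumptions made for illustration.

```go
package main

import "fmt"

// snapshot is a hypothetical one-field nested message, not raftpb.Snapshot.
type snapshot struct {
	index uint64
}

// appendVarint appends v in protobuf base-128 varint form.
func appendVarint(b []byte, v uint64) []byte {
	for v >= 0x80 {
		b = append(b, byte(v)|0x80)
		v >>= 7
	}
	return append(b, byte(v))
}

// marshalSnapshot encodes the nested body. The index behaves like a
// non-nullable scalar: it is written even when it is zero.
func marshalSnapshot(s snapshot) []byte {
	return appendVarint([]byte{0x08}, s.index) // field 1, wire type 0 (varint)
}

// appendEmbedded writes a length-delimited field: tag, length, payload.
func appendEmbedded(b []byte, fieldNum uint64, payload []byte) []byte {
	b = appendVarint(b, fieldNum<<3|2) // wire type 2 = length-delimited
	b = appendVarint(b, uint64(len(payload)))
	return append(b, payload...)
}

const snapshotField = 9 // illustrative field number

// marshalNonNullable mimics nullable = false: the value is always encoded.
func marshalNonNullable(snap snapshot) []byte {
	return appendEmbedded(nil, snapshotField, marshalSnapshot(snap))
}

// marshalNullable mimics nullable = true: a nil pointer is omitted entirely.
func marshalNullable(snap *snapshot) []byte {
	if snap == nil {
		return nil
	}
	return appendEmbedded(nil, snapshotField, marshalSnapshot(*snap))
}

func main() {
	fmt.Println(len(marshalNonNullable(snapshot{})), len(marshalNullable(nil)))
}
```

In this toy encoding an empty value field costs 4 bytes while a nil pointer costs none. The real snapshot header carries more always-encoded nested fields, which is consistent with the larger 12-byte difference (30 versus 18 bytes) reported above.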

@ahrtr
Member

ahrtr commented Oct 17, 2022

We do have a cluster downgrade test, but we don't have mixed-version testing, such as 3.5 mixed with 3.6. We need to make sure an etcd cluster with mixed versions can handle raft messages correctly, so please add at least one e2e test for this.

Currently we only have a benchmark tool (FYI. #14394 (comment)). I think we also need to verify memory and network bandwidth usage, probably via the metrics. Would it also make sense to add a Go benchmark test?

@nvanbenschoten
Contributor Author

Thanks for the pointers @ahrtr. I ran some system and micro-benchmarks.

All performance testing was run on a 30vCPU, 128GB, Intel(R) Xeon(R) CPU @ 3.10GHz GCP instance.

Results on one-server cluster

$ ./bin/etcd --quota-backend-bytes=4300000000

$ go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=1000000000 --key-size=128 --val-size=1024  --total=1000000 --rate=40000

Results on main

Summary:
  Total:	31.4938 secs.
  Slowest:	0.0483 secs.
  Fastest:	0.0012 secs.
  Average:	0.0052 secs.
  Stddev:	0.0046 secs.
  Requests/sec:	31752.2840

Response time histogram:
  0.0012 [1]	|
  0.0059 [893343]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0106 [36584]	|∎
  0.0153 [12309]	|
  0.0200 [9395]	|
  0.0247 [40703]	|∎
  0.0295 [4590]	|
  0.0342 [2111]	|
  0.0389 [544]	|
  0.0436 [219]	|
  0.0483 [201]	|

Latency distribution:
  10% in 0.0035 secs.
  25% in 0.0037 secs.
  50% in 0.0039 secs.
  75% in 0.0043 secs.
  90% in 0.0062 secs.
  95% in 0.0195 secs.
  99% in 0.0242 secs.
  99.9% in 0.0341 secs.

Results on branch nvanbenschoten/nilSnapMsg

Summary:
  Total:	31.4111 secs.
  Slowest:	0.0459 secs.
  Fastest:	0.0008 secs.
  Average:	0.0052 secs.
  Stddev:	0.0045 secs.
  Requests/sec:	31835.8783

Response time histogram:
  0.0008 [1]	|
  0.0053 [875383]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0098 [49729]	|∎∎
  0.0143 [14657]	|
  0.0188 [9009]	|
  0.0234 [39721]	|∎
  0.0279 [8964]	|
  0.0324 [1364]	|
  0.0369 [746]	|
  0.0414 [359]	|
  0.0459 [67]	|

Latency distribution:
  10% in 0.0034 secs.
  25% in 0.0036 secs.
  50% in 0.0039 secs.
  75% in 0.0042 secs.
  90% in 0.0064 secs.
  95% in 0.0193 secs.
  99% in 0.0236 secs.
  99.9% in 0.0337 secs.

Results on three-server cluster

$ goreman start

$ go run ./tools/benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=1000000000 --key-size=128 --val-size=1024  --total=1000000 --rate=40000

Results on main

Summary:
  Total:	63.4617 secs.
  Slowest:	0.0578 secs.
  Fastest:	0.0034 secs.
  Average:	0.0123 secs.
  Stddev:	0.0049 secs.
  Requests/sec:	15757.5301

Response time histogram:
  0.0034 [1]	|
  0.0088 [94433]	|∎∎∎∎∎
  0.0142 [731650]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0197 [73329]	|∎∎∎∎
  0.0251 [66848]	|∎∎∎
  0.0306 [23264]	|∎
  0.0360 [6692]	|
  0.0415 [2576]	|
  0.0469 [892]	|
  0.0524 [121]	|
  0.0578 [194]	|

Latency distribution:
  10% in 0.0089 secs.
  25% in 0.0097 secs.
  50% in 0.0107 secs.
  75% in 0.0125 secs.
  90% in 0.0197 secs.
  95% in 0.0234 secs.
  99% in 0.0308 secs.
  99.9% in 0.0423 secs.

Results on branch nvanbenschoten/nilSnapMsg

Summary:
  Total:	63.4009 secs.
  Slowest:	0.1893 secs.
  Fastest:	0.0033 secs.
  Average:	0.0123 secs.
  Stddev:	0.0055 secs.
  Requests/sec:	15772.6380

Response time histogram:
  0.0033 [1]	|
  0.0219 [926219]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0405 [72671]	|∎∎∎
  0.0591 [909]	|
  0.0777 [0]	|
  0.0963 [0]	|
  0.1149 [0]	|
  0.1335 [0]	|
  0.1521 [0]	|
  0.1707 [0]	|
  0.1893 [200]	|

Latency distribution:
  10% in 0.0088 secs.
  25% in 0.0096 secs.
  50% in 0.0107 secs.
  75% in 0.0123 secs.
  90% in 0.0199 secs.
  95% in 0.0238 secs.
  99% in 0.0307 secs.
  99.9% in 0.0411 secs.

These tests show stable results and very little difference across branches. Maybe a 0.1% increase in throughput, but I did not run enough iterations to determine whether that difference is in the noise. That's mostly expected, given the size of the benchmark, its interaction with disk and network IO, and how targeted this change is.

Results on microbenchmarks

Microbenchmarks show a statistically significant and sizable improvement in proposal latency and memory allocations (count and size).

name               old time/op    new time/op    delta
Proposal3Nodes-30    3.86µs ±16%    3.69µs ±23%   -4.37%  (p=0.000 n=100+100)

name               old alloc/op   new alloc/op   delta
Proposal3Nodes-30    4.75kB ±39%    3.82kB ±73%  -19.58%  (p=0.000 n=100+100)

name               old allocs/op  new allocs/op  delta
Proposal3Nodes-30      20.4 ±18%      18.0 ±33%  -11.66%  (p=0.000 n=100+100)
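
A rough way to see the allocation effect in isolation is a benchmark that allocates batches of messages with an inline versus a pointer snapshot field. This is a sketch only and is not the `Proposal3Nodes` benchmark above: the `valueMsg` and `pointerMsg` types, the batch size, and the helper names are assumptions, and `testing.Benchmark` is used so that the program runs without `go test`.

```go
package main

import (
	"fmt"
	"testing"
)

// snapshot is a stand-in for a large snapshot header (hypothetical layout).
type snapshot struct {
	data []byte
	meta [15]uint64
}

// valueMsg embeds the snapshot inline; pointerMsg holds it behind a pointer.
type valueMsg struct {
	header [15]uint64
	snap   snapshot
}

type pointerMsg struct {
	header [15]uint64
	snap   *snapshot
}

const batch = 16 // messages per allocated slice (arbitrary)

// Package-level sinks force the slices to escape to the heap.
var (
	sinkValue   []valueMsg
	sinkPointer []pointerMsg
)

func benchValue(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkValue = make([]valueMsg, batch)
	}
}

func benchPointer(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sinkPointer = make([]pointerMsg, batch)
	}
}

// bytesPerOp runs both benchmarks and returns the bytes allocated per batch.
func bytesPerOp() (value, pointer int64) {
	return testing.Benchmark(benchValue).AllocedBytesPerOp(),
		testing.Benchmark(benchPointer).AllocedBytesPerOp()
}

func main() {
	v, p := bytesPerOp()
	fmt.Printf("value: %d B/op, pointer: %d B/op\n", v, p)
}
```

This only measures slice allocation for non-snapshot messages; it says nothing about proposal latency, for which the benchstat output above remains the relevant measurement.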

@ahrtr
Member

ahrtr commented Oct 23, 2022

Thanks @nvanbenschoten for the feedback, which looks good. Usually messages do not include a snapshot, so it doesn't make sense to always include the snapshot header in each message. Overall, this PR looks good to me. cc @ptabor and @tbg to double check.

We also need to add a mixed-version test (main/3.6 mixed with the previous version, 3.5). Please let me know if you need help with this.

@serathius
Member

I had an idea for how to incorporate mixed-version testing into the common tests by just adding new test scenarios.
PTAL #14610

@nvanbenschoten
Contributor Author

> We also need to add a mixed-version test (main/3.6 mixed with the previous version, 3.5). Please let me know if you need help with this.

I would appreciate some help with this, as I'm not up to speed on the state of mixed-version testing in etcd and it also sounds like this is in flux. At a high level, I think a test that 1) sends a Raft snapshot from main/3.6 to previous/3.5 and then 2) sends a Raft snapshot from previous/3.5 to main/3.6 would give us confidence in this change's correctness in a mixed-version setting.

@ahrtr
Member

ahrtr commented Oct 27, 2022

> I would appreciate some help with this, as I'm not up to speed on the state of mixed-version testing in etcd and it also sounds like this is in flux. At a high level, I think a test that 1) sends a Raft snapshot from main/3.6 to previous/3.5 and then 2) sends a Raft snapshot from previous/3.5 to main/3.6 would give us confidence in this change's correctness in a mixed-version setting.

Sure. @serathius and I will take care of the e2e test.

@ahrtr ahrtr added the area/raft label Nov 7, 2022
@tbg
Contributor

tbg commented Nov 7, 2022

@ahrtr did you also want to review? Going to assign you just in case.

@tbg tbg requested a review from ahrtr November 7, 2022 12:36
@ahrtr
Member

ahrtr commented Nov 7, 2022

> @ahrtr did you also want to review? Going to assign you just in case.

We are still in the process of adding the mixed-version tests. We already added some in #14697, but I am planning to add more cases to cover the snapshot scenario, as pointed out in #14592 (comment).

Once I finish the snapshot cases, I will ask you to rebase this PR. If all workflows are green, I am OK with merging it.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nilSnapMsg branch from 2c00b9b to a0b8df4 Compare November 9, 2022 16:17
This commit makes the rarely used `raftpb.Message.Snapshot` field nullable.
In doing so, it reduces the memory size of a `raftpb.Message` message from
264 bytes to 128 bytes — a 52% reduction in size.

While this commit does not change the protobuf encoding, it does change
how that encoding is used. `(gogoproto.nullable) = false` instructs the
generated proto marshaling logic to always encode a value for the field,
even if that value is empty. `(gogoproto.nullable) = true` instructs the
generated proto marshaling logic to omit an encoded value for the field
if the field is nil.

This raises compatibility concerns in both directions. Messages encoded
by new binary versions without a `Snapshot` field will be decoded as an
empty field by old binary versions. In other words, old binary versions
can't tell the difference. However, messages encoded by old binary versions
with an empty Snapshot field will be decoded as a non-nil, empty field by
new binary versions. As a result, new binary versions need to be prepared
to handle such messages.

While Message.Snapshot is not intentionally part of the external interface
of this library, it was possible for users of the library to access it and
manipulate it. As such, this change may be considered a breaking change.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nilSnapMsg branch from a0b8df4 to 0f9d7a4 Compare November 9, 2022 17:36
@nvanbenschoten
Contributor Author

> Once I finish the snapshot cases, I will ask you to rebase this PR. If all workflows are green, I am OK with merging it.

@ahrtr This is rebased with green CI. Thanks again for adding the mixed-version testing in #14697.

Member

@ahrtr ahrtr left a comment


LGTM

Thank you @nvanbenschoten

I will add an additional mixed-version test (FYI. #14707 (comment)), but this PR should be good to go.

@ahrtr ahrtr merged commit ccec27b into etcd-io:main Nov 9, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/nilSnapMsg branch November 9, 2022 23:15
5 participants