Reduce recovery time with compress or secure transport #36981
Conversation
Pinging @elastic/es-distributed
run gradle build tests 2
LGTM, just some minor suggestions (probably still best to wait for a second opinion from someone else :))
Review threads (all outdated, resolved):
- server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
- server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
- server/src/main/java/org/elasticsearch/indices/recovery/RemoteRecoveryTargetHandler.java
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySettings.java
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
- server/src/test/java/org/elasticsearch/indices/recovery/RecoverySourceHandlerTests.java
- server/src/test/java/org/elasticsearch/recovery/RelocationIT.java
- server/src/test/java/org/elasticsearch/indices/recovery/PeerRecoveryTargetServiceTests.java
- server/src/test/java/org/elasticsearch/indices/recovery/PeerRecoveryTargetServiceTests.java
@original-brownbear Thanks for looking. I've addressed all your comments :).
Thanks @dnhatn. I like the simplicity of this PR. I've left some comments and would like to ask you to also run tests with `tcp.compress` enabled (both with TLS enabled and disabled). It would also be good to see how much throughput we get at these numbers. For example, can we saturate a 10Gbit network with `num_chunks = 2`?
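For reference, a minimal sketch of how such a test run might enable wire compression in node settings. The setting name `transport.tcp.compress` is my assumption (the 6.x-era name), not taken from this thread:

```java
import org.elasticsearch.common.settings.Settings;

public class CompressedTransportSettings {
    // Node settings with transport-level compression turned on, so every
    // transport message (including recovery file chunks) is compressed.
    public static Settings compressedNodeSettings() {
        return Settings.builder()
                .put("transport.tcp.compress", true) // assumed 6.x setting name
                .build();
    }
}
```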
Review threads (resolved):
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySettings.java (outdated)
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySettings.java (outdated)
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java (outdated)
- server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java (outdated)
- server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java (outdated)
- server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
- server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
- server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
@ywelsch I have addressed all your comments. Would you please have another look?
I ran a more realistic benchmark on two GCP instances (8 CPUs, 32GB RAM, 10Gbit bandwidth, and local SSDs). Below are the results for the PMC and NYC_taxis datasets.
The current approach does not reduce the recovery time much with compression because compression is too expensive: it takes 20 minutes to compress 20GB of Lucene index, and the compression happens on (and blocks) the recovery thread. With TLS, however, the recovery time is reduced roughly linearly with the number of concurrent chunks.
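For intuition on why compress-on-write caps throughput, here is a self-contained sketch (plain JDK, not the actual Elasticsearch transport code): the deflate work runs on the calling thread, so nothing else can be sent until it finishes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressOnSenderThread {
    // Illustrative only: compressing a chunk synchronously, the way a
    // transport layer that compresses on write would. The deflate work runs
    // on the sending (recovery) thread, so no further chunks go out until it
    // finishes -- which is why compression caps recovery throughput.
    static byte[] compress(byte[] chunk) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate =
                 new DeflaterOutputStream(out, new Deflater(Deflater.DEFAULT_COMPRESSION))) {
            deflate.write(chunk);
        }
        return out.toByteArray();
    }
}
```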
Note that by default we still throttle the sending of chunks, which lets the user control how much CPU to trade for recovery throughput.
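As a sketch of what that throttle looks like, assuming Lucene's `SimpleRateLimiter` (which the recovery settings use under the hood); the `sendChunk` hook and class are illustrative, not the real sender:

```java
import org.apache.lucene.store.RateLimiter;
import org.apache.lucene.store.RateLimiter.SimpleRateLimiter;

public class ThrottledChunkSender {
    // 40 MB/sec mirrors the long-standing default of
    // indices.recovery.max_bytes_per_sec.
    private final RateLimiter rateLimiter = new SimpleRateLimiter(40.0);

    void sendChunk(byte[] chunk) {
        rateLimiter.pause(chunk.length); // sleeps once we exceed the MB/sec budget
        // ... hand the chunk to the transport layer here ...
    }
}
```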
Is the index using best_compression?
No, the index uses the default index_codec and force_merge was not called.
With the recent change, it does not pipeline sending requests across different files (i.e. one file needs to be completed before we start with the next one)?
Yes, the previous change and this change both wait for the completion of the current file before sending the next file. This is because we wait for all outstanding requests when we close the current RecoveryOutputStream.
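A minimal sketch of that per-file barrier, assuming a semaphore caps in-flight chunk requests and close() drains every permit (class and method names are illustrative, not the actual Elasticsearch code):

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch: at most `maxConcurrentChunks` chunk requests may be
// in flight; close() blocks until every outstanding request has been acked,
// which is why the next file cannot start before the current one completes.
public class BoundedChunkStream implements AutoCloseable {
    private final int maxConcurrentChunks;
    private final Semaphore inFlight;

    public BoundedChunkStream(int maxConcurrentChunks) {
        this.maxConcurrentChunks = maxConcurrentChunks;
        this.inFlight = new Semaphore(maxConcurrentChunks);
    }

    public void writeChunk(byte[] chunk) throws InterruptedException {
        inFlight.acquire();                  // blocks once the window is full
        sendAsync(chunk, inFlight::release); // permit returns when the target acks
    }

    @Override
    public void close() throws InterruptedException {
        // Wait for all outstanding chunks: reacquiring every permit is the barrier.
        inFlight.acquire(maxConcurrentChunks);
        inFlight.release(maxConcurrentChunks);
    }

    // Stand-in for the real transport call; invokes onAck when the reply arrives.
    private void sendAsync(byte[] chunk, Runnable onAck) {
        // ... transport.send(chunk, response -> onAck.run()) ...
        onAck.run(); // placeholder so the sketch is self-contained
    }
}
```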
How difficult would it be to lift that limitation?
@ywelsch I have responded to your comments. Please give this PR another look. Thank you.
@ywelsch Thanks so much for proposing this idea and reviewing. Thanks @original-brownbear.
* master: (28 commits)
  Introduce retention lease serialization (elastic#37447)
  Update Delete Watch to allow unknown fields (elastic#37435)
  Make finalize step of recovery source non-blocking (elastic#37388)
  Update the default for include_type_name to false. (elastic#37285)
  Security: remove SSL settings fallback (elastic#36846)
  Adding mapping for hostname field (elastic#37288)
  Relax assertSameDocIdsOnShards assertion
  Reduce recovery time with compress or secure transport (elastic#36981)
  Implement ccr file restore (elastic#37130)
  Fix Eclipse specific compilation issue (elastic#37419)
  Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415)
  [ML] Use String rep of Version in map for serialisation (elastic#37416)
  Cleanup Deadcode in Rest Tests (elastic#37418)
  Mute IndexShardRetentionLeaseTests.testCommit elastic#37420
  unmuted test
  Remove unused index store in directory service
  Improve CloseWhileRelocatingShardsIT (elastic#37348)
  Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360)
  Update the scroll example in the docs (elastic#37394)
  Update analysis.asciidoc (elastic#37404)
  ...
Today file-chunks are sent sequentially one by one in peer-recovery. This is a correct choice since the implementation is straightforward and recovery is network-bound most of the time. However, if the connection is encrypted, we might not be able to saturate the network pipe because encryption/decryption is CPU-bound rather than network-bound.

With this commit, a source node can send multiple file-chunks (defaulting to 2) without waiting for the acknowledgments from the target.

Below are the benchmark results for PMC and NYC_taxis.

- PMC (20.2 GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| --------- | -------- | -------- | -------- | -------- | -------- |
| Plain     | 184s     | 137s     | 106s     | 105s     | 106s     |
| TLS       | 346s     | 294s     | 176s     | 153s     | 117s     |
| Compress  | 1556s    | 1407s    | 1193s    | 1183s    | 1211s    |

- NYC_Taxis (38.6GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| --------- | -------- | -------- | -------- | -------- | -------- |
| Plain     | 321s     | 249s     | 191s     | *        | *        |
| TLS       | 618s     | 539s     | 323s     | 290s     | 213s     |
| Compress  | 2622s    | 2421s    | 2018s    | 2029s    | n/a      |

Relates #33844
Today file-chunks are sent sequentially one by one in peer-recovery. This is a correct choice since the implementation is straightforward and recovery is network-bound most of the time. However, if the transport communication is secure, we might not be able to saturate the network bandwidth because encryption/decryption is compute-intensive.

With this commit, a source node can send multiple file-chunks (defaulting to 2) without waiting for the acknowledgments from the target.

Below are the benchmark results for the PMC and NYC_taxis datasets. The benchmark consists of two GCP instances (8 CPUs, 32GB RAM, 10Gbit bandwidth, and local SSD).
Relates #33844
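A hedged usage sketch to close with: bumping the number of concurrent file chunks via the Java client's cluster-settings API. The setting name `indices.recovery.max_concurrent_file_chunks` is my assumption about what this change registered; verify it against the merged code before relying on it.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class RecoveryTuning {
    // Raise the number of file chunks a source node may keep in flight.
    public static void setMaxConcurrentFileChunks(Client client, int chunks) {
        client.admin().cluster().prepareUpdateSettings()
                .setTransientSettings(Settings.builder()
                        // assumed setting name introduced by this change
                        .put("indices.recovery.max_concurrent_file_chunks", chunks))
                .get();
    }
}
```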