
Reduce recovery time with compress or secure transport #36981

Merged
dnhatn merged 19 commits into elastic:master from dnhatn:file-chunks on Jan 14, 2019

Conversation

dnhatn
Member

@dnhatn dnhatn commented Dec 24, 2018

Today file chunks are sent sequentially, one by one, in peer recovery. This is a correct choice since the implementation is straightforward and recovery is network-bound most of the time. However, if the transport communication is secured, we might not be able to saturate the network bandwidth because encrypting/decrypting is compute-intensive.

With this commit, a source node can send multiple file chunks (defaulting to 2) without waiting for acknowledgments from the target.
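To make the pipelining concrete, here is a minimal, hypothetical sketch (not the actual Elasticsearch implementation; all names here are illustrative): a semaphore bounds how many file-chunk requests may be in flight, so the sender only blocks once the window of unacknowledged chunks is full.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Simplified sketch only: keep up to maxConcurrentFileChunks chunk requests in flight
// instead of waiting for each acknowledgment before sending the next chunk.
public class PipelinedChunkSender {
    private final int maxConcurrentFileChunks;
    private final Semaphore inFlight;
    private final ScheduledExecutorService network = Executors.newScheduledThreadPool(2);

    public PipelinedChunkSender(int maxConcurrentFileChunks) {
        this.maxConcurrentFileChunks = maxConcurrentFileChunks;
        this.inFlight = new Semaphore(maxConcurrentFileChunks);
    }

    // Stand-in for the transport layer: the future completes after a simulated round trip.
    private CompletableFuture<Void> sendChunkRequest(byte[] chunk) {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        network.schedule(() -> { ack.complete(null); }, 50, TimeUnit.MILLISECONDS);
        return ack;
    }

    public void sendFile(List<byte[]> chunks) throws InterruptedException {
        for (byte[] chunk : chunks) {
            inFlight.acquire();                                           // block only when the window is full
            sendChunkRequest(chunk).whenComplete((r, e) -> inFlight.release());
        }
        inFlight.acquire(maxConcurrentFileChunks);                        // drain: wait for all outstanding acks
        inFlight.release(maxConcurrentFileChunks);
        network.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        List<byte[]> chunks = List.of(new byte[512], new byte[512], new byte[512], new byte[512]);
        new PipelinedChunkSender(2).sendFile(chunks);
        System.out.println("all chunks acknowledged");
    }
}
```

The drain at the end of sendFile mirrors the behaviour discussed later in this thread: the sender waits for all outstanding acknowledgments before moving on.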

Below are the benchmark results for the PMC and NYC_taxis datasets. The benchmark consists of two GCP instances (8 CPUs, 32 GB RAM, 12GiB bandwidth, and local SSD).

- PMC (20.2 GB)

| Transport      | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| -------------- | -------- | -------- | -------- | -------- | -------- |
| Plain          | 184s     | 137s     | 106s     | 105s     | 106s     |
| TLS            | 346s     | 294s     | 176s     | 153s     | 117s     |
| Compress       | 1556s    | 1407s    | 1193s    | 1183s    | 1211s    |
| Compress + TLS | n/a      | n/a      | n/a      | n/a      | n/a      |

- NYC_Taxis (38.6GB)

| Transport      | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| -------------- | -------- | -------- | -------- | -------- | -------- |
| Plain          | 321s     | 249s     | 191s     | *        | *        |
| TLS            | 618s     | 539s     | 323s     | 290s     | 213s     |
| Compress       | 2622s    | 2421s    | 2018s    | 2029s    | n/a      |
| Compress + TLS | n/a      | n/a      | n/a      | n/a      | n/a      |

Relates #33844

@dnhatn dnhatn added >enhancement, :Distributed Indexing/Recovery, v7.0.0 and v6.7.0 labels Dec 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn requested review from s1monw, bleskes and ywelsch December 24, 2018 21:55
@dnhatn
Member Author

dnhatn commented Dec 24, 2018

run gradle build tests 2

Member

@original-brownbear original-brownbear left a comment


LGTM, just some minor suggestions
(probably still best to wait for a second opinion from someone else :))

@dnhatn
Member Author

dnhatn commented Dec 27, 2018

@original-brownbear Thanks for looking. I've addressed all your comments :).

@ywelsch ywelsch requested review from jasontedor and removed request for s1monw and bleskes December 27, 2018 12:38
Contributor

@ywelsch ywelsch left a comment


Thanks @dnhatn. I like the simplicity of this PR. I've left some comments and would like to ask you to also run tests with tcp.compress enabled (both for TLS enabled/disabled). It would also be good to see how much throughput we get at these numbers. For example, can we saturate a 10Gbit network with num_chunks = 2?

@dnhatn
Member Author

dnhatn commented Jan 2, 2019

@ywelsch I have addressed all your comments. Would you please have another look?

@dnhatn dnhatn requested a review from ywelsch January 2, 2019 07:09
@dnhatn
Member Author

dnhatn commented Jan 2, 2019

I ran a more realistic benchmark, which consists of two GCP instances (8 CPUs, 32 GB RAM, 10GiB bandwidth, and local SSD). Below are the results for the PMC and NYC_taxis datasets.

- PMC (20.2 GB)

| Transport      | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| -------------- | -------- | -------- | -------- | -------- | -------- |
| Plain          | 184s     | 137s     | 106s     | 105s     | 106s     |
| TLS            | 346s     | 294s     | 176s     | 153s     | 117s     |
| Compress       | 1556s    | 1407s    | 1193s    | 1183s    | 1211s    |
| Compress + TLS | n/a      | n/a      | n/a      | n/a      | n/a      |

- NYC_Taxis (38.6GB)

| Transport      | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| -------------- | -------- | -------- | -------- | -------- | -------- |
| Plain          | 321s     | 249s     | 191s     | *        | *        |
| TLS            | 618s     | 539s     | 323s     | 290s     | 213s     |
| Compress       | 2622s    | 2421s    | 2018s    | 2029s    | n/a      |
| Compress + TLS | n/a      | n/a      | n/a      | n/a      | n/a      |

The current approach does not reduce the recovery time much with compression because compression is too expensive: it takes 20 minutes to compress 20GB of Lucene index, and compression happens on (and blocks) the recovery thread. The recovery time decreases linearly with max_concurrent_file_chunks if we compress the file-chunk requests in parallel. However, I don't think we should do that, as it would take all the CPU, which we should reserve for higher-priority tasks such as search. Given that compression is so expensive and only saves around 16% of bandwidth, should we disable it for file-chunk requests regardless of the compression setting? Note that we compress chunk by chunk, not the whole file (see the sketch after the table below). Here is the benchmark that sends file-chunk requests in parallel for PMC.

| Transport | Baseline | par_chunks=2 | par_chunks=3 | par_chunks=4 |
| --------- | -------- | ------------ | ------------ | ------------ |
| Compress  | 1556s    | 737s         | 514s         | 207s         |
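To illustrate why per-chunk compression on the recovery thread hurts, here is a deliberately simplified example (plain java.util.zip, not Elasticsearch's transport compression code): the deflate work runs synchronously on the calling thread, so when it is done inline it blocks the thread that drives the recovery.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Illustrative only: compress one file chunk at a time, the way a transport layer might,
// rather than compressing the whole file. The deflate work happens on the calling thread.
public class ChunkCompression {

    static byte[] compressChunk(byte[] chunk) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater =
                 new DeflaterOutputStream(out, new Deflater(Deflater.DEFAULT_COMPRESSION))) {
            deflater.write(chunk);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] chunk = new byte[512 * 1024];                         // pretend 512 KB file chunk
        long start = System.nanoTime();
        byte[] compressed = compressChunk(chunk);
        System.out.printf("compressed %d -> %d bytes in %d us%n",
                chunk.length, compressed.length, (System.nanoTime() - start) / 1_000);
    }
}
```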

@ywelsch
Contributor

ywelsch commented Jan 3, 2019

> The recovery time decreases linearly with max_concurrent_file_chunks if we compress the file-chunk requests in parallel. However, I don't think we should do that, as it would take all the CPU, which we should reserve for higher-priority tasks such as search.

Note that by default we still throttle the sending of chunks, which lets the user control how much CPU to trade for recovery throughput.
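For illustration, a hand-rolled sketch of such a byte-rate throttle (not the actual rate limiter Elasticsearch uses for indices.recovery.max_bytes_per_sec): after each chunk, the sender pauses until its cumulative throughput drops back under the configured limit.

```java
// Minimal sketch of byte-rate throttling; values and class name are illustrative only.
public class SimpleByteRateLimiter {
    private final double maxBytesPerSec;
    private final long startNanos = System.nanoTime();
    private long totalBytes;

    public SimpleByteRateLimiter(double maxBytesPerSec) {
        this.maxBytesPerSec = maxBytesPerSec;
    }

    // Call after sending `bytes`: sleep until the cumulative throughput is back under the limit.
    public synchronized void maybePause(long bytes) throws InterruptedException {
        totalBytes += bytes;
        double earliestAllowedSeconds = totalBytes / maxBytesPerSec;   // when this many bytes are "due"
        double elapsedSeconds = (System.nanoTime() - startNanos) / 1e9;
        long sleepMillis = (long) ((earliestAllowedSeconds - elapsedSeconds) * 1000);
        if (sleepMillis > 0) {
            Thread.sleep(sleepMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleByteRateLimiter limiter = new SimpleByteRateLimiter(40 * 1024 * 1024); // ~40 MB/s
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            limiter.maybePause(512 * 1024);                            // pretend we just sent a 512 KB chunk
        }
        System.out.printf("sent 50 MB in %.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}
```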

> only saves around 16% of bandwidth?

Is the index using best_compression? Is it force-merged?

@dnhatn
Member Author

dnhatn commented Jan 3, 2019

> Is the index using best_compression? Is it force-merged?

No, the index uses the default index_codec, and force_merge was not called.

@ywelsch
Contributor

ywelsch commented Jan 7, 2019

With the recent change, it still does not pipeline sending requests across different files (i.e. one file needs to be completed before we start with the next one)?

@dnhatn
Member Author

dnhatn commented Jan 7, 2019

> i.e. one file needs to be completed before we start with the next one

Yes, the previous change and this change both wait for the completion of the current file before sending the next file. This is because we wait for all outstanding requests when we close the current RecoveryOutputStream.
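A sketch of that close-and-wait behaviour, with purely illustrative names (this is not the real RecoveryOutputStream): each write registers an in-flight chunk request, and close() blocks until every outstanding request has been acknowledged, which is why chunks of two different files are never in flight at the same time.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.Phaser;

// Illustrative sketch only: a per-file output stream whose close() waits for every
// outstanding chunk request of this file to be acknowledged before returning.
class PerFileChunkStream extends OutputStream {
    private final Phaser outstanding = new Phaser(1);      // one party registered for close() itself

    @Override
    public void write(int b) throws IOException {
        write(new byte[] { (byte) b }, 0, 1);
    }

    @Override
    public void write(byte[] chunk, int off, int len) {
        outstanding.register();                             // track one in-flight chunk request
        sendChunkAsync(chunk, off, len, outstanding::arriveAndDeregister);
    }

    // Stand-in for the async transport call; invokes onAck when the target acknowledges.
    private void sendChunkAsync(byte[] chunk, int off, int len, Runnable onAck) {
        new Thread(onAck).start();
    }

    @Override
    public void close() {
        outstanding.arriveAndAwaitAdvance();                // block until all outstanding acks arrive
    }
}
```

The later iteration of this PR keeps the chunk window open across file boundaries instead, so chunks of the next file can already be in flight while the previous file finishes.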

@ywelsch
Contributor

ywelsch commented Jan 7, 2019

How difficult would it be to lift that limitation?

@dnhatn
Member Author

dnhatn commented Jan 13, 2019

@ywelsch
I re-ran the recovery benchmark. The new implementation is slightly (though not significantly) faster than the previous one, because we keep sending file chunks in parallel even when switching between files. I also verified the interaction of max_concurrent_file_chunks with the rate limiter at max_bytes_per_sec=10mb, 40mb and 100mb, and it works as expected. You can find the results here.

I have responded to your comments. Please give this PR another shot. Thank you.

@dnhatn dnhatn requested a review from ywelsch January 13, 2019 22:56
Review threads on docs/reference/modules/indices/recovery.asciidoc (outdated, resolved)
@dnhatn
Member Author

dnhatn commented Jan 14, 2019

@ywelsch Thanks so much for proposing this idea and reviewing. Thanks @original-brownbear.

@dnhatn dnhatn merged commit 15aa376 into elastic:master Jan 14, 2019
@dnhatn dnhatn deleted the file-chunks branch January 14, 2019 20:14
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 15, 2019
* master: (28 commits)
  Introduce retention lease serialization (elastic#37447)
  Update Delete Watch to allow unknown fields (elastic#37435)
  Make finalize step of recovery source non-blocking (elastic#37388)
  Update the default for include_type_name to false. (elastic#37285)
  Security: remove SSL settings fallback (elastic#36846)
  Adding mapping for hostname field (elastic#37288)
  Relax assertSameDocIdsOnShards assertion
  Reduce recovery time with compress or secure transport (elastic#36981)
  Implement ccr file restore (elastic#37130)
  Fix Eclipse specific compilation issue (elastic#37419)
  Performance fix. Reduce deprecation calls for the same bulk request (elastic#37415)
  [ML] Use String rep of Version in map for serialisation (elastic#37416)
  Cleanup Deadcode in Rest Tests (elastic#37418)
  Mute IndexShardRetentionLeaseTests.testCommit elastic#37420
  unmuted test
  Remove unused index store in directory service
  Improve CloseWhileRelocatingShardsIT (elastic#37348)
  Fix ClusterBlock serialization and Close Index API logic after backport to 6.x (elastic#37360)
  Update the scroll example in the docs (elastic#37394)
  Update analysis.asciidoc (elastic#37404)
  ...
dnhatn added a commit that referenced this pull request Jan 15, 2019
Today file chunks are sent sequentially, one by one, in peer recovery. This is a
correct choice since the implementation is straightforward and recovery is
network-bound most of the time. However, if the connection is encrypted, we
might not be able to saturate the network pipe because encrypting/decrypting
is CPU-bound rather than network-bound.

With this commit, a source node can send multiple file chunks (defaulting to 2)
without waiting for the acknowledgments from the target.

Below are the benchmark results for PMC and NYC_taxis.

- PMC (20.2 GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| -------- | -------- | -------- | -------- |
| Plain     | 184s     | 137s     | 106s     | 105s     | 106s     |
| TLS       | 346s     | 294s     | 176s     | 153s     | 117s     |
| Compress  | 1556s    | 1407s    | 1193s    | 1183s    | 1211s    |

- NYC_Taxis (38.6GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| ---------| ---------| ---------| -------- |
| Plain     | 321s     | 249s     | 191s     |  *       | *        |
| TLS       | 618s     | 539s     | 323s     | 290s     | 213s     |
| Compress  | 2622s    | 2421s    | 2018s    | 2029s    | n/a      |

Relates #33844
dnhatn added a commit that referenced this pull request Jan 15, 2019
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 11, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 11, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 11, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 11, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 11, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
kovrus added a commit to crate/crate that referenced this pull request Sep 12, 2019
mergify bot pushed a commit to crate/crate that referenced this pull request Sep 12, 2019