
CopyBytesSocketChannel corrupts messages under load #45444

Closed
danielmitterdorfer opened this issue Aug 12, 2019 · 2 comments · Fixed by #45463
danielmitterdorfer commented Aug 12, 2019

In our nightly benchmarks we have noticed response timeouts in several benchmarks since Friday's nightly run. Investigating the range of commits, we narrowed the cause down to e0f9d61. Starting with this commit we see corrupted responses like the following (note the "_seq_noc" in the middle of the output):

{"index":{"_index":"weather-data-2016","_type":"_doc","_id":"mrbxhGwBLEbHlJDxHYLj","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":18764,"_primary_term":1,"status":201}},
{"index":{"_index":"weather-data-2016","_type":"_doc","_id":"m7bxhGwBLEbHlJDxHYLj","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_noc","_id":"CrbxhGwBLEbHlJDxHY3m","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":40023,"_primary_term":1,"status":201}}

Steps to reproduce:

This problem can be reproduced easily with Rally with the noaa track:

esrally --track=noaa --challenge=append-no-conflicts --on-error=abort --revision=e0f9d61becc45a470c97f3792872bdeb7fc9cae6

Almost immediately after starting the benchmark, we get an error message along these lines (on the client):

JSONDecodeError("Expecting ':' delimiter: line 1 column 482447 (char 482446)")

indicating corrupted JSON in the HTTP response.

Additional notes

  • By reducing the bulk size from the default of 5000 documents per bulk request to e.g. 1000 documents with --track-params="bulk_size:1000", the benchmark finishes fine, so the problem does not seem to manifest with smaller requests / responses.
elasticmachine (Collaborator) commented
Pinging @elastic/es-distributed

Tim-Brooks added a commit that referenced this issue Aug 12, 2019
Currently we take the array of nio buffers from the netty channel
outbound buffer and copy their bytes to a direct buffer. In the process
we mutate the nio buffer positions. It seems that netty will continue to
reuse these buffers. This means that any data that is not flushed in a
call is lost. This commit fixes this by incrementing the positions after
the flush has completed. This is similar to the behavior that
SocketChannel would have provided and that netty relied upon.

Fixes #45444.
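The fix described in the commit message above can be sketched in Java: copy from the source nio buffers without consuming their positions (e.g. via duplicate()), and only advance the positions afterwards by the number of bytes the socket actually accepted, so unflushed bytes remain pending in the buffers that netty reuses. The class and method names below are illustrative assumptions, not the actual Elasticsearch code.

```java
import java.nio.ByteBuffer;

// Minimal sketch of the described fix. Copying through duplicate() leaves
// the original buffer positions untouched; advancePositions() then mimics
// what SocketChannel.write would have done for the bytes actually written.
public class CopyBytesWriteSketch {

    // Copy as many bytes as fit into the shared direct buffer, without
    // mutating the source buffers' positions.
    static int copyToDirect(ByteBuffer[] sources, ByteBuffer direct) {
        int copied = 0;
        for (ByteBuffer src : sources) {
            ByteBuffer dup = src.duplicate();           // independent position/limit
            int n = Math.min(dup.remaining(), direct.remaining());
            dup.limit(dup.position() + n);
            direct.put(dup);                            // advances dup, not src
            copied += n;
            if (!direct.hasRemaining()) {
                break;
            }
        }
        return copied;
    }

    // After the socket write completes, advance the source positions by the
    // number of bytes actually flushed. Unflushed bytes stay in the buffers
    // instead of being silently lost.
    static void advancePositions(ByteBuffer[] sources, int written) {
        for (ByteBuffer src : sources) {
            int n = Math.min(src.remaining(), written);
            src.position(src.position() + n);
            written -= n;
            if (written == 0) {
                break;
            }
        }
    }
}
```

The original bug corresponds to advancing the positions during the copy itself: if the socket then accepts only part of the direct buffer, the remaining bytes are already marked as consumed in the source buffers and never get resent, which is exactly the mid-message corruption seen in the bulk responses.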
nchauhan5 commented
@jakelandis @droberts195 @mikecote @danielmitterdorfer - I am still seeing this issue in my benchmarking test with Elasticsearch version 7.17.1, Python 2.7, elasticsearch library 7.17.1, and bulk updates for 5 million records done through 8 processes using multiprocessing.Pool.apply_async. One of the processes always returns a JSONDecodeError when printing ApplyResult.get(). Please note the difference between the two results received from the Elasticsearch client:

Correct result:
{"index":{"_index":"test_index","_type":"_doc","_id":"F1pTnX8B4XBe97ExF-mU","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":61538,"_primary_term":1,"status":201}},

Incorrect result causing the JSONDecodeError:
{"index":{"_index":"test_index","_type""_primary_term":1,"status":201}},

Note how the ":" delimiter is missing after "_type"; that is what triggers the JSONDecodeError.
