CopyBytesSocketChannel corrupts messages under load #45444
Pinging @elastic/es-distributed
Currently we take the array of NIO buffers from the Netty channel outbound buffer and copy their bytes to a direct buffer. In the process we mutate the NIO buffer positions. It seems that Netty will continue to reuse these buffers, which means that any data not flushed in a call is lost. This commit fixes the problem by incrementing the positions only after the flush has completed. This matches the behavior that SocketChannel would have provided and that Netty relied upon. Fixes #45444.
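The mutation described in the fix can be illustrated with a small sketch (the class and method names below are hypothetical, not Elasticsearch's actual CopyBytesSocketChannel code). A plain `dst.put(src)` advances `src`'s position, so copying an outbound buffer into a direct buffer that way makes Netty believe those bytes were written even when the flush only wrote part of them. Copying through a `duplicate()` leaves the source untouched, and the position can then be advanced by the number of bytes the flush actually wrote:

```java
import java.nio.ByteBuffer;

public class CopySketch {

    /**
     * Copy up to dst.remaining() bytes from src into dst WITHOUT
     * mutating src's position. Returns the number of bytes copied;
     * the caller advances src only after the flush has completed.
     */
    static int copyWithoutMutating(ByteBuffer src, ByteBuffer dst) {
        int n = Math.min(src.remaining(), dst.remaining());
        ByteBuffer view = src.duplicate();   // shares content, independent position/limit
        view.limit(view.position() + n);
        dst.put(view);                       // advances view's position, not src's
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap("hello world".getBytes());
        ByteBuffer direct = ByteBuffer.allocateDirect(5);

        int copied = copyWithoutMutating(src, direct);
        // src's position is still 0: nothing is considered "written" yet.
        System.out.println("copied=" + copied + " srcPosition=" + src.position());

        // Pretend the socket flushed exactly `copied` bytes; advance src afterwards,
        // mirroring the fix of incrementing positions after the flush completes.
        src.position(src.position() + copied);
        System.out.println("remaining=" + src.remaining());
    }
}
```

Had the copy mutated `src` up front and the flush then written fewer bytes, the unflushed tail would be skipped on the next write, which is exactly the corruption reported below.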
@jakelandis @droberts195 @mikecote @danielmitterdorfer - I am still seeing this issue in my benchmarking test with Elasticsearch 7.17.1, the Python elasticsearch library 7.17.1 (on Python 2.7), and bulk updates of 5 million records issued from 8 processes via a multiprocessing pool. One of the processes always returns a JSONDecodeError when printing ApplyResult.get(). Comparing a correct result with the incorrect one returned by the Elasticsearch client, the ":" delimiter is missing after "_type" in the incorrect result, and that is what causes the JSONDecodeError.
In our nightly benchmarks we noticed response timeouts in several of the benchmarks since Friday's nightly run. Investigating the range of commits, we could narrow it down to e0f9d61. Starting with this commit we see corrupted responses (note the "_seq_noc" in the middle of the output).

Steps to reproduce:
This problem can be reproduced easily with Rally with the noaa track. Almost immediately after starting the benchmark, the client gets an error message indicating corrupted JSON in the HTTP response.
Additional notes

With --track-params="bulk_size:1000" the benchmark finishes just fine, so the problem does not seem to manifest with smaller requests / responses.
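For reference, the reproduction and the workaround above can be sketched as Rally invocations (the --pipeline and --target-hosts values are assumptions about a local test setup, not taken from the report):

```shell
# Baseline reproduction: the noaa track's default bulk size triggers the corruption.
esrally --track=noaa --pipeline=benchmark-only --target-hosts=localhost:9200

# With a smaller bulk size the benchmark finishes cleanly, as noted above.
esrally --track=noaa --pipeline=benchmark-only --target-hosts=localhost:9200 \
  --track-params="bulk_size:1000"
```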