CopyBytesSocketChannel corrupts messages under load #45444
Pinging @elastic/es-distributed
Currently we take the array of NIO buffers from the Netty channel outbound buffer and copy their bytes to a direct buffer. In the process we mutate the NIO buffer positions. It seems that Netty will continue to reuse these buffers, which means that any data not flushed in a call is lost. This commit fixes the problem by incrementing the positions only after the flush has completed. This matches the behavior that SocketChannel would have provided and that Netty relied upon. Fixes #45444.
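The mutation described in the fix can be illustrated with a small sketch (the class and method names below are hypothetical, not Elasticsearch's actual CopyBytesSocketChannel code). A plain `dst.put(src)` advances `src`'s position, so copying an outbound buffer into a direct buffer that way makes Netty believe those bytes were written even when the flush only wrote part of them. Copying through a `duplicate()` leaves the source untouched, and the position can then be advanced by the number of bytes the flush actually wrote:

```java
import java.nio.ByteBuffer;

public class CopySketch {

    /**
     * Copy up to dst.remaining() bytes from src into dst WITHOUT
     * mutating src's position. Returns the number of bytes copied;
     * the caller advances src only after the flush has completed.
     */
    static int copyWithoutMutating(ByteBuffer src, ByteBuffer dst) {
        int n = Math.min(src.remaining(), dst.remaining());
        ByteBuffer view = src.duplicate();   // shares content, independent position/limit
        view.limit(view.position() + n);
        dst.put(view);                       // advances view's position, not src's
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap("hello world".getBytes());
        ByteBuffer direct = ByteBuffer.allocateDirect(5);

        int copied = copyWithoutMutating(src, direct);
        // src's position is still 0: nothing is considered "written" yet.
        System.out.println("copied=" + copied + " srcPosition=" + src.position());

        // Pretend the socket flushed exactly `copied` bytes; advance src afterwards,
        // mirroring the fix of incrementing positions after the flush completes.
        src.position(src.position() + copied);
        System.out.println("remaining=" + src.remaining());
    }
}
```

Had the copy mutated `src` up front and the flush then written fewer bytes, the unflushed tail would be skipped on the next write, which is exactly the corruption reported below.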
@jakelandis @droberts195 @mikecote @danielmitterdorfer - I am still seeing this issue in my benchmarking test with Elasticsearch 7.17.1, the Python elasticsearch library 7.17.1 (on Python 2.7), and bulk updates of 5 million records issued from 8 processes via a multiprocessing pool. One of the processes always returns a JSONDecodeError when printing ApplyResult.get(). Comparing a correct result with the incorrect one returned by the Elasticsearch client, the ":" delimiter is missing after "_type" in the incorrect result, and that is what causes the JSONDecodeError.
In our nightly benchmarks we noticed response timeouts in several of the benchmarks since Friday's nightly run. Investigating the range of commits, we could narrow it down to e0f9d61. Starting with this commit we see corrupted responses (note the "_seq_noc" in the middle of the output).

Steps to reproduce:
This problem can be reproduced easily with Rally with the noaa track. Almost immediately after starting the benchmark, the client gets an error message indicating corrupted JSON in the HTTP response.
Additional notes

With --track-params="bulk_size:1000" the benchmark finishes just fine, so the problem does not seem to manifest with smaller requests / responses.
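For reference, the reproduction and the workaround above can be sketched as Rally invocations (the --pipeline and --target-hosts values are assumptions about a local test setup, not taken from the report):

```shell
# Baseline reproduction: the noaa track's default bulk size triggers the corruption.
esrally --track=noaa --pipeline=benchmark-only --target-hosts=localhost:9200

# With a smaller bulk size the benchmark finishes cleanly, as noted above.
esrally --track=noaa --pipeline=benchmark-only --target-hosts=localhost:9200 \
  --track-params="bulk_size:1000"
```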