
SIGSEGV on bulk upsert (10 parallel requests of 10'000 documents each) #61667

Closed
neseleznev opened this issue Aug 28, 2020 · 4 comments
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@neseleznev

Elasticsearch version (bin/elasticsearch --version): 7.9.0 and 7.8.1 (docker images docker.elastic.co/elasticsearch/elasticsearch:7.9.0 and ...:7.8.1 respectively)

Plugins installed: []

JVM version (java -version): 14.0.1+7 ― Provided with both docker images

OS version (uname -a if on a Unix-like system): Linux ... 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
While sending bulk upsert requests, a fatal error occurs and the container dies.

Steps to reproduce:

  1. I was uploading 10_000_000 documents in chunks of 10_000 across 10 parallel threads (1000 chunks overall).
    CPU load was around 800% the whole time, which is expected, because I assume 10 threads would ideally consume up to 1000% of CPU.
    [screenshot: CPU load graph]

  2. Suddenly, after ~6 million documents had been inserted, I hit JVM errors and the container stopped.
    [screenshot: JVM fatal error]
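The upload procedure in step 1 can be sketched as below. This is a reconstruction, not the reporter's actual code (the real run used a Java client, per a later comment): the index name `my-index`, the document shape, and the `send_chunk` placeholder are all assumptions; only the `_bulk` NDJSON format with `doc_as_upsert` is standard Elasticsearch API.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def bulk_upsert_body(index, docs):
    """Build an NDJSON body for the Elasticsearch _bulk API: for each
    document, one action line plus one update line with doc_as_upsert,
    so missing documents are created and existing ones are patched."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"

def chunks(seq, size):
    """Split a list into consecutive slices of `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def send_chunk(body):
    # Placeholder for the real call: POST `body` to <node>/_bulk with
    # Content-Type: application/x-ndjson and inspect the response.
    return len(body)

# Scaled down for illustration; the original run used 10_000_000 docs
# in chunks of 10_000 across 10 threads.
docs = [(str(i), {"value": i}) for i in range(1_000)]
with ThreadPoolExecutor(max_workers=10) as pool:
    sizes = list(pool.map(send_chunk,
                          (bulk_upsert_body("my-index", chunk)
                           for chunk in chunks(docs, 100))))
```

With this layout each thread keeps roughly one bulk request in flight, which matches the sustained ~800% CPU the reporter observed.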

Logs:
With 7.8.1 I got:
[screenshot: 7.8.1 crash log]

With 7.9.0 the output was a bit different. First there were logs about GC degradation, I suppose:

{"type": "server", "timestamp": "2020-08-28T00:48:27,563Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "25c1d19a493d", "message": "[gc][122] overhead, spent [264ms] collecting in the last [1s]", "cluster.uuid": "UgCEBjasTNWVwFbgCxG6Ew", "node.id": "6IqB6oKsRJKCu8j_Far2qg"  }

but then it failed:
[screenshot: JVM fatal error]

Same in text:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff3a4334812, pid=6, tid=182
#
# JRE version: OpenJDK Runtime Environment AdoptOpenJDK (14.0.1+7) (build 14.0.1+7)
# Java VM: OpenJDK 64-Bit Server VM AdoptOpenJDK (14.0.1+7, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x712812]  void G1ScanCardClosure::do_oop_work<unsigned int>(unsigned int*)+0x162
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /usr/share/elasticsearch/core.6)
#
# An error report file with more information is saved as:
# logs/hs_err_pid6.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
#

After a couple of attempts, I successfully inserted all 10_000_000 documents.

@neseleznev neseleznev added >bug needs:triage Requires assignment of a team area label labels Aug 28, 2020
@neseleznev
Author

Any help is appreciated. If the error is related to the JVM, I'll be happy to go report the issue there.
If so, what mitigations are possible? Are there containers of recent Elasticsearch with more stable Java versions available?

@neseleznev
Author

I tried other versions. The same happens with 7.7.1, which also bundles AdoptOpenJDK (14.0.1+7) (build 14.0.1+7).

The result is a bit different with 7.6.2 (13.0.2+8), but it also fails with a JRE error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f273d70c64a, pid=1, tid=95
#
# JRE version: OpenJDK Runtime Environment (13.0.2+8) (build 13.0.2+8)
# Java VM: OpenJDK 64-Bit Server VM (13.0.2+8, mixed mode, sharing, tiered, compressed oops, concurrent mark sweep gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xe8764a]  ContiguousSpace::object_iterate(ObjectClosure*)+0xba
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /usr/share/elasticsearch/core.1)
#
# An error report file with more information is saved as:
# logs/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
#

@neseleznev
Author

I also tried 7.5.2, despite the fact that we can't use versions older than 7.6.2 because of the Java client:
Spring Data Elasticsearch 4.0.2.RELEASE, which is bundled with Spring Boot starter 2.3.2.

It produces the warning Version mismatch in between Elasticsearch Client and Cluster: 7.6.2 - 7.5.2, and I believe it is not a good idea to use that in production.

So, good news: I upserted 9,989,999 of 9,999,999 documents, with only the last request failing. This time with an application error:

{
   "error":{
      "root_cause":[
         {
            "type":"circuit_breaking_exception",
            "reason":"[parent] Data too large, data for [<http_request>] would be [998525888/952.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [990325888/944.4mb], new bytes reserved: [8200000/7.8mb], usages [request=1556536/1.4mb, fielddata=0/0b, in_flight_requests=57400000/54.7mb, accounting=1522440/1.4mb]",
            "bytes_wanted":998525888,
            "bytes_limit":986061209,
            "durability":"TRANSIENT"
         }
      ],
      "type":"circuit_breaking_exception",
      "reason":"[parent] Data too large, data for [<http_request>] would be [998525888/952.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [990325888/944.4mb], new bytes reserved: [8200000/7.8mb], usages [request=1556536/1.4mb, fielddata=0/0b, in_flight_requests=57400000/54.7mb, accounting=1522440/1.4mb]",
      "bytes_wanted":998525888,
      "bytes_limit":986061209,
      "durability":"TRANSIENT"
   },
   "status":429
}

I don't really understand where "bytes_wanted":998525888 comes from, because as far as I can see my index has size 475Mi,
but anyway, it feels like an issue that can be overcome, perhaps by increasing some parameter or splitting into smaller chunks.
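For what it's worth, the bytes_wanted figure matches heap accounting rather than index size: in 7.x the parent breaker compares real JVM heap usage plus the bytes newly reserved for the incoming request against `indices.breaker.total.limit` (95% of heap by default when the real-memory breaker is enabled). A minimal sketch using the numbers from the error payload above:

```python
# Figures taken verbatim from the circuit_breaking_exception payload.
real_usage = 990_325_888   # current JVM heap usage seen by the breaker ("real usage")
new_bytes  = 8_200_000     # bytes reserved for this bulk request (~820 B/doc x 10_000 docs)
limit      = 986_061_209   # parent breaker limit, 95% of heap by default in 7.x

bytes_wanted = real_usage + new_bytes   # what the request "would be"
would_trip = bytes_wanted > limit       # True -> the request is rejected with HTTP 429
print(bytes_wanted, would_trip)
```

So the breaker trips on total heap pressure, not on the size of any one request or the on-disk index; smaller chunks, fewer concurrent requests, or a larger heap all reduce the chance of hitting it.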

Am I misusing Elasticsearch somehow? What is the legitimate way to upsert 10 million documents?
Once again, any help is appreciated.
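One common pattern for this situation (a sketch, not from this thread; the function names and the status-code protocol are assumptions) is to treat the 429 from the circuit breaker as a back-pressure signal and retry the chunk with exponential backoff instead of failing the whole upload:

```python
import time

def bulk_with_backoff(send, body, max_retries=5, delay=1.0):
    """Retry a bulk request while the cluster answers 429 (breaker tripped).

    `send` stands in for whatever client call performs the bulk request
    and returns an HTTP status code.
    """
    for attempt in range(max_retries):
        status = send(body)
        if status != 429:
            return status
        time.sleep(delay)  # let in-flight requests drain and the heap recover
        delay *= 2         # exponential backoff before the next attempt
    raise RuntimeError("bulk request kept tripping the circuit breaker")
```

Since the breaker's durability here is TRANSIENT, a retried request generally succeeds once concurrent in-flight requests have drained.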

@original-brownbear
Member

Hi @neseleznev,

The SIGSEGVs do not necessarily look like JVM bugs, but rather like an issue with your system (faulty RAM looks like the most likely culprit here). I don't think there's anything we can do here; diagnosing this and/or helping with correctly configuring/sizing the circuit breaker is more of a user question, I'm afraid.
We'd like to direct these kinds of things to the Elasticsearch forum. If you can stop by there, we'd appreciate it. This allows us to use GitHub for verified bug reports, feature requests, and pull requests.

There's an active community in the forum that should be able to help you get an answer to your question. As such, I hope you don't mind that I close this.
