
SIGSEGV on bulk upsert (10 parallel requests of 10'000 documents each) #61667

Closed
neseleznev opened this issue Aug 28, 2020 · 4 comments
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@neseleznev

Elasticsearch version (bin/elasticsearch --version): 7.9.0 and 7.8.1 (docker images docker.elastic.co/elasticsearch/elasticsearch:7.9.0 and ...:7.8.1 respectively)

Plugins installed: []

JVM version (java -version): 14.0.1+7 ― Provided with both docker images

OS version (uname -a if on a Unix-like system): Linux ... 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
While sending bulk upsert requests, a fatal error occurs and the container dies.

Steps to reproduce:

  1. I was uploading 10_000_000 documents in chunks of 10_000 across 10 parallel threads (1000 chunks overall).
    CPU load was around 800% the whole time, which is expected, because I assume 10 threads would ideally consume up to 1000% of CPU.
    [screenshot: CPU load graph]

  2. Suddenly, after ~6 million documents had been inserted, I hit JVM errors and the container stopped.
    [screenshot: JVM fatal error]
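The upload procedure in step 1 can be sketched as below. This is a reconstruction, not the reporter's actual code (the real run used a Java client, per a later comment): the index name `my-index`, the document shape, and the `send_chunk` placeholder are all assumptions; only the `_bulk` NDJSON format with `doc_as_upsert` is standard Elasticsearch API.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def bulk_upsert_body(index, docs):
    """Build an NDJSON body for the Elasticsearch _bulk API: for each
    document, one action line plus one update line with doc_as_upsert,
    so missing documents are created and existing ones are patched."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"

def chunks(seq, size):
    """Split a list into consecutive slices of `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def send_chunk(body):
    # Placeholder for the real call: POST `body` to <node>/_bulk with
    # Content-Type: application/x-ndjson and inspect the response.
    return len(body)

# Scaled down for illustration; the original run used 10_000_000 docs
# in chunks of 10_000 across 10 threads.
docs = [(str(i), {"value": i}) for i in range(1_000)]
with ThreadPoolExecutor(max_workers=10) as pool:
    sizes = list(pool.map(send_chunk,
                          (bulk_upsert_body("my-index", chunk)
                           for chunk in chunks(docs, 100))))
```

With this layout each thread keeps roughly one bulk request in flight, which matches the sustained ~800% CPU the reporter observed.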

Logs:
With 7.8.1 I got:
[screenshot: 7.8.1 crash log]

With 7.9.0 the output was a bit different. First there were logs about GC degradation, I suppose:

{"type": "server", "timestamp": "2020-08-28T00:48:27,563Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "25c1d19a493d", "message": "[gc][122] overhead, spent [264ms] collecting in the last [1s]", "cluster.uuid": "UgCEBjasTNWVwFbgCxG6Ew", "node.id": "6IqB6oKsRJKCu8j_Far2qg"  }

but then it failed:
[screenshot: JVM fatal error]

Same in text:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff3a4334812, pid=6, tid=182
#
# JRE version: OpenJDK Runtime Environment AdoptOpenJDK (14.0.1+7) (build 14.0.1+7)
# Java VM: OpenJDK 64-Bit Server VM AdoptOpenJDK (14.0.1+7, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x712812]  void G1ScanCardClosure::do_oop_work<unsigned int>(unsigned int*)+0x162
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /usr/share/elasticsearch/core.6)
#
# An error report file with more information is saved as:
# logs/hs_err_pid6.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
#

After a couple of attempts, I successfully inserted all 10_000_000 documents.

@neseleznev neseleznev added >bug needs:triage Requires assignment of a team area label labels Aug 28, 2020
@neseleznev
Author

Any help is appreciated. If the error is related to the JVM, I'll be happy to go report the issue there.
If so, what mitigations are possible? Are there containers of recent Elasticsearch with more stable Java versions available?

@neseleznev
Author

I tried other versions. The same happens with 7.7.1, which also bundles AdoptOpenJDK (14.0.1+7) (build 14.0.1+7).

The result is a bit different with 7.6.2 (13.0.2+8), but it also fails with a JRE error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f273d70c64a, pid=1, tid=95
#
# JRE version: OpenJDK Runtime Environment (13.0.2+8) (build 13.0.2+8)
# Java VM: OpenJDK 64-Bit Server VM (13.0.2+8, mixed mode, sharing, tiered, compressed oops, concurrent mark sweep gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xe8764a]  ContiguousSpace::object_iterate(ObjectClosure*)+0xba
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /usr/share/elasticsearch/core.1)
#
# An error report file with more information is saved as:
# logs/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
#

@neseleznev
Author

I also tried 7.5.2, despite the fact that we can't use versions older than 7.6.2 because of the Java client:
Spring Data Elasticsearch 4.0.2.RELEASE, which is bundled with Spring Boot starter 2.3.2.

It produces the warning Version mismatch in between Elasticsearch Client and Cluster: 7.6.2 - 7.5.2, and I believe it is not a good idea to use that in production.

So, good news: I upserted 9,989,999 of 9,999,999 documents, with only the last request failing. This time with an application error:

{
   "error":{
      "root_cause":[
         {
            "type":"circuit_breaking_exception",
            "reason":"[parent] Data too large, data for [<http_request>] would be [998525888/952.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [990325888/944.4mb], new bytes reserved: [8200000/7.8mb], usages [request=1556536/1.4mb, fielddata=0/0b, in_flight_requests=57400000/54.7mb, accounting=1522440/1.4mb]",
            "bytes_wanted":998525888,
            "bytes_limit":986061209,
            "durability":"TRANSIENT"
         }
      ],
      "type":"circuit_breaking_exception",
      "reason":"[parent] Data too large, data for [<http_request>] would be [998525888/952.2mb], which is larger than the limit of [986061209/940.3mb], real usage: [990325888/944.4mb], new bytes reserved: [8200000/7.8mb], usages [request=1556536/1.4mb, fielddata=0/0b, in_flight_requests=57400000/54.7mb, accounting=1522440/1.4mb]",
      "bytes_wanted":998525888,
      "bytes_limit":986061209,
      "durability":"TRANSIENT"
   },
   "status":429
}

I don't really understand where "bytes_wanted":998525888 comes from, because as far as I can see my index has size 475Mi,
but anyway, it feels like an issue that can be overcome, perhaps by increasing some parameter or splitting into smaller chunks.
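For what it's worth, the bytes_wanted figure matches heap accounting rather than index size: in 7.x the parent breaker compares real JVM heap usage plus the bytes newly reserved for the incoming request against `indices.breaker.total.limit` (95% of heap by default when the real-memory breaker is enabled). A minimal sketch using the numbers from the error payload above:

```python
# Figures taken verbatim from the circuit_breaking_exception payload.
real_usage = 990_325_888   # current JVM heap usage seen by the breaker ("real usage")
new_bytes  = 8_200_000     # bytes reserved for this bulk request (~820 B/doc x 10_000 docs)
limit      = 986_061_209   # parent breaker limit, 95% of heap by default in 7.x

bytes_wanted = real_usage + new_bytes   # what the request "would be"
would_trip = bytes_wanted > limit       # True -> the request is rejected with HTTP 429
print(bytes_wanted, would_trip)
```

So the breaker trips on total heap pressure, not on the size of any one request or the on-disk index; smaller chunks, fewer concurrent requests, or a larger heap all reduce the chance of hitting it.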

Am I misusing Elasticsearch somehow? What is the legitimate way to upsert 10 million documents?
Once again, any help is appreciated.
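One common pattern for this situation (a sketch, not from this thread; the function names and the status-code protocol are assumptions) is to treat the 429 from the circuit breaker as a back-pressure signal and retry the chunk with exponential backoff instead of failing the whole upload:

```python
import time

def bulk_with_backoff(send, body, max_retries=5, delay=1.0):
    """Retry a bulk request while the cluster answers 429 (breaker tripped).

    `send` stands in for whatever client call performs the bulk request
    and returns an HTTP status code.
    """
    for attempt in range(max_retries):
        status = send(body)
        if status != 429:
            return status
        time.sleep(delay)  # let in-flight requests drain and the heap recover
        delay *= 2         # exponential backoff before the next attempt
    raise RuntimeError("bulk request kept tripping the circuit breaker")
```

Since the breaker's durability here is TRANSIENT, a retried request generally succeeds once concurrent in-flight requests have drained.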

@original-brownbear
Member

Hi @neseleznev,

The SIGSEGVs do not necessarily look like JVM bugs, but rather like an issue with your system (faulty RAM looks like the most likely culprit here). I don't think there's anything we can do here; diagnosing this and/or helping with correctly configuring/sizing the circuit breaker is more of a user question, I'm afraid.
We'd like to direct these kinds of things to the Elasticsearch forum. If you can stop by there, we'd appreciate it. This allows us to use GitHub for verified bug reports, feature requests, and pull requests.

There's an active community in the forum that should be able to help you get an answer to your question. As such, I hope you don't mind that I close this.
