[BUG] Searchable Snapshot: Search hangs when parallel searches to same remote index #6295

Closed
dgilling opened this issue Feb 13, 2023 · 16 comments
Assignees
Labels
bug (Something isn't working), distributed framework

Comments

@dgilling

dgilling commented Feb 13, 2023

Describe the bug
When performing an aggregation on a nested field on a searchable snapshot (where storage_type: remote_snapshot), the search task hangs for days if no timeout is defined (the default behavior). This can block the node from handling future searches once the search thread queue fills up.

To Reproduce
Steps to reproduce the behavior:

  1. Create a document which contains a nested field
  2. Restore the index as a remote_snapshot (a restore sketch is shown after the query example below)
  3. Perform a terms agg on a nested field (see example below)
  4. The search tasks will keep running and never complete
{
    "aggs": {
        "entTmSrs": {
            "filter": {
                "match_all": {}
            },
            "aggs": {
                "_nest_agg": {
                    "nested": {
                        "path": "nested_doc"
                    },
                    "aggs": {
                        "_key_match": {
                            "filter": {
                                "term": {
                                    "nested_doc.key": "some_value"
                                }
                            },
                            "aggs": {
                                "nested_doc.status": {
                                    "terms": {
                                        "field": "nested_doc.some_field",
                                        "size": 5,
                                        "min_doc_count": 1,
                                        "missing": "(none)"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "size": 0
}
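
For step 2, the restore request looked roughly like the following (repository, snapshot, and index names are placeholders; the key part is "storage_type": "remote_snapshot"):

curl -XPOST 'localhost:9200/_snapshot/my_repo/my_snapshot/_restore' -H 'Content-Type: application/json' -d '
{
  "indices": "my_index",
  "storage_type": "remote_snapshot",
  "rename_pattern": "(.+)",
  "rename_replacement": "remote_$1"
}'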

Expected behavior
The search request should complete, or have a reasonable default timeout so it does not deadlock future searches on the node.
The same query on a local index (not remote_snapshot) takes <100ms. We expect the queries to take longer, but not two days with continuous retrying.
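
For reference, a timeout can be set per request or cluster-wide; something along these lines (the 30s value and field name are illustrative, and the cluster setting name assumes the ES-era search.default_search_timeout carried over):

curl -XPOST 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d '
{
  "timeout": "30s",
  "size": 0,
  "aggs": { "by_field": { "terms": { "field": "some_field" } } }
}'

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": { "search.default_search_timeout": "30s" }
}'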


Dump of stuck tasks

curl localhost:9200/_cat/tasks
indices:data/read/search              VCwgCfiNTBKetKNgHY9j5A:12586568 -                               transport 1676066479789 22:01:19 1.9d        10.2.0.18 763d0e53942b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275069  VCwgCfiNTBKetKNgHY9j5A:12586568 transport 1676066507519 22:01:47 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              B4qWdIFETWe3AglIN4-Krg:13826364 -                               transport 1676066490202 22:01:30 1.9d        10.2.0.6  9cb9b11d3b13
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275062  B4qWdIFETWe3AglIN4-Krg:13826364 transport 1676066506248 22:01:46 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              xYZuUqqDQACxQraz2oF_Rw:10224070 -                               transport 1676066501292 22:01:41 1.9d        10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275074  xYZuUqqDQACxQraz2oF_Rw:10224070 transport 1676066507805 22:01:47 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              FeQkJipvT-qgPloq89yosw:19005585 -                               transport 1676066502092 22:01:42 1.9d        10.2.0.3  es-master2
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275098  FeQkJipvT-qgPloq89yosw:19005585 transport 1676066508604 22:01:48 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              CioczzjnTKSQEh8uLvEpgA:5319933  -                               transport 1676066502318 22:01:42 1.9d        10.2.0.22 f694462bab56
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275096  CioczzjnTKSQEh8uLvEpgA:5319933  transport 1676066508514 22:01:48 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              8qSnQ7U4SFK-7_MPyvawow:4579861  -                               transport 1676066530687 22:02:10 1.9d        10.2.0.25 5bbc0ebfd5b7
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275310  8qSnQ7U4SFK-7_MPyvawow:4579861  transport 1676066537199 22:02:17 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              xYZuUqqDQACxQraz2oF_Rw:10224230 -                               transport 1676066531030 22:02:11 1.9d        10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275314  xYZuUqqDQACxQraz2oF_Rw:10224230 transport 1676066537542 22:02:17 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              FeQkJipvT-qgPloq89yosw:19006130 -                               transport 1676066531796 22:02:11 1.9d        10.2.0.3  es-master2
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275329  FeQkJipvT-qgPloq89yosw:19006130 transport 1676066538308 22:02:18 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              xYZuUqqDQACxQraz2oF_Rw:10224240 -                               transport 1676066532799 22:02:12 1.9d        10.2.0.10 c2492386389b
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275350  xYZuUqqDQACxQraz2oF_Rw:10224240 transport 1676066539311 22:02:19 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              8qSnQ7U4SFK-7_MPyvawow:4579910  -                               transport 1676066533031 22:02:13 1.9d        10.2.0.25 5bbc0ebfd5b7
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275354  8qSnQ7U4SFK-7_MPyvawow:4579910  transport 1676066539543 22:02:19 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              iaqRNoTCStaRJNbnv2S7Sw:5330461  -                               transport 1676066540623 22:02:20 1.9d        10.2.0.8  b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275402  iaqRNoTCStaRJNbnv2S7Sw:5330461  transport 1676066546819 22:02:26 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              iaqRNoTCStaRJNbnv2S7Sw:5330467  -                               transport 1676066540646 22:02:20 1.9d        10.2.0.8  b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275404  iaqRNoTCStaRJNbnv2S7Sw:5330467  transport 1676066546842 22:02:26 1.9d        10.2.0.19 2616efc46d6b
indices:data/read/search              iaqRNoTCStaRJNbnv2S7Sw:5330494  -                               transport 1676066545520 22:02:25 1.9d        10.2.0.8  b6c0a935adb0
indices:data/read/search[phase/query] pyR8oklXQpGgC2lYXgEDOg:5275419  iaqRNoTCStaRJNbnv2S7Sw:5330494  transport 1676066551716 22:02:31 1.9d        10.2.0.19 2616efc46d6b

Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version: OpenSearch 2.4

Additional context
Using Azure Blob Storage as the snapshot repo.

@dgilling
Author

Update: Seems to happen consistently when multiple searches are performed in parallel against the same index or segments. ES 7 also had an issue: elastic/elasticsearch#85239

@andrross
Member

Update: Seems to happen consistently when multiple searches are performed in parallel against the same index or segments. ES 7 also had an issue: elastic/elasticsearch#85239

Thanks @dgilling! Does this mean you can reproduce this without the nested aggregation?

@dgilling dgilling changed the title [BUG] Searchable Snapshot: Search hangs when terms agg on nested field [BUG] Searchable Snapshot: Search hangs when parallel searches to same remote index Feb 21, 2023
@dgilling
Author

dgilling commented Feb 21, 2023

@andrross, yes, it's reproduced regardless of the nested aggregation. It's possible the nested aggregation just helped trigger the scenario because of its longer search time. It usually happens when at least 12 to 15 searches hit the same remote index. Artificially staggering the searches with enough time (a few seconds of delay between each search) helps reduce the occurrence.

I also see it regardless of whether Azure Blob Storage or AWS S3 is the backing remote store.

I did look into the ES 7 issue and applied a hotfix; it doesn't look like we're hitting that exception condition, and I don't see any other exceptions with TRACE logs.
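
For reference, TRACE logging can be enabled via the dynamic logger settings, roughly like this (the logger package is a guess at the relevant remote-store code):

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "logger.org.opensearch.index.store.remote": "TRACE" }
}'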

@dgilling
Author

Updated title to reflect issue

@dgilling
Author

@andrross Should also note that attempts to cancel the tasks or set a timeout don't seem to clean them up. Only a restart of the nodes with the search role clears out the stuck tasks.
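
For reference, task cancellation goes through the tasks API, e.g. (the task ID shown is one from the dump above):

curl -XPOST 'localhost:9200/_tasks/VCwgCfiNTBKetKNgHY9j5A:12586568/_cancel'

curl -XPOST 'localhost:9200/_tasks/_cancel?actions=*search*'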

@andrross
Member

I was able to reproduce this by loading the nyc_taxis dataset using opensearch-benchmark, creating a searchable snapshot index from it, and then running the following script to kick off 100 concurrent searches:

#!/bin/sh

for i in `seq 100`; do
  curl localhost:9200/remote_nyc_taxis/_search -H 'Content-Type:application/json' -d '{"size":0,"aggs":{"response_codes":{"terms":{"field":"payment_type"}}}}' && echo "----->$i" &
done

I took a thread dump from the resulting deadlock and it does indeed look like the issue detailed in #6437. I applied the fix from PR #6467 and was not able to reproduce the failure.
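
(A thread dump like the one referenced here can be captured with the standard JDK tools, for example; <opensearch-pid> is a placeholder:)

jstack -l <opensearch-pid> > opensearch-threads.txt
# or, equivalently
jcmd <opensearch-pid> Thread.print > opensearch-threads.txt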

@kartg
Member

kartg commented Feb 23, 2023

@andrross should this be closed as a dupe of #6437 ?

@andrross
Member

@kartg Let's keep this open until we get full verification of the fix

@dgilling
Author

Thanks @andrross for the quick fix.

The good news:

  • We backported the fix to our local 2.6 branch. It does indeed fix the deadlock issue in our brief testing, and we no longer see the search task queue filling up.

The bad news:

  • Could be a possible memory leak. We're seeing native OS memory allocation errors every 20 minutes or so, usually when pulling aggs on field data. The machine had roughly 40GB free when it occurred and was not overcommitted. It could very well be unrelated to this fix, as we were originally testing the deadlock on 2.4.1. We'll do some more testing to see if it is a 2.6 issue.

Example 1:

---------------  T H R E A D  ---------------

Current thread (0x00007ff730103050):  JavaThread "opensearch[622242d606c1][fetch_shard_started][T#6]" daemon [_thread_new, id=876, stack(0x00007ff688dde000,0x00007ff688edf000)]

Stack: [0x00007ff688dde000,0x00007ff688edf000],  sp=0x00007ff688edd990,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xede141]  VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x1a1
V  [libjvm.so+0xeded0d]  VMError::report_and_die(Thread*, char const*, int, unsigned long, VMErrorType, char const*, __va_list_tag*)+0x2d
V  [libjvm.so+0x606e43]  report_vm_out_of_memory(char const*, int, unsigned long, VMErrorType, char const*, ...)+0xc3
V  [libjvm.so+0xc15ab8]  os::pd_commit_memory(char*, unsigned long, bool)+0xd8
V  [libjvm.so+0xc0f3ef]  os::commit_memory(char*, unsigned long, bool)+0x1f
V  [libjvm.so+0xd965d7]  StackOverflow::create_stack_guard_pages()+0x57
V  [libjvm.so+0xe5e9c1]  JavaThread::run()+0x31
V  [libjvm.so+0xe62020]  Thread::call_run()+0xc0
V  [libjvm.so+0xc187e1]  thread_native_entry(Thread*)+0xe1

Example 2:


@dgilling dgilling closed this as completed Mar 2, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in Searchable Snapshots Mar 2, 2023
@dgilling dgilling reopened this Mar 2, 2023
@github-project-automation github-project-automation bot moved this from Done to In Progress in Searchable Snapshots Mar 2, 2023
@dgilling
Author

dgilling commented Mar 2, 2023

@andrross @kartg For some more context, I did some more testing today with the official 2.6.0 image released yesterday.

  1. Crash happens on 2.6.0 regardless of whether the fix from [BUG] [Searchable Snapshots] Potential deadlock in ConcurrentInvocationLinearizer #6437 is included
  2. Crash happens even if the JDK is different (to rule out a JDK issue). Tested with OpenJDK 64-Bit Server VM (Red_Hat-17.0.5.0.8-2.el7openjdkportable) (build 17.0.5+8-LTS, mixed mode, sharing)
  3. Looks to happen when doing a terms aggregation on a remote index. Haven't explored the code path much.
  4. The crash is sometimes a segfault and sometimes a malloc OOM error.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
#   JVM is running with Zero Based Compressed Oops mode in which the Java heap is
#     placed in the first 32GB address space. The Java Heap base address is the
#     maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
#     to set the Java Heap base and to place the Java Heap above 32GB virtual address.
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2787), pid=1, tid=278
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-17.0.5.0.8-2.el7openjdkportable) (17.0.5+8) (build 17.0.5+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-17.0.5.0.8-2.el7openjdkportable) (17.0.5+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /usr/share/opensearch/core.1)
#
# JFR recording file will be written. Location: /usr/share/opensearch/hs_err_pid1.jfr
#

---------------  S U M M A R Y ------------

Host: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 20 cores, 157G, Amazon Linux release 2 (Karoo)
Time: Wed Mar  1 22:01:47 2023 UTC elapsed time: 620.796304 seconds (0d 0h 10m 20s)

---------------  T H R E A D  ---------------

Current thread (0x00007f2819138670):  JavaThread "C1 CompilerThread0" daemon [_thread_in_vm, id=278, stack(0x00007f27ac7f7000,0x00007f27ac8f8000)]


Current CompileTask:
C1: 620796 34468       3       org.datadog.jmxfetch.JmxAttribute::getBeanParametersList (302 bytes)

Stack: [0x00007f27ac7f7000,0x00007f27ac8f8000],  sp=0x00007f27ac8f5bf0,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xf4e5b2]  VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x1a2
V  [libjvm.so+0xf4f2bb]  VMError::report_and_die(Thread*, char const*, int, unsigned long, VMErrorType, char const*, __va_list_tag*)+0x2b
V  [libjvm.so+0x615bf6]  report_vm_out_of_memory(char const*, int, unsigned long, VMErrorType, char const*, ...)+0xd6
V  [libjvm.so+0xc0d21a]  os::pd_commit_memory(char*, unsigned long, unsigned long, bool)+0xda
V  [libjvm.so+0xc0624e]  os::commit_memory(char*, unsigned long, unsigned long, bool)+0x2e
V  [libjvm.so+0xf4600b]  VirtualSpace::expand_by(unsigned long, bool)+0x15b
V  [libjvm.so+0x7d23f6]  CodeHeap::expand_by(unsigned long)+0x96
V  [libjvm.so+0x5b147a]  CodeCache::allocate(int, int, bool, int)+0x9a
V  [libjvm.so+0xbcab42]  nmethod::new_nmethod(methodHandle const&, int, int, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, int, GrowableArrayView<RuntimeStub*> const&, char*, int, int, char const*, FailedSpeculation**)+0x212
V  [libjvm.so+0x54eb2f]  ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, bool, bool, RTMState, GrowableArrayView<RuntimeStub*> const&)+0x41f
V  [libjvm.so+0x472c5c]  Compilation::compile_method()+0x33c
V  [libjvm.so+0x472efb]  Compilation::Compilation(AbstractCompiler*, ciEnv*, ciMethod*, int, BufferBlob*, bool, DirectiveSet*)+0x22b
V  [libjvm.so+0x473a03]  Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0xc3
V  [libjvm.so+0x5e9eb1]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xea1
V  [libjvm.so+0x5eaae8]  CompileBroker::compiler_thread_loop()+0x508
V  [libjvm.so+0xed0737]  JavaThread::thread_main_inner()+0xc7
V  [libjvm.so+0xed379d]  Thread::call_run()+0x6d
V  [libjvm.so+0xc0fe11]  thread_native_entry(Thread*)+0xe1


Java Threads: ( => current thread )
  0x00007f281912d390 JavaThread "Reference Handler" daemon [_thread_blocked, id=272, stack(0x00007f27acdfd000,0x00007f27acefe000)]
  0x00007f281912e770 JavaThread "Finalizer" daemon [_thread_blocked, id=273, stack(0x00007f27accfc000,0x00007f27acdfd000)]
  0x00007f2819132f50 JavaThread "Signal Dispatcher" daemon [_thread_blocked, id=274, stack(0x00007f27acbfb000,0x00007f27accfc000)]
  0x00007f2819134300 JavaThread "Service Thread" daemon [_thread_blocked, id=275, stack(0x00007f27acafa000,0x00007f27acbfb000)]
  0x00007f2819135710 JavaThread "Monitor Deflation Thread" daemon [_thread_blocked, id=276, stack(0x00007f27ac9f9000,0x00007f27acafa000)]
  0x00007f2819137140 JavaThread "C2 CompilerThread0" daemon [_thread_blocked, id=277, stack(0x00007f27ac8f8000,0x00007f27ac9f9000)]
=>0x00007f2819138670 JavaThread "C1 CompilerThread0" daemon [_thread_in_vm, id=278, stack(0x00007f27ac7f7000,0x00007f27ac8f8000)]
  0x00007f2819139ae0 JavaThread "Sweeper thread" daemon [_thread_blocked, id=279, stack(0x00007f27ac6f6000,0x00007f27ac7f7000)]
  0x00007f2819144d70 JavaThread "Common-Cleaner" daemon [_thread_blocked, id=280, stack(0x00007f27ac5f5000,0x00007f27ac6f6000)]
  0x00007f2819870b90 JavaThread "dd-task-scheduler" daemon [_thread_blocked, id=284, stack(0x00007f273eaff000,0x00007f273ec00000)]
  0x00007f281a3320a0 JavaThread "OkHttp ConnectionPool" daemon [_thread_blocked, id=288, stack(0x00007f2727eff000,0x00007f2728000000)]
  0x00007f281a3342a0 JavaThread "Okio Watchdog" daemon [_thread_blocked, id=289, stack(0x00007f2727dfe000,0x00007f2727eff000)]
  0x00007f281a346510 JavaThread "dd-trace-monitor" daemon [_thread_blocked, id=290, stack(0x00007f2727cfd000,0x00007f2727dfe000)]
  0x00007f281a345260 JavaThread "dd-trace-processor" daemon [_thread_blocked, id=291, stack(0x00007f2727bfc000,0x00007f2727cfd000)]
  0x00007f281a3c52c0 JavaThread "dd-remote-config" daemon [_thread_blocked, id=292, stack(0x00007f2727afb000,0x00007f2727bfc000)]
  0x00007f281a3d7bc0 JavaThread "dd-telemetry" daemon [_thread_blocked, id=293, stack(0x00007f27279fa000,0x00007f2727afb000)]
  0x00007f281a405860 JavaThread "dd-profiler-recording-scheduler" daemon [_thread_blocked, id=294, stack(0x00007f27278f9000,0x00007f27279fa000)]
  0x00007f281a3f7bc0 JavaThread "Notification Thread" daemon [_thread_blocked, id=295, stack(0x00007f27277f8000,0x00007f27278f9000)]
  0x00007f281ba14120 JavaThread "ScheduledMetricCollectorsExecutor" [_thread_blocked, id=308, stack(0x00007f27275f5000,0x00007f27276f6000)]
  0x00007f281bb511c0 JavaThread "pool-3-thread-1" [_thread_blocked, id=309, stack(0x00007f27264e0000,0x00007f27265e1000)]
  0x00007f281bb686d0 JavaThread "Thread-5" [_thread_blocked, id=310, stack(0x00007f2725aff000,0x00007f2725c00000)]
  0x00007f26d84aede0 JavaThread "opensearch[5410a48b36ce][[timer]]" daemon [_thread_blocked, id=313, stack(0x00007f27ac4f4000,0x00007f27ac5f5000)]
  0x00007f26d84b62f0 JavaThread "opensearch[5410a48b36ce][scheduler][T#1]" daemon [_thread_blocked, id=314, stack(0x00007f273f485000,0x00007f273f586000)]
  0x00007f26d8e61b60 JavaThread "commons-pool-evictor" daemon [_thread_blocked, id=323, stack(0x00007f27243fc000,0x00007f27244fd000)]
  0x00007f26d9bde1c0 JavaThread "opensearch[5410a48b36ce][transport_worker][T#1]" daemon [_thread_in_native, id=326, stack(0x00007f273c809000,0x00007f273c90a000)]
  0x00007f273026a230 JavaThread "opensearch[5410a48b36ce][generic][T#1]" daemon [_thread_blocked, id=332, stack(0x00007f27241fa000,0x00007f27242fb000)]
  0x00007f273013d100 JavaThread "opensearch[5410a48b36ce][generic][T#2]" daemon [_thread_blocked, id=333, stack(0x00007f273c90a000,0x00007f273ca0b000)]
  0x00007f2730153eb0 JavaThread "opensearch[5410a48b36ce][generic][T#3]" daemon [_thread_blocked, id=336, stack(0x00007f27246ff000,0x00007f2724800000)]
  0x00007f2730151650 JavaThread "opensearch[5410a48b36ce][generic][T#4]" daemon [_thread_blocked, id=337, stack(0x00007f27265e1000,0x00007f27266e2000)]
  0x00007f27340387c0 JavaThread "statsd-aggregator-thread" daemon [_thread_blocked, id=344, stack(0x00007f27258fd000,0x00007f27259fe000)]
  0x00007f273403fd90 JavaThread "StatsD-Processor-1" daemon [_thread_blocked, id=345, stack(0x00007f27245fe000,0x00007f27246ff000)]
  0x00007f2734040b60 JavaThread "StatsD-Sender-1" daemon [_thread_blocked, id=346, stack(0x00007f27244fd000,0x00007f27245fe000)]
  0x00007f27340fe010 JavaThread "dd-jmx-collector" daemon [_thread_blocked, id=357, stack(0x00007f26c00fd000,0x00007f26c01fe000)]
  0x00007f281bd887b0 JavaThread "opendistro_job_sweeper[T#1]" daemon [_thread_blocked, id=380, stack(0x00007f27259fe000,0x00007f2725aff000)]
  0x00007f281a4fb010 JavaThread "opensearch[5410a48b36ce][clusterApplierService#updateTask][T#1]" daemon [_thread_blocked, id=384, stack(0x00007f26c01fe000,0x00007f26c02ff000)]
  0x00007f26b40092f0 JavaThread "opensearch[5410a48b36ce][transport_worker][T#4]" daemon [_thread_in_native, id=385, stack(0x00007f2697eff000,0x00007f2698000000)]
  0x00007f26dc00ab80 JavaThread "opensearch[5410a48b36ce][transport_worker][T#3]" daemon [_thread_in_native, id=386, stack(0x00007f2697dfe000,0x00007f2697eff000)]
  0x00007f26c499af50 JavaThread "opensearch[5410a48b36ce][transport_worker][T#2]" daemon [_thread_in_native, id=387, stack(0x00007f2697cfd000,0x00007f2697dfe000)]
  0x00007f26b4011d70 JavaThread "opensearch[5410a48b36ce][generic][T#6]" daemon [_thread_blocked, id=388, stack(0x00007f2697bfc000,0x00007f2697cfd000)]
  0x00007f26c4006060 JavaThread "opensearch[5410a48b36ce][generic][T#5]" daemon [_thread_blocked, id=389, stack(0x00007f2697afb000,0x00007f2697bfc000)]
  0x00007f26c4066220 JavaThread "opensearch[5410a48b36ce][transport_worker][T#6]" daemon [_thread_in_native, id=390, stack(0x00007f26979fa000,0x00007f2697afb000)]
  0x00007f26dc01d710 JavaThread "opensearch[5410a48b36ce][transport_worker][T#7]" daemon [_thread_in_native, id=391, stack(0x00007f26978f9000,0x00007f26979fa000)]
  0x00007f26b4009cb0 JavaThread "opensearch[5410a48b36ce][transport_worker][T#5]" daemon [_thread_in_native, id=392, stack(0x00007f26977f8000,0x00007f26978f9000)]
  0x00007f26c4160d80 JavaThread "opensearch[5410a48b36ce][transport_worker][T#8]" daemon [_thread_in_native, id=393, stack(0x00007f26976f7000,0x00007f26977f8000)]
  0x00007f26dc01dc80 JavaThread "opensearch[5410a48b36ce][transport_worker][T#9]" daemon [_thread_in_native, id=394, stack(0x00007f26975f6000,0x00007f26976f7000)]
  0x00007f26c4998ec0 JavaThread "opensearch[5410a48b36ce][transport_worker][T#11]" daemon [_thread_in_native, id=395, stack(0x00007f26974f5000,0x00007f26975f6000)]
  0x00007f26b401b880 JavaThread "opensearch[5410a48b36ce][transport_worker][T#10]" daemon [_thread_in_native, id=396, stack(0x00007f26973f4000,0x00007f26974f5000)]
  0x00007f26dc0472c0 JavaThread "opensearch[5410a48b36ce][transport_worker][T#12]" daemon [_thread_in_native, id=397, stack(0x00007f26972f3000,0x00007f26973f4000)]
  0x00007f26c4999d00 JavaThread "opensearch[5410a48b36ce][transport_worker][T#13]" daemon [_thread_in_native, id=398, stack(0x00007f26971f2000,0x00007f26972f3000)]
  0x00007f26b401c2f0 JavaThread "opensearch[5410a48b36ce][transport_worker][T#14]" daemon [_thread_in_native, id=399, stack(0x00007f26970f1000,0x00007f26971f2000)]
  0x00007f26dc047830 JavaThread "opensearch[5410a48b36ce][transport_worker][T#15]" daemon [_thread_in_native, id=400, stack(0x00007f2696ff0000,0x00007f26970f1000)]
  0x00007f26c4127390 JavaThread "opensearch[5410a48b36ce][transport_worker][T#16]" daemon [_thread_in_native, id=401, stack(0x00007f2696eef000,0x00007f2696ff0000)]
  0x00007f26b401d290 JavaThread "opensearch[5410a48b36ce][transport_worker][T#17]" daemon [_thread_in_native, id=402, stack(0x00007f2696dee000,0x00007f2696eef000)]
  0x00007f26dc04b090 JavaThread "opensearch[5410a48b36ce][transport_worker][T#18]" daemon [_thread_in_native, id=403, stack(0x00007f2696ced000,0x00007f2696dee000)]
  0x00007f26b4060d20 JavaThread "opensearch[5410a48b36ce][transport_worker][T#19]" daemon [_thread_in_native, id=404, stack(0x00007f2696bec000,0x00007f2696ced000)]
  0x00007f26dc04c120 JavaThread "opensearch[5410a48b36ce][transport_worker][T#20]" daemon [_thread_in_native, id=405, stack(0x00007f2696aeb000,0x00007f2696bec000)]
  0x00007f2644105c00 JavaThread "opensearch[5410a48b36ce][management][T#1]" daemon [_thread_blocked, id=413, stack(0x00007f26c0dfe000,0x00007f26c0eff000)]
  0x00007f2688121d90 JavaThread "opensearch[5410a48b36ce][management][T#2]" daemon [_thread_blocked, id=420, stack(0x00007f26966e7000,0x00007f26967e8000)]
  0x00007f2738727970 JavaThread "opensearch[5410a48b36ce][AsyncLucenePersistedState#updateTask][T#1]" daemon [_thread_blocked, id=424, stack(0x00007f273c708000,0x00007f273c809000)]
  0x00007f2708e31ab0 JavaThread "JFR Recorder Thread" daemon [_thread_blocked, id=431, stack(0x00007f26961e1000,0x00007f26962e2000)]
  0x00007f2708e523d0 JavaThread "JFR Periodic Tasks" daemon [_thread_blocked, id=433, stack(0x00007f2695160000,0x00007f2695261000)]
  0x00007f2668144820 JavaThread "opensearch[5410a48b36ce][fetch_shard_started][T#1]" daemon [_thread_blocked, id=439, stack(0x00007f26963e3000,0x00007f26964e4000)]
  0x00007f269c4b47a0 JavaThread "opensearch[5410a48b36ce][DanglingIndices#updateTask][T#1]" daemon [_thread_blocked, id=443, stack(0x00007f26969ea000,0x00007f2696aeb000)]
  0x00007f269c5ef780 JavaThread "opensearch[5410a48b36ce][snapshot][T#1]" daemon [_thread_blocked, id=444, stack(0x00007f26964e4000,0x00007f26965e5000)]
  0x00007f262838eab0 JavaThread "boundedElastic-evictor-1" daemon [_thread_blocked, id=445, stack(0x00007f26968e9000,0x00007f26969ea000)]
  0x00007f26200b9d00 JavaThread "reactor-nio-1-thread-1" [_thread_in_native, id=448, stack(0x00007f26956ff000,0x00007f2695800000)]
  0x00007f260412e5c0 JavaThread "parallel-1" daemon [_thread_blocked, id=449, stack(0x00007f26955fe000,0x00007f26956ff000)]
  0x00007f260413a6e0 JavaThread "parallel-2" daemon [_thread_blocked, id=450, stack(0x00007f26954fd000,0x00007f26955fe000)]
  0x00007f26dc0493e0 JavaThread "index-input-cleaner[T#1]" daemon [_thread_blocked, id=451, stack(0x00007f26953fc000,0x00007f26954fd000)]
  0x00007f25f8004130 JavaThread "reactor-nio-1-thread-2" [_thread_in_native, id=453, stack(0x00007f269405d000,0x00007f269415e000)]
  0x00007f260413d480 JavaThread "parallel-3" daemon [_thread_blocked, id=454, stack(0x00007f260aeff000,0x00007f260b000000)]
  0x00007f260413e3f0 JavaThread "parallel-4" daemon [_thread_blocked, id=455, stack(0x00007f260adfe000,0x00007f260aeff000)]
  0x00007f25e8003b70 JavaThread "reactor-nio-1-thread-3" [_thread_in_native, id=457, stack(0x00007f260abfc000,0x00007f260acfd000)]
  0x00007f25ec00c960 JavaThread "parallel-5" daemon [_thread_blocked, id=458, stack(0x00007f260aafb000,0x00007f260abfc000)]
  0x00007f25ec00d910 JavaThread "parallel-6" daemon [_thread_blocked, id=459, stack(0x00007f260a9fa000,0x00007f260aafb000)]
  0x00007f25dc00bd50 JavaThread "parallel-7" daemon [_thread_blocked, id=460, stack(0x00007f260a8f9000,0x00007f260a9fa000)]
  0x00007f25dc00ca70 JavaThread "parallel-8" daemon [_thread_blocked, id=461, stack(0x00007f260a7f8000,0x00007f260a8f9000)]
  0x00007f25d0003ca0 JavaThread "reactor-nio-1-thread-4" [_thread_in_native, id=463, stack(0x00007f260a5f6000,0x00007f260a6f7000)]
  0x00007f25ec00f830 JavaThread "parallel-9" daemon [_thread_blocked, id=464, stack(0x00007f260a4f5000,0x00007f260a5f6000)]
  0x00007f25ec005120 JavaThread "parallel-10" daemon [_thread_blocked, id=465, stack(0x00007f260a3f4000,0x00007f260a4f5000)]
  0x00007f25ec006280 JavaThread "parallel-11" daemon [_thread_blocked, id=466, stack(0x00007f260a2f3000,0x00007f260a3f4000)]
  0x00007f25ec007530 JavaThread "parallel-12" daemon [_thread_blocked, id=467, stack(0x00007f260a1f2000,0x00007f260a2f3000)]
  0x00007f25ec0088e0 JavaThread "parallel-13" daemon [_thread_blocked, id=468, stack(0x00007f260a0f1000,0x00007f260a1f2000)]
  0x00007f25ec009c30 JavaThread "parallel-14" daemon [_thread_blocked, id=470, stack(0x00007f2609ff0000,0x00007f260a0f1000)]
  0x00007f25dc004e10 JavaThread "parallel-15" daemon [_thread_blocked, id=471, stack(0x00007f2609eef000,0x00007f2609ff0000)]
  0x00007f25dc006190 JavaThread "parallel-16" daemon [_thread_blocked, id=472, stack(0x00007f2609dee000,0x00007f2609eef000)]
  0x00007f25dc0077d0 JavaThread "parallel-17" daemon [_thread_blocked, id=473, stack(0x00007f2609ced000,0x00007f2609dee000)]
  0x00007f25dc009370 JavaThread "parallel-18" daemon [_thread_blocked, id=474, stack(0x00007f2609bec000,0x00007f2609ced000)]
  0x00007f25ec00af60 JavaThread "parallel-19" daemon [_thread_blocked, id=475, stack(0x00007f2609aeb000,0x00007f2609bec000)]
  0x00007f25ec00c2e0 JavaThread "parallel-20" daemon [_thread_blocked, id=476, stack(0x00007f26099ea000,0x00007f2609aeb000)]
  0x00007f25d0005060 JavaThread "reactor-nio-1-thread-5" [_thread_in_native, id=478, stack(0x00007f26097e8000,0x00007f26098e9000)]
  0x00007f25e8005f40 JavaThread "reactor-nio-1-thread-6" [_thread_in_native, id=480, stack(0x00007f26095e6000,0x00007f26096e7000)]
  0x00007f25f8005530 JavaThread "reactor-nio-1-thread-7" [_thread_in_native, id=482, stack(0x00007f26093e4000,0x00007f26094e5000)]
  0x00007f269c2c2160 JavaThread "opensearch[5410a48b36ce][generic][T#7]" daemon [_thread_blocked, id=483, stack(0x00007f26092e3000,0x00007f26093e4000)]
  0x00007f25740047d0 JavaThread "reactor-nio-1-thread-8" [_thread_in_native, id=485, stack(0x00007f26090e1000,0x00007f26091e2000)]
  0x00007f269c2c5740 JavaThread "opensearch[5410a48b36ce][generic][T#8]" daemon [_thread_blocked, id=487, stack(0x00007f2608edf000,0x00007f2608fe0000)]
  0x00007f2564003b60 JavaThread "reactor-nio-1-thread-9" [_thread_in_native, id=489, stack(0x00007f2608cdd000,0x00007f2608dde000)]
  0x00007f273026b3a0 JavaThread "opensearch[5410a48b36ce][generic][T#9]" daemon [_thread_blocked, id=490, stack(0x00007f2608bdc000,0x00007f2608cdd000)]
  0x00007f2560003dc0 JavaThread "reactor-nio-1-thread-10" [_thread_in_native, id=492, stack(0x00007f26089da000,0x00007f2608adb000)]
  0x00007f269c2c6b00 JavaThread "opensearch[5410a48b36ce][generic][T#10]" daemon [_thread_blocked, id=493, stack(0x00007f26088d9000,0x00007f26089da000)]
  0x00007f26200c6ad0 JavaThread "reactor-nio-1-thread-11" [_thread_in_native, id=495, stack(0x00007f26086d7000,0x00007f26087d8000)]
  0x00007f269c2ca230 JavaThread "opensearch[5410a48b36ce][generic][T#11]" daemon [_thread_blocked, id=498, stack(0x00007f26083d4000,0x00007f26084d5000)]
  0x00007f25400045d0 JavaThread "reactor-nio-1-thread-12" [_thread_in_native, id=500, stack(0x00007f26081d2000,0x00007f26082d3000)]
  0x00007f269c2cbf20 JavaThread "opensearch[5410a48b36ce][generic][T#12]" daemon [_thread_blocked, id=502, stack(0x00007f2533eff000,0x00007f2534000000)]
  0x00007f2528004260 JavaThread "reactor-nio-1-thread-13" [_thread_in_native, id=504, stack(0x00007f2533cfd000,0x00007f2533dfe000)]
  0x00007f269c2cd3a0 JavaThread "opensearch[5410a48b36ce][generic][T#13]" daemon [_thread_blocked, id=505, stack(0x00007f2533bfc000,0x00007f2533cfd000)]
  0x00007f2520011090 JavaThread "reactor-nio-1-thread-14" [_thread_in_native, id=507, stack(0x00007f25339fa000,0x00007f2533afb000)]
  0x00007f269c2ce0b0 JavaThread "opensearch[5410a48b36ce][generic][T#14]" daemon [_thread_blocked, id=508, stack(0x00007f25338f9000,0x00007f25339fa000)]
  0x00007f269c2cf1c0 JavaThread "opensearch[5410a48b36ce][generic][T#15]" daemon [_thread_blocked, id=509, stack(0x00007f25337f8000,0x00007f25338f9000)]
  0x00007f254c004610 JavaThread "reactor-nio-1-thread-15" [_thread_in_native, id=511, stack(0x00007f25335f6000,0x00007f25336f7000)]
  0x00007f2520007cd0 JavaThread "reactor-nio-1-thread-16" [_thread_in_native, id=513, stack(0x00007f25333f4000,0x00007f25334f5000)]
  0x00007f269c2d05a0 JavaThread "opensearch[5410a48b36ce][generic][T#16]" daemon [_thread_blocked, id=514, stack(0x00007f25332f3000,0x00007f25333f4000)]
  0x00007f25600059f0 JavaThread "reactor-nio-1-thread-17" [_thread_in_native, id=516, stack(0x00007f25330f1000,0x00007f25331f2000)]
  0x00007f269c2d10d0 JavaThread "opensearch[5410a48b36ce][generic][T#17]" daemon [_thread_blocked, id=517, stack(0x00007f2532ff0000,0x00007f25330f1000)]
  0x00007f25200099d0 JavaThread "reactor-nio-1-thread-18" [_thread_in_native, id=519, stack(0x00007f2532dee000,0x00007f2532eef000)]
  0x00007f269c2d24f0 JavaThread "opensearch[5410a48b36ce][generic][T#18]" daemon [_thread_blocked, id=520, stack(0x00007f2532ced000,0x00007f2532dee000)]
  0x00007f256800bef0 JavaThread "reactor-nio-1-thread-19" [_thread_in_native, id=522, stack(0x00007f2532aeb000,0x00007f2532bec000)]
  0x00007f269c2d3520 JavaThread "opensearch[5410a48b36ce][generic][T#19]" daemon [_thread_blocked, id=523, stack(0x00007f25329ea000,0x00007f2532aeb000)]
  0x00007f256c097450 JavaThread "reactor-nio-1-thread-20" [_thread_in_native, id=525, stack(0x00007f25327e8000,0x00007f25328e9000)]
  0x00007f2730152cf0 JavaThread "opensearch[5410a48b36ce][generic][T#20]" daemon [_thread_blocked, id=526, stack(0x00007f25326e7000,0x00007f25327e8000)]
  0x00007f269c125d60 JavaThread "opensearch[5410a48b36ce][generic][T#21]" daemon [_thread_blocked, id=535, stack(0x00007f26960e0000,0x00007f26961e1000)]
  0x00007f273014e720 JavaThread "opensearch[5410a48b36ce][refresh][T#1]" daemon [_thread_blocked, id=536, stack(0x00007f27242fb000,0x00007f27243fc000)]
  0x00007f281a391f80 JavaThread "opensearch[keepAlive/2.6.0]" [_thread_blocked, id=540, stack(0x00007f2532097000,0x00007f2532198000)]
  0x00007f281802b9d0 JavaThread "DestroyJavaVM" [_thread_blocked, id=251, stack(0x00007f2821472000,0x00007f2821573000)]
  0x00007f2664148500 JavaThread "opensearch[5410a48b36ce][fetch_shard_store][T#1]" daemon [_thread_blocked, id=541, stack(0x00007f26c0cfd000,0x00007f26c0dfe000)]
  0x00007f27302700e0 JavaThread "opensearch[5410a48b36ce][flush][T#1]" daemon [_thread_blocked, id=543, stack(0x00007f26967e8000,0x00007f26968e9000)]
  0x00007f269c312470 JavaThread "opensearch[5410a48b36ce][generic][T#22]" daemon [_thread_blocked, id=557, stack(0x00007f252719e000,0x00007f252729f000)]
  0x00007f25800674d0 JavaThread "opensearch[5410a48b36ce][management][T#3]" daemon [_thread_blocked, id=578, stack(0x00007f26080d1000,0x00007f26081d2000)]
  0x00007f26c404ac10 JavaThread "opensearch[5410a48b36ce][management][T#4]" daemon [_thread_blocked, id=579, stack(0x00007f2608fe0000,0x00007f26090e1000)]
  0x00007f26c408e030 JavaThread "opensearch[5410a48b36ce][management][T#5]" daemon [_thread_blocked, id=583, stack(0x00007f27240f9000,0x00007f27241fa000)]
  0x00007f252002ae00 JavaThread "reactor-nio-1-thread-21" [_thread_in_native, id=599, stack(0x00007f2526f0a000,0x00007f252700b000)]
  0x00007f269c3f3d00 JavaThread "opensearch[5410a48b36ce][generic][T#23]" daemon [_thread_blocked, id=600, stack(0x00007f2526572000,0x00007f2526673000)]
  0x00007f269c3f4db0 JavaThread "opensearch[5410a48b36ce][generic][T#24]" daemon [_thread_blocked, id=601, stack(0x00007f26084d5000,0x00007f26085d6000)]
  0x00007f269c3f5e20 JavaThread "opensearch[5410a48b36ce][generic][T#25]" daemon [_thread_blocked, id=602, stack(0x00007f2518784000,0x00007f2518885000)]
  0x00007f253402fa20 JavaThread "reactor-nio-1-thread-22" [_thread_in_native, id=604, stack(0x00007f2518364000,0x00007f2518465000)]
  0x00007f25f800cfb0 JavaThread "reactor-nio-1-thread-23" [_thread_in_native, id=606, stack(0x00007f25180ff000,0x00007f2518200000)]
  0x00007f269c3f7d50 JavaThread "opensearch[5410a48b36ce][generic][T#26]" daemon [_thread_blocked, id=607, stack(0x00007f2517fd8000,0x00007f25180d9000)]
  0x00007f253c098a50 JavaThread "reactor-nio-1-thread-24" [_thread_in_native, id=609, stack(0x00007f2517d0d000,0x00007f2517e0e000)]
  0x00007f2709059320 JavaThread "dd-profiler-http-dispatcher" daemon [_thread_blocked, id=616, stack(0x00007f26085d6000,0x00007f26086d7000)]
  0x00007f26ac588b70 JavaThread "OkHttp ConnectionPool" daemon [_thread_blocked, id=617, stack(0x00007f273fce9000,0x00007f273fdea000)]
  0x00007f27300e05b0 JavaThread "opensearch[5410a48b36ce][flush][T#2]" daemon [_thread_blocked, id=622, stack(0x00007f252729f000,0x00007f25273a0000)]
  0x00007f27300e1060 JavaThread "opensearch[5410a48b36ce][flush][T#3]" daemon [_thread_blocked, id=623, stack(0x00007f27268fb000,0x00007f27269fc000)]
  0x00007f269c1707f0 JavaThread "opensearch[5410a48b36ce][generic][T#27]" daemon [_thread_blocked, id=635, stack(0x00007f26c02ff000,0x00007f26c0400000)]
  0x00007f25640569a0 JavaThread "reactor-nio-1-thread-25" [_thread_in_native, id=637, stack(0x00007f2513f51000,0x00007f2514052000)]
  0x00007f269c1715d0 JavaThread "opensearch[5410a48b36ce][generic][T#28]" daemon [_thread_blocked, id=638, stack(0x00007f250f33d000,0x00007f250f43e000)]
  0x00007f26d4004d70 JavaThread "reactor-nio-1-thread-26" [_thread_in_native, id=640, stack(0x00007f250f131000,0x00007f250f232000)]
  0x00007f269c09a0b0 JavaThread "opensearch[5410a48b36ce][generic][T#29]" daemon [_thread_blocked, id=642, stack(0x00007f250edf7000,0x00007f250eef8000)]
  0x00007f269c09abb0 JavaThread "opensearch[5410a48b36ce][generic][T#30]" daemon [_thread_blocked, id=643, stack(0x00007f250ecf6000,0x00007f250edf7000)]
  0x00007f25740bd880 JavaThread "reactor-nio-1-thread-27" [_thread_in_native, id=645, stack(0x00007f250eae8000,0x00007f250ebe9000)]
  0x00007f27300c0360 JavaThread "opensearch[5410a48b36ce][flush][T#4]" daemon [_thread_blocked, id=663, stack(0x00007f26c0eff000,0x00007f26c1000000)]
  0x00007f27300c0e70 JavaThread "opensearch[5410a48b36ce][flush][T#5]" daemon [_thread_blocked, id=664, stack(0x00007f250f00b000,0x00007f250f10c000)]
  0x00007f27a0006340 JavaThread "reactor-nio-1-thread-28" [_thread_in_native, id=689, stack(0x00007f250d763000,0x00007f250d864000)]
  0x00007f27301d4b40 JavaThread "opensearch[5410a48b36ce][generic][T#31]" daemon [_thread_blocked, id=690, stack(0x00007f2509391000,0x00007f2509492000)]
  0x00007f27301d5670 JavaThread "opensearch[5410a48b36ce][generic][T#32]" daemon [_thread_blocked, id=691, stack(0x00007f2503e2b000,0x00007f2503f2c000)]
  0x00007f27301d50b0 JavaThread "opensearch[5410a48b36ce][generic][T#33]" daemon [_thread_blocked, id=692, stack(0x00007f2500649000,0x00007f250074a000)]
  0x00007f26695ad2b0 JavaThread "opensearch[5410a48b36ce][search][T#1]" daemon [_thread_blocked, id=695, stack(0x00007f24e0e76000,0x00007f24e0f77000)]
  0x00007f26694e0580 JavaThread "opensearch[5410a48b36ce][search][T#2]" daemon [_thread_blocked, id=696, stack(0x00007f24e0d73000,0x00007f24e0e74000)]
  0x00007f26694df770 JavaThread "opensearch[5410a48b36ce][search][T#3]" daemon [_thread_blocked, id=698, stack(0x00007f24e0c72000,0x00007f24e0d73000)]
  0x00007f26694e18d0 JavaThread "opensearch[5410a48b36ce][search][T#4]" daemon [_thread_blocked, id=699, stack(0x00007f24e0b71000,0x00007f24e0c72000)]
  0x00007f2669651720 JavaThread "opensearch[5410a48b36ce][search][T#5]" daemon [_thread_blocked, id=700, stack(0x00007f24e0a70000,0x00007f24e0b71000)]
  0x00007f2669653b40 JavaThread "opensearch[5410a48b36ce][search][T#6]" daemon [_thread_blocked, id=701, stack(0x00007f24e096f000,0x00007f24e0a70000)]
  0x00007f2669656830 JavaThread "opensearch[5410a48b36ce][search][T#7]" daemon [_thread_blocked, id=702, stack(0x00007f24e086e000,0x00007f24e096f000)]
  0x00007f2669657c00 JavaThread "opensearch[5410a48b36ce][search][T#8]" daemon [_thread_blocked, id=703, stack(0x00007f24e076d000,0x00007f24e086e000)]
  0x00007f266965a080 JavaThread "opensearch[5410a48b36ce][search][T#9]" daemon [_thread_blocked, id=704, stack(0x00007f24e066c000,0x00007f24e076d000)]
  0x00007f266965b4a0 JavaThread "opensearch[5410a48b36ce][search][T#10]" daemon [_thread_blocked, id=705, stack(0x00007f24e056b000,0x00007f24e066c000)]
  0x00007f266965c8d0 JavaThread "opensearch[5410a48b36ce][search][T#11]" daemon [_thread_blocked, id=706, stack(0x00007f24e046a000,0x00007f24e056b000)]
  0x00007f2669666cd0 JavaThread "opensearch[5410a48b36ce][search][T#12]" daemon [_thread_blocked, id=707, stack(0x00007f24e0369000,0x00007f24e046a000)]
  0x00007f26696699d0 JavaThread "opensearch[5410a48b36ce][search][T#13]" daemon [_thread_blocked, id=708, stack(0x00007f24e0268000,0x00007f24e0369000)]
  0x00007f266966ad90 JavaThread "opensearch[5410a48b36ce][search][T#14]" daemon [_thread_blocked, id=709, stack(0x00007f24e0165000,0x00007f24e0266000)]
  0x00007f2669675210 JavaThread "opensearch[5410a48b36ce][search][T#15]" daemon [_thread_blocked, id=710, stack(0x00007f24e0064000,0x00007f24e0165000)]
  0x00007f26696865f0 JavaThread "opensearch[5410a48b36ce][search][T#16]" daemon [_thread_blocked, id=711, stack(0x00007f24dff63000,0x00007f24e0064000)]
  0x00007f2669690230 JavaThread "opensearch[5410a48b36ce][search][T#17]" daemon [_thread_blocked, id=712, stack(0x00007f24dfe62000,0x00007f24dff63000)]
  0x00007f266969bcf0 JavaThread "opensearch[5410a48b36ce][search][T#18]" daemon [_thread_blocked, id=713, stack(0x00007f24dfd61000,0x00007f24dfe62000)]
  0x00007f266969ae60 JavaThread "opensearch[5410a48b36ce][search][T#19]" daemon [_thread_blocked, id=714, stack(0x00007f24dfc60000,0x00007f24dfd61000)]
  0x00007f266969cad0 JavaThread "opensearch[5410a48b36ce][search][T#20]" daemon [_thread_blocked, id=715, stack(0x00007f24dfb5f000,0x00007f24dfc60000)]
  0x00007f266969dee0 JavaThread "opensearch[5410a48b36ce][search][T#21]" daemon [_thread_blocked, id=716, stack(0x00007f24dfa5e000,0x00007f24dfb5f000)]
  0x00007f266969fb20 JavaThread "opensearch[5410a48b36ce][search][T#22]" daemon [_thread_blocked, id=717, stack(0x00007f24df95d000,0x00007f24dfa5e000)]
  0x00007f26696a8f10 JavaThread "opensearch[5410a48b36ce][search][T#23]" daemon [_thread_blocked, id=718, stack(0x00007f24df85c000,0x00007f24df95d000)]
  0x00007f26696aa300 JavaThread "opensearch[5410a48b36ce][search][T#24]" daemon [_thread_blocked, id=719, stack(0x00007f24df75b000,0x00007f24df85c000)]
  0x00007f26696ab700 JavaThread "opensearch[5410a48b36ce][search][T#25]" daemon [_thread_blocked, id=720, stack(0x00007f24df65a000,0x00007f24df75b000)]
  0x00007f26696acb30 JavaThread "opensearch[5410a48b36ce][search][T#26]" daemon [_thread_blocked, id=721, stack(0x00007f24df557000,0x00007f24df658000)]
  0x00007f26696adb20 JavaThread "opensearch[5410a48b36ce][search][T#27]" daemon [_thread_blocked, id=722, stack(0x00007f24df456000,0x00007f24df557000)]
  0x00007f26696afbb0 JavaThread "opensearch[5410a48b36ce][search][T#28]" daemon [_thread_blocked, id=723, stack(0x00007f24df355000,0x00007f24df456000)]
  0x00007f26696b0fb0 JavaThread "opensearch[5410a48b36ce][search][T#29]" daemon [_thread_blocked, id=724, stack(0x00007f24df252000,0x00007f24df353000)]
  0x00007f26696b23b0 JavaThread "opensearch[5410a48b36ce][search][T#30]" daemon [_thread_blocked, id=725, stack(0x00007f24df151000,0x00007f24df252000)]
  0x00007f26696c37a0 JavaThread "opensearch[5410a48b36ce][search][T#31]" daemon [_thread_blocked, id=726, stack(0x00007f24df050000,0x00007f24df151000)]
  0x00007f2638c463b0 JavaThread "reactor-nio-1-thread-29" [_thread_in_native, id=765, stack(0x00007f2533dfe000,0x00007f2533eff000)]
  0x00007f27301ce8c0 JavaThread "opensearch[5410a48b36ce][opensearch_asynchronous_search_generic][T#1]" daemon [_thread_blocked, id=771, stack(0x00007f24db601000,0x00007f24db702000)]
  0x00007f269c53d210 JavaThread "opensearch[5410a48b36ce][generic][T#34]" daemon [_thread_blocked, id=772, stack(0x00007f24def47000,0x00007f24df048000)]
  0x00007f269c03a060 JavaThread "opensearch[5410a48b36ce][generic][T#35]" daemon [_thread_blocked, id=773, stack(0x00007f24db3ff000,0x00007f24db500000)]
  0x00007f269c48d4e0 JavaThread "opensearch[5410a48b36ce][generic][T#36]" daemon [_thread_blocked, id=774, stack(0x00007f24e0f7f000,0x00007f24e1080000)]
  0x00007f269c039670 JavaThread "opensearch[5410a48b36ce][generic][T#37]" daemon [_thread_blocked, id=775, stack(0x00007f24db500000,0x00007f24db601000)]
  0x00007f269c48df00 JavaThread "opensearch[5410a48b36ce][generic][T#38]" daemon [_thread_blocked, id=776, stack(0x00007f24c9415000,0x00007f24c9516000)]
  0x00007f269c03b110 JavaThread "opensearch[5410a48b36ce][generic][T#39]" daemon [_thread_blocked, id=777, stack(0x00007f24c928c000,0x00007f24c938d000)]
  0x00007f25640316b0 JavaThread "reactor-nio-1-thread-30" [_thread_in_native, id=779, stack(0x00007f24c6051000,0x00007f24c6152000)]
  0x00007f261c29a1a0 JavaThread "reactor-nio-1-thread-31" [_thread_in_native, id=781, stack(0x00007f24c5da3000,0x00007f24c5ea4000)]
  0x00007f26240882a0 JavaThread "reactor-nio-1-thread-32" [_thread_in_native, id=783, stack(0x00007f24c5ad4000,0x00007f24c5bd5000)]
  0x00007f2554036a00 JavaThread "reactor-nio-1-thread-33" [_thread_in_native, id=785, stack(0x00007f24c5858000,0x00007f24c5959000)]
  0x00007f2534060260 JavaThread "reactor-nio-1-thread-34" [_thread_in_native, id=787, stack(0x00007f24c5574000,0x00007f24c5675000)]
  0x00007f27300dcc20 JavaThread "opendistro_job_sweeper[T#1]" daemon [_thread_blocked, id=788, stack(0x00007f24c1492000,0x00007f24c1593000)]
  0x00007f26a04b65f0 JavaThread "DefaultDispatcher-worker-1" daemon [_thread_blocked, id=789, stack(0x00007f24bff75000,0x00007f24c0076000)]
  0x00007f26a04b7bd0 JavaThread "DefaultDispatcher-worker-2" daemon [_thread_blocked, id=790, stack(0x00007f24bfe74000,0x00007f24bff75000)]
  0x00007f259803a4e0 JavaThread "boundedElastic-43" daemon [_thread_blocked, id=804, stack(0x00007f24c5eae000,0x00007f24c5faf000)]
  0x00007f2568043430 JavaThread "boundedElastic-44" daemon [_thread_blocked, id=805, stack(0x00007f2608dde000,0x00007f2608edf000)]
  0x00007f25a0043ca0 JavaThread "boundedElastic-42" daemon [_thread_blocked, id=806, stack(0x00007f250ebf3000,0x00007f250ecf4000)]
  0x00007f259002f930 JavaThread "boundedElastic-45" daemon [_thread_blocked, id=807, stack(0x00007f24dee44000,0x00007f24def45000)]
  0x00007f259402ca60 JavaThread "boundedElastic-46" daemon [_thread_blocked, id=808, stack(0x00007f260acfd000,0x00007f260adfe000)]

Other Threads:
  0x00007f28191292e0 VMThread "VM Thread" [stack: 0x00007f27acf00000,0x00007f27ad000000] [id=271] _threads_hazard_ptr=0x00007f25d042b850
  0x00007f281a36f710 WatcherThread [stack: 0x00007f27276f8000,0x00007f27277f8000] [id=296]
  0x00007f2818060f50 GCTaskThread "GC Thread#0" [stack: 0x00007f281d24f000,0x00007f281d34f000] [id=252]
  0x00007f28180b3770 GCTaskThread "GC Thread#1" [stack: 0x00007f281c03a000,0x00007f281c13a000] [id=255]
  0x00007f28180b4560 GCTaskThread "GC Thread#2" [stack: 0x00007f27bcacc000,0x00007f27bcbcc000] [id=256]
  0x00007f28180b5350 GCTaskThread "GC Thread#3" [stack: 0x00007f27bc9ca000,0x00007f27bcaca000] [id=257]
  0x00007f28180b61b0 GCTaskThread "GC Thread#4" [stack: 0x00007f27bc8c8000,0x00007f27bc9c8000] [id=258]
  0x00007f28180b7010 GCTaskThread "GC Thread#5" [stack: 0x00007f27bc7c6000,0x00007f27bc8c6000] [id=259]
  0x00007f28180b7e70 GCTaskThread "GC Thread#6" [stack: 0x00007f27bc6c4000,0x00007f27bc7c4000] [id=260]
  0x00007f28180b8cd0 GCTaskThread "GC Thread#7" [stack: 0x00007f27bc5c2000,0x00007f27bc6c2000] [id=261]
  0x00007f28180b9b30 GCTaskThread "GC Thread#8" [stack: 0x00007f27bc4c0000,0x00007f27bc5c0000] [id=262]
  0x00007f28180ba990 GCTaskThread "GC Thread#9" [stack: 0x00007f27bc3be000,0x00007f27bc4be000] [id=263]
  0x00007f28180bb7f0 GCTaskThread "GC Thread#10" [stack: 0x00007f27bc2bc000,0x00007f27bc3bc000] [id=264]
  0x00007f28180bc650 GCTaskThread "GC Thread#11" [stack: 0x00007f27bc1ba000,0x00007f27bc2ba000] [id=265]
  0x00007f28180bd4b0 GCTaskThread "GC Thread#12" [stack: 0x00007f27bc0b8000,0x00007f27bc1b8000] [id=266]
  0x00007f28180be310 GCTaskThread "GC Thread#13" [stack: 0x00007f27adf00000,0x00007f27ae000000] [id=267]
  0x00007f28180bf170 GCTaskThread "GC Thread#14" [stack: 0x00007f27addfe000,0x00007f27adefe000] [id=268]
  0x00007f2818071510 ConcurrentGCThread "G1 Main Marker" [stack: 0x00007f281d14d000,0x00007f281d24d000] [id=253]
  0x00007f2818072410 ConcurrentGCThread "G1 Conc#0" [stack: 0x00007f281d04b000,0x00007f281d14b000] [id=254]
  0x00007f27b0000cf0 ConcurrentGCThread "G1 Conc#1" [stack: 0x00007f27274f5000,0x00007f27275f5000] [id=298]
  0x00007f27b0001740 ConcurrentGCThread "G1 Conc#2" [stack: 0x00007f2726b00000,0x00007f2726c00000] [id=299]
  0x00007f27b00021c0 ConcurrentGCThread "G1 Conc#3" [stack: 0x00007f27269fe000,0x00007f2726afe000] [id=300]
  0x00007f28190faf60 ConcurrentGCThread "G1 Refine#0" [stack: 0x00007f27adcfc000,0x00007f27addfc000] [id=269]
  0x00007f2774008180 ConcurrentGCThread "G1 Refine#1" [stack: 0x00007f26965e7000,0x00007f26966e7000] [id=421]
  0x00007f262406d630 ConcurrentGCThread "G1 Refine#2" [stack: 0x00007f2515a53000,0x00007f2515b53000] [id=618]
  0x00007f26cc00cae0 ConcurrentGCThread "G1 Refine#3" [stack: 0x00007f2515951000,0x00007f2515a51000] [id=619]
  0x00007f26e80203d0 ConcurrentGCThread "G1 Refine#4" [stack: 0x00007f251584d000,0x00007f251594d000] [id=620]
  0x00007f267c04b260 ConcurrentGCThread "G1 Refine#5" [stack: 0x00007f2515749000,0x00007f2515849000] [id=621]
  0x00007f2688124100 ConcurrentGCThread "G1 Refine#6" [stack: 0x00007f2511911000,0x00007f2511a11000] [id=633]
  0x00007f26b8001310 ConcurrentGCThread "G1 Refine#7" [stack: 0x00007f250bbd7000,0x00007f250bcd7000] [id=650]
  0x00007f2709109980 ConcurrentGCThread "G1 Refine#8" [stack: 0x00007f250bad3000,0x00007f250bbd3000] [id=651]
  0x00007f2704009b00 ConcurrentGCThread "G1 Refine#9" [stack: 0x00007f250b9cd000,0x00007f250bacd000] [id=652]
  0x00007f2710008270 ConcurrentGCThread "G1 Refine#10" [stack: 0x00007f250b8c1000,0x00007f250b9c1000] [id=653]
  0x00007f270c01e240 ConcurrentGCThread "G1 Refine#11" [stack: 0x00007f250b7bf000,0x00007f250b8bf000] [id=654]
  0x00007f27180056d0 ConcurrentGCThread "G1 Refine#12" [stack: 0x00007f250b6b9000,0x00007f250b7b9000] [id=655]
  0x00007f2714001410 ConcurrentGCThread "G1 Refine#13" [stack: 0x00007f250b5b1000,0x00007f250b6b1000] [id=656]
  0x00007f271c0053b0 ConcurrentGCThread "G1 Refine#14" [stack: 0x00007f250b4a7000,0x00007f250b5a7000] [id=657]
  0x00007f28190fbe50 ConcurrentGCThread "G1 Service" [stack: 0x00007f27adbfa000,0x00007f27adcfa000] [id=270]

Threads with active compile tasks:
C2 CompilerThread0   620802 34467       4       org.datadog.jmxfetch.JmxSimpleAttribute::match (70 bytes)
C1 CompilerThread0   620802 34468       3       org.datadog.jmxfetch.JmxAttribute::getBeanParametersList (302 bytes)

VM state: synchronizing (normal execution)

VM Mutex/Monitor currently owned by a thread:  ([mutex/lock_event])
[0x00007f281801bf60] CodeCache_lock - owner thread: 0x00007f2819138670
[0x00007f28180280b0] Threads_lock - owner thread: 0x00007f28191292e0
[0x00007f2818028860] Heap_lock - owner thread: 0x00007f26696c37a0
[0x00007f2818029000] Compile_lock - owner thread: 0x00007f2819138670
[0x00007f2818029210] MethodCompileQueue_lock - owner thread: 0x00007f2819138670

OutOfMemory and StackOverflow Exception counts:
OutOfMemoryError java_heap_errors=3
LinkageErrors=3645

Heap address: 0x0000000080000000, size: 30720 MB, Compressed Oops mode: Zero based, Oop shift amount: 3

CDS archive(s) mapped at: [0x0000000800000000-0x0000000800a7b000-0x0000000800a7b000), size 10989568, SharedBaseAddress: 0x0000000800000000, ArchiveRelocationMode: 0.
Compressed class space mapped at: 0x0000000800c00000-0x0000000840c00000, reserved size: 1073741824
Narrow klass base: 0x0000000800000000, Narrow klass shift: 0, Narrow klass range: 0x100000000

GC Precious Log:
 CPUs: 20 total, 20 available
 Memory: 157G
 Large Page Support: Disabled
 NUMA Support: Disabled
 Compressed Oops: Enabled (Zero based)
 Heap Region Size: 16M
 Heap Min Capacity: 30G
 Heap Initial Capacity: 30G
 Heap Max Capacity: 30G
 Pre-touch: Enabled
 Parallel Workers: 15
 Concurrent Workers: 4
 Concurrent Refinement Workers: 15
 Periodic GC: Disabled

Heap:
 garbage-first heap   total 31457280K, used 8678319K [0x0000000080000000, 0x0000000800000000)
  region size 16384K, 6 young (98304K), 0 survivors (0K)
 Metaspace       used 156649K, committed 158272K, reserved 1187840K
  class space    used 19616K, committed 20352K, reserved 1048576K

@dgilling
Author

dgilling commented Mar 3, 2023

@andrross @kartg Did some more testing given the discussion about potentially too many file descriptors from the remote cache. After a second pass at the crash report, we did see over 200k file descriptors open to remote-store-related files.

Since I'm not too familiar with the remote cache code, let me know if those are not the right things to look at.
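
(For anyone reproducing, a quick way to count open descriptors on a live node looks roughly like this; the pgrep pattern is an assumption about how the OpenSearch process is named:)

OS_PID=$(pgrep -f org.opensearch.bootstrap.OpenSearch)
ls /proc/$OS_PID/fd | wc -l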

While it seemed to help prevent the crash, it uncovered some new exceptions:

Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.lucene.store.IndexInput.clone()" because "this.luceneIndexInput" is null
    at org.opensearch.index.store.remote.filecache.FileCachedIndexInput.clone(FileCachedIndexInput.java:131) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.filecache.FileCachedIndexInput.clone(FileCachedIndexInput.java:26) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.utils.TransferManager.fetchBlob(TransferManager.java:56) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.fetchBlock(OnDemandBlockSnapshotIndexInput.java:148) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.demandBlock(OnDemandBlockIndexInput.java:347) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.seekInternal(OnDemandBlockIndexInput.java:318) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.seek(OnDemandBlockIndexInput.java:216) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.seek(OnDemandBlockSnapshotIndexInput.java:28) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockIndexInput.readByte(OnDemandBlockIndexInput.java:157) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.remote.file.OnDemandBlockSnapshotIndexInput.readByte(OnDemandBlockSnapshotIndexInput.java:28) ~[opensearch-2.6.0.jar:2.6.0]
    at org.apache.lucene.codecs.CodecUtil.readBEInt(CodecUtil.java:667) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:184) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:253) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.<init>(Lucene90BlockTreeTermsReader.java:128) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.fieldsProducer(Lucene90PostingsFormat.java:427) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:330) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:392) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:118) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:92) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:94) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:768) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:146) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
    at org.opensearch.common.lucene.Lucene.readSegmentInfosExtendedCompatibility(Lucene.java:183) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.Store.readSegmentInfosExtendedCompatibility(Store.java:266) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:225) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:511) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:113) ~[opensearch-2.6.0.jar:2.6.0]
    at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-2.6.0.jar:2.6.0]

@andrross
Member

andrross commented Mar 3, 2023

Thanks @dgilling! We're clearly missing some limits for cases like yours. You did make the correct changes (if you're able to test with a 2.7 snapshot build, a configuration setting such as node.search.cache.size: 40GB is available there). We also have a backlog item to make that block size configurable.
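
For anyone following along, that setting would go into opensearch.yml on the node with the search role; the 40GB figure below is just the value from this thread, not a sizing recommendation:

# opensearch.yml (2.7+ snapshot build); size the cache for your workload
node.search.cache.size: 40GB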

We had a similar issue to the clone bug you just pointed out, but it looks like it is not completely fixed. I'll dig into that now.

@andrross
Member

andrross commented Mar 3, 2023

@dgilling It would also be interesting to see what kind of pressure the file cache is under during your testing, if at all possible. You can get snapshots of the file cache stats by doing GET _nodes/stats/file_cache:

{
    "_nodes": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "cluster_name": "opensearch",
    "nodes": {
        "XHirByVjTa-eAdliKVoSqg": {
            "file_cache": {
                "timestamp": 1677883656634,
                "active_in_bytes": 0,
                "total_in_bytes": 10737418240,
                "used_in_bytes": 0,
                "evicted_in_bytes": 0,
                "removed_in_bytes": 0,
                "replaced_count": 0,
                "active_percent": 0,
                "used_percent": 0,
                "cache_hits": 0,
                "cache_miss": 0
            }
        }
    }
}
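
If it's easier to capture this over time, a small loop like the one below can append a stats snapshot while the aggregations run (the host, port, output file, and interval are just placeholders):

# Append a file cache stats snapshot every 30 seconds during the test run
while true; do
  curl -s 'localhost:9200/_nodes/stats/file_cache' >> file_cache_stats.jsonl
  echo >> file_cache_stats.jsonl
  sleep 30
done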

@dgilling
Author

dgilling commented Mar 5, 2023

@andrross sure, let me know when you're able to push a fix for the NullPointerException (Cannot invoke "org.apache.lucene.store.IndexInput.clone()" because "this.luceneIndexInput" is null) and we can rerun and get some stats.

andrross added a commit to andrross/OpenSearch that referenced this issue Mar 9, 2023
The previous implementation had an inherent race condition where a
zero-reference count IndexInput read from the cache could be evicted
before the IndexInput was cloned (and therefore had its reference count
incremented). Since the IndexInputs are stateful this is very bad. The
least-recently-used semantics meant that in a properly-configured system
this would be unlikely since accessing a zero-reference count item would
move it to be most-recently used and therefore least likely to be
evicted. However, there was still a latent bug that was possible to
encounter (see issue opensearch-project#6295).

The only way to fix this, as far as I can see, is to change the cache
behavior so that fetching an item from the cache atomically
increments its reference count. This also led to a change to
TransferManager to ensure that all requests for an item ultimately read
through the cache to eliminate any possibility of a race. I have
implemented some concurrent unit tests that put the cache into a
worst-case thrashing scenario to ensure that concurrent access never
closes an IndexInput while it is still being used.

Signed-off-by: Andrew Ross <andrross@amazon.com>
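
The core of the change described above is that fetching an entry from the file cache must atomically increment its reference count, so eviction can no longer slip in between the lookup and the clone. A minimal, self-contained sketch of that pattern follows; the class and method names are illustrative only and are not the actual OpenSearch types:

import java.util.concurrent.ConcurrentHashMap;

final class RefCountedCache<K, V> {

    private static final class Entry<T> {
        final T value;
        int refCount;                 // only mutated inside compute*() calls, so per-key atomic
        Entry(T value) { this.value = value; }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();

    // Insert a value with a reference count of zero (i.e. currently evictable).
    void put(K key, V value) {
        map.put(key, new Entry<>(value));
    }

    // Look up a value and atomically take a reference on it; returns null if absent.
    V acquire(K key) {
        Entry<V> e = map.computeIfPresent(key, (k, entry) -> {
            entry.refCount++;         // increment happens inside the atomic per-key compute
            return entry;
        });
        return e == null ? null : e.value;
    }

    // Callers must release once they are done with the value.
    void release(K key) {
        map.computeIfPresent(key, (k, entry) -> {
            entry.refCount--;
            return entry;
        });
    }

    // Eviction only succeeds while nothing holds a reference; otherwise it is a no-op.
    boolean evictIfUnreferenced(K key) {
        return map.computeIfPresent(key, (k, entry) -> entry.refCount == 0 ? null : entry) == null;
    }
}

Because the increment happens inside the same atomic per-key operation as the lookup, there is no window in which another thread can observe a zero reference count and evict the entry while a caller is about to use it.
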
kotwanikunal pushed a commit that referenced this issue Mar 11, 2023
opensearch-trigger-bot bot pushed a commit that referenced this issue Mar 11, 2023
kotwanikunal pushed a commit that referenced this issue Mar 11, 2023
@kotwanikunal
Member

@dgilling Hey Derric!
The fix for concurrency issues has been merged in (#6592) and backported to 2.x (#6630).
I think this should resolve a bunch of issues with your tests, if you can give it a try.

@dgilling
Author

@kotwanikunal @andrross, looks like that did the trick. No more file descriptor explosion. Much appreciated. Will monitor and report on stats as we get more data.

github-project-automation bot moved this from In Progress to Done in Searchable Snapshots Mar 12, 2023
mingshl pushed a commit to mingshl/OpenSearch-Mingshl that referenced this issue Mar 24, 2023