[BUG] Searchable Snapshot: Search hangs when parallel searches to same remote index #6295
Comments
Update: Seems to happen consistently when multiple searches are performed in parallel against the same index or segments. ES 7 also had an issue: elastic/elasticsearch#85239
Thanks @dgilling! Does this mean you can reproduce this without the nested aggregation?
@andrross, yes, it reproduces regardless of the nested aggregation. It's possible the nested aggregation just helped trigger the scenario by lengthening the search time. It usually happens when at least 12 or 15 searches hit the same remote index. Artificially staggering the searches with enough time (a few seconds of delay between each search) helps reduce the occurrence. I also see it regardless of whether Azure Blob Storage or AWS S3 is the backing remote store. I did look into the ES 7 issue and applied a hotfix; it doesn't look to be hitting that exception condition, and I don't see any other exceptions with TRACE logs.
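For concreteness, here is a minimal sketch of the parallel-search pattern described above; the host, index name, and query body are assumptions rather than details from the report:

```python
# Hedged reproduction sketch: fire many concurrent searches at the same
# remote_snapshot-backed index. HOST, INDEX, and the query body are
# assumptions, not details taken from the original report.
from concurrent.futures import ThreadPoolExecutor

import requests

HOST = "http://localhost:9200"      # assumed cluster endpoint
INDEX = "my-remote-snapshot-index"  # hypothetical restored index name

def run_search(i: int) -> int:
    # The comments suggest 12-15 parallel searches against the same
    # remote index are enough to trigger the hang.
    resp = requests.post(
        f"{HOST}/{INDEX}/_search",
        json={"query": {"match_all": {}}},
        timeout=300,  # client-side timeout so this script itself never hangs
    )
    return resp.status_code

with ThreadPoolExecutor(max_workers=16) as pool:
    print(list(pool.map(run_search, range(16))))
```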
Updated the title to reflect the issue.
@andrross I should also note that attempts to cancel the tasks or set a timeout don't seem to clean them up. Only a restart of the nodes with
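For reference, the cancellation attempt described above would typically go through the standard task management API; here is a hedged sketch (the host is an assumption, and per the comment the stuck tasks ignore the cancel request):

```python
# Hedged sketch of the cancellation attempt: list in-flight search tasks
# and ask each to cancel via the task management API. HOST is an
# assumption; per the comment above, the stuck tasks ignore the cancel.
import requests

HOST = "http://localhost:9200"  # assumed cluster endpoint

# List in-flight search tasks across the cluster.
tasks = requests.get(
    f"{HOST}/_tasks", params={"actions": "*search*", "detailed": "true"}
).json()

for node in tasks.get("nodes", {}).values():
    for task_id in node.get("tasks", {}):
        requests.post(f"{HOST}/_tasks/{task_id}/_cancel")  # request cancellation
```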
I was able to reproduce this by creating the
I took a thread dump from the resulting deadlock and it does indeed look like the issue detailed in #6437. I applied the fix from PR #6467 and was not able to reproduce the failure.
@kartg Let's keep this open until we get full verification of the fix
Thanks @andrross for the quick fix. The good news:
The bad news:
Example 1:
Example 2:
@andrross @kartg For some more context: I did some more testing today with the official 2.6.0 image released yesterday.
@andrross @kartg Did some more testing given the discussion about potentially too many file descriptors coming from the remote cache. On a second pass at the crash report, we did see over 200k file descriptors open to remote-store-related files.
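As a rough illustration of how such a descriptor count can be taken on Linux, here is a hedged sketch; the process ID and the path fragment used to identify cache files are assumptions, not values from the report:

```python
# Hedged sketch for counting descriptors open to file-cache files on Linux.
# PID and CACHE_MARKER are assumptions; the actual cache path depends on
# the node's data directory layout.
import os

PID = 12345                       # hypothetical OpenSearch process id
CACHE_MARKER = "remote_snapshot"  # assumed fragment of the cache file path

count = 0
for fd in os.listdir(f"/proc/{PID}/fd"):
    try:
        target = os.readlink(f"/proc/{PID}/fd/{fd}")
    except OSError:
        continue  # descriptor closed while we were iterating
    if CACHE_MARKER in target:
        count += 1
print(f"{count} descriptors open to cache files")
```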
Since I'm not too familiar with the remote cache code, let me know if those are not the right things to look at. While this seemed to help prevent the crash, it uncovered some new exceptions:
Thanks @dgilling! We're clearly missing some limits for cases like yours. You did make the correct changes (if you're able to test with a 2.7 snapshot build, the configuration setting like … applies there as well). We had a similar issue with the clone bug you just pointed out, but it looks like it is not completely fixed. I'll dig into that now.
@dgilling It would also be interesting to see what kind of pressure the file cache is under during your testing, if at all possible. You can get snapshots of the file cache stats by doing:
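The exact command is truncated above; one plausible form, assuming the node stats API exposes a `file_cache` section (the endpoint name here is a guess, not confirmed by the thread), would be:

```python
# Hedged polling sketch: sample node stats while the workload runs. The
# endpoint is an assumption (that node stats expose a file_cache section);
# HOST is also an assumption.
import time

import requests

HOST = "http://localhost:9200"  # assumed cluster endpoint

for _ in range(10):
    stats = requests.get(f"{HOST}/_nodes/stats/file_cache").json()
    for node_id, node in stats.get("nodes", {}).items():
        print(node_id, node.get("file_cache"))
    time.sleep(5)  # sample every few seconds
```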
@andrross sure, let me know when you're able to push a fix for
The previous implementation had an inherent race condition where a zero-reference-count IndexInput read from the cache could be evicted before the IndexInput was cloned (and therefore had its reference count incremented). Since the IndexInputs are stateful, this is very bad. The least-recently-used semantics meant that in a properly configured system this would be unlikely, since accessing a zero-reference-count item would move it to be most recently used and therefore least likely to be evicted. However, there was still a latent bug that was possible to encounter (see issue #6295). The only way to fix this, as far as I can see, is to change the cache behavior so that fetching an item from the cache atomically increments its reference count. This also led to a change to TransferManager to ensure that all requests for an item ultimately read through the cache, to eliminate any possibility of a race. I have implemented some concurrent unit tests that put the cache into a worst-case thrashing scenario to ensure that concurrent access never closes an IndexInput while it is still being used. Signed-off-by: Andrew Ross <andrross@amazon.com>
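To illustrate the core idea of the fix, here is a minimal sketch (not the actual OpenSearch code, which is Java) of a cache whose fetch operation pins the entry under the same lock that eviction takes, so eviction can never race with a clone:

```python
# Minimal sketch of the fix's key idea, not the actual OpenSearch code:
# get() increments the entry's reference count under the same lock that
# eviction takes, so an entry can never be evicted between being fetched
# and being pinned by the caller.
import threading
from collections import OrderedDict

class RefCountedCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lock = threading.Lock()
        self.entries = OrderedDict()  # key -> [value, refcount]

    def get(self, key):
        with self.lock:
            entry = self.entries.get(key)
            if entry is None:
                return None
            entry[1] += 1                  # pin atomically with the lookup
            self.entries.move_to_end(key)  # LRU touch
            return entry[0]

    def release(self, key):
        with self.lock:
            self.entries[key][1] -= 1      # unpin when the caller is done

    def put(self, key, value):
        with self.lock:
            self.entries[key] = [value, 0]
            self.entries.move_to_end(key)
            self._evict()

    def _evict(self):
        # Evict least-recently-used entries, skipping any that are pinned.
        while len(self.entries) > self.capacity:
            for k, (value, refs) in self.entries.items():
                if refs == 0:
                    del self.entries[k]
                    break
            else:
                return  # everything is pinned; nothing can be evicted
```

A caller would pair every get() with a release() once it is done with the item, mirroring how a cloned IndexInput holds a reference until it is closed.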
@kotwanikunal @andrross, looks like that did the trick. No more file descriptor explosion. Much appreciated. Will monitor and report on stats as we get more data.
Describe the bug
When performing an aggregation on a nested field on a searchable snapshot (where `storage_type` is `remote_snapshot`), the search task hangs for days if no timeout is defined (the default behavior). This can block the node from handling future searches once the search thread queue fills up.
To Reproduce
Steps to reproduce the behavior:
1. Restore a snapshot as a searchable snapshot (`storage_type: remote_snapshot`).
2. Run searches (e.g., an aggregation on a nested field) against the restored index, several in parallel.
Expected behavior
The search request should complete, or there should be a reasonable default timeout so that future searches on the node are not deadlocked.
The same query on a local index (not `remote_snapshot`) takes <100ms. We expect the remote queries to take longer, but not to run for two days and keep retrying.
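For illustration, here is a hedged example of attaching an explicit timeout to a search request (the host and index name are assumptions); note that the comments above report even explicit timeouts did not clean up the stuck tasks:

```python
# Hedged example of setting a server-side search timeout in the request
# body. HOST and INDEX are assumptions, not values from the report.
import requests

HOST = "http://localhost:9200"      # assumed cluster endpoint
INDEX = "my-remote-snapshot-index"  # hypothetical restored index name

resp = requests.post(
    f"{HOST}/{INDEX}/_search",
    json={
        "timeout": "30s",  # server-side search timeout (none by default)
        "query": {"match_all": {}},
    },
)
body = resp.json()
print(body.get("timed_out"), body.get("took"))
```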
Plugins
Screenshots
Dump of stuck tasks
Host/Environment:
Additional context
Using Azure Blob Storage as the snapshot repo.