Problem description:
A few days ago, the active master node process of our cluster crashed. The log showed that a disk on one of the data nodes had been damaged. The relevant log output is as follows:
[2021-08-09T09:17:36,992][WARN ][o.e.g.G.InternalPrimaryShardAllocator][node1] [index_2021-08-07][3]: failed to list shard for shard_started on node [Jyd9FdobRayRSGEdQla3Ww]
org.elasticsearch.action.FailedNodeException: Failed node [Jyd9FdobRayRSGEdQla3Ww]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:221) [elasticsearch-7.6.0.jar:7.6.0]
...
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [node1][0.0.0.0:9300][internal:gateway/local/started_shards[n]]
Caused by: org.elasticsearch.ElasticsearchException: failed to load started shards
at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:168) ~[elasticsearch-7.6.0.jar:7.6.0]
...
at java.lang.Thread.run(Thread.java:834) ~[?:?]
Caused by: java.lang.IllegalStateException: environment is not locked
at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1044) ~[elasticsearch-7.6.0.jar:7.6.0]
...
at java.lang.Thread.run(Thread.java:834) ~[?:?]
Caused by: java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.size0(Native Method) ~[?:?]
...
at java.lang.Thread.run(Thread.java:834) ~[?:?]
[2021-08-09T09:17:37,291][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler][node1] fatal error in thread [elasticsearch[node1][masterService#updateTask][T#436]], exiting
java.lang.StackOverflowError: null
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1042) ~[?:?]
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1041) ~[?:?]
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1042) ~[?:?]
...
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1041) ~[?:?]
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1042) ~[?:?]
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1041) ~[?:?]
The code of the JDK's UnmodifiableCollection class is reproduced below.
Calling iterator() on an unmodifiable collection creates an instance of an anonymous class. The constructor of that class calls c.iterator(), where c is the wrapped collection. The stack trace, however, shows that c is itself an unmodifiable collection.
That suggests a plausible explanation:
If the application wraps unmodifiable collections in unmodifiable collections N levels deep, creating an iterator produces about N * 2 stack frames. For large enough N, that leads to a stack overflow.
static class UnmodifiableCollection<E> implements Collection<E>, Serializable {
    final Collection<? extends E> c;

    UnmodifiableCollection(Collection<? extends E> c) {
        if (c==null)
            throw new NullPointerException();
        this.c = c;
    }
    ...
    public Iterator<E> iterator() {
        return new Iterator<E>() {
            // Initialized in the anonymous class constructor; recurses if c is itself a wrapper
            private final Iterator<? extends E> i = c.iterator();

            public boolean hasNext() {return i.hasNext();}
            public E next() {return i.next();}
            public void remove() {
                throw new UnsupportedOperationException();
            }
            @Override
            public void forEachRemaining(Consumer<? super E> action) {
                // Use backing collection version
                i.forEachRemaining(action);
            }
        };
    }
}
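The effect of nested wrappers can be reproduced in isolation. The following is a minimal sketch, not Elasticsearch code: `ReadOnlyView`, `wrap` and `overflowsOnIterate` are hypothetical names, and `ReadOnlyView` is a stand-in that copies the delegation pattern of the JDK class above so that the result does not depend on how a particular JDK version implements `Collections.unmodifiableCollection`.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class NestedWrapperDemo {

    /** Hypothetical stand-in for Collections$UnmodifiableCollection: a read-only delegating view. */
    static final class ReadOnlyView<E> extends AbstractCollection<E> {
        private final Collection<? extends E> c;

        ReadOnlyView(Collection<? extends E> c) {
            this.c = c;
        }

        @Override
        public int size() {
            return c.size();
        }

        @Override
        public Iterator<E> iterator() {
            return new Iterator<E>() {
                // Runs inside the anonymous class constructor, so each wrapper
                // level costs two frames: iterator() plus <init>.
                private final Iterator<? extends E> i = c.iterator();

                @Override
                public boolean hasNext() {
                    return i.hasNext();
                }

                @Override
                public E next() {
                    return i.next();
                }
            };
        }
    }

    /** Wraps base in the given number of read-only layers. */
    static <E> Collection<E> wrap(Collection<E> base, int levels) {
        Collection<E> current = base;
        for (int n = 0; n < levels; n++) {
            current = new ReadOnlyView<>(current);
        }
        return current;
    }

    /** Returns true if merely creating an iterator exhausts the thread stack. */
    static boolean overflowsOnIterate(Collection<?> c) {
        try {
            c.iterator();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>(Arrays.asList("node-a"));
        // A handful of layers is harmless.
        System.out.println(overflowsOnIterate(wrap(ids, 10)));
        // A million layers need about two million frames, far beyond a default thread stack.
        System.out.println(overflowsOnIterate(wrap(ids, 1_000_000)));
    }
}
```

The element count is irrelevant here: the set holds a single entry, and only the number of wrapper layers determines the stack depth.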
Cause of problem:
Use Arthas to observe the calling path of the iterator method. The command is as follows:
The output below appears only if the command above is run with Arthas immediately after the process has started.
ts=2021-08-13 10:10:53;thread_name=elasticsearch[node1][masterService#updateTask][T#1];id=22;is_daemon=true;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@277050dc
@java.util.Collections$UnmodifiableCollection.iterator()
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
...
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
at org.elasticsearch.common.io.stream.StreamOutput.writeCollection(StreamOutput.java:1112)
at org.elasticsearch.cluster.routing.UnassignedInfo.writeTo(UnassignedInfo.java:297)
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:897)
at org.elasticsearch.cluster.routing.ShardRouting.writeToThin(ShardRouting.java:299)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable$Builder.writeToThin(IndexShardRoutingTable.java:742)
at org.elasticsearch.cluster.routing.IndexRoutingTable.writeTo(IndexRoutingTable.java:321)
at org.elasticsearch.cluster.AbstractDiffable$CompleteDiff.writeTo(AbstractDiffable.java:81)
at org.elasticsearch.cluster.DiffableUtils$DiffableValueSerializer.writeDiff(DiffableUtils.java:647)
...
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:253)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633)
Reading the source code explains the trace. Because the data node's disk is damaged, BaseGatewayShardAllocator#makeAllocationDecision() on the active master node returns AllocateUnassignedDecision.no(...), and the shard is handed to removeAndIgnore(). When the reason for the shard being unassigned is updated, UnassignedInfo#getFailedNodeIds() is called and returns an unmodifiable set. The UnassignedInfo constructor then wraps that set in another unmodifiable layer, so each pass adds one more wrapper around failedNodeIds.
--GatewayAllocator
------GatewayAllocator#allocateUnassigned()
----------GatewayAllocator#innerAllocatedUnassigned()
--------------BaseGatewayShardAllocator#allocateUnassigned()
------------------UnassignedIterator#removeAndIgnore()
----------------------UnassignedShards#ignoreShard() #currInfo.getFailedNodeIds() will get an immutable collection
--------------------------new UnassignedInfo() #this.failedNodeIds = Collections.unmodifiableSet(failedNodeIds)
Therefore, if the disk stays damaged and the node does not recover for a while, the wrappers keep accumulating until iterating the set overflows the stack.
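The accumulation described above can be modelled with a small sketch. Everything in it is hypothetical and only illustrates the pattern: `Info` stands in for UnassignedInfo and models nothing but the failed-node set, `View` is a toy read-only wrapper that records how many wrappers are stacked beneath it, and the `copyFirst` flag shows one possible way to keep the depth constant (copying the elements before wrapping). It is not claimed to be the change that was actually made in Elasticsearch.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class RewrapSketch {

    /** Hypothetical read-only set view that records how many views are stacked, itself included. */
    static final class View<E> extends AbstractSet<E> {
        final Set<E> inner;
        final int depth;

        View(Set<E> inner) {
            this.inner = inner;
            this.depth = inner instanceof View ? ((View<?>) inner).depth + 1 : 1;
        }

        @Override
        public int size() {
            return inner.size();
        }

        @Override
        public Iterator<E> iterator() {
            Iterator<E> i = inner.iterator();
            // Iterator.remove() keeps its default behaviour and throws UnsupportedOperationException.
            return new Iterator<E>() {
                @Override
                public boolean hasNext() {
                    return i.hasNext();
                }

                @Override
                public E next() {
                    return i.next();
                }
            };
        }
    }

    /** Hypothetical stand-in for UnassignedInfo; only the failed-node set is modelled. */
    static final class Info {
        private final View<String> failedNodeIds;

        Info(Set<String> failedNodeIds, boolean copyFirst) {
            // copyFirst = false mirrors the pattern from the call chain: wrap whatever was passed in.
            // copyFirst = true copies the elements first, so old wrappers are dropped.
            Set<String> source = copyFirst ? new HashSet<>(failedNodeIds) : failedNodeIds;
            this.failedNodeIds = new View<>(source);
        }

        Set<String> getFailedNodeIds() {
            return failedNodeIds;
        }

        int wrapperDepth() {
            return failedNodeIds.depth;
        }
    }

    /** Simulates repeated rounds in which a new Info is built from the previous one's set. */
    static int depthAfterRounds(int rounds, boolean copyFirst) {
        Set<String> seed = new HashSet<>();
        seed.add("Jyd9FdobRayRSGEdQla3Ww");
        Info info = new Info(seed, copyFirst);
        for (int r = 1; r < rounds; r++) {
            info = new Info(info.getFailedNodeIds(), copyFirst);
        }
        return info.wrapperDepth();
    }

    public static void main(String[] args) {
        System.out.println(depthAfterRounds(1000, false)); // one layer per round: 1000
        System.out.println(depthAfterRounds(1000, true));  // depth stays at 1
    }
}
```

Copying costs one pass over the set per round, which is cheap for a handful of node IDs, and it keeps the iterator depth independent of how long the shard stays unassigned.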
We kept wrapping the collection over and over again which in extreme corner cases could lead to a SOE.
Closes #76490
Co-authored-by: hanbj <hanbj0707@163.com>
#76480