
ArrayIndexOutOfBoundsException on attempt to reconnect #644

Closed
igor-komar opened this issue May 16, 2022 · 19 comments
@igor-komar

Defect

We observed one of our NATS clients in our production environment flooding the log with the following exceptions.

  1. For the first 40 minutes it threw the following exception during publish() calls (message consumption was healthy):
java.lang.IllegalStateException: Output queue is full 5000
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:124)
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:110)
	at io.nats.client.impl.NatsConnectionWriter.queue(NatsConnectionWriter.java:209)
	at io.nats.client.impl.NatsConnection.queueOutgoing(NatsConnection.java:1353)
	at io.nats.client.impl.NatsConnection.publishInternal(NatsConnection.java:765)
	at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:733)
  2. Then neither publishing nor consumption worked for another 40 minutes, until the service was terminated. During this time the log was flooded with the following exception (each time with the same last destination index and byte array length):
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:380)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:583)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:541)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
	at java.base/java.lang.System.arraycopy(Native Method)
	at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:147)
	at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:188)
	... 5 common frames omitted

Our client configuration:

        Options options = new Options.Builder()
                .servers(natsBootstrapServers.addresses)
                .errorListener(errorListener)
                .connectionTimeout(Duration.ofSeconds(5))
                .maxReconnects(-1) // -1 = Infinite
                .connectionName(getHostName())
                .pingInterval(Duration.ofSeconds(2))
                .build();

We don't have a reproduction scenario yet.

Versions of io.nats:jnats and nats-server:

nats server v 2.7.4
jnats v 2.14.1

OS/Container environment:

adoptopenjdk/openjdk11:jdk-11.0.8_10-alpine


@igor-komar

Linux kernel version 5.4.170

scottf commented Jun 8, 2022

@igor-komar You say "message consumption was healthy"; what does this mean? If the output queue is full, it seems the client is having difficulty sending those messages to the server in a timely fashion. How big are your message payloads?

scottf self-assigned this Jun 8, 2022
@igor-komar

@scottf "message consumption was healthy" means that messages receiving was NOT interrupted on the service, only publishing didn't work.
Incoming/outgoing messages have average size of ~1KB. Max size is not metered, I think it's below 50KB.

scottf commented Jun 8, 2022

Is the consumption on the same connection as the production? I'm trying to understand why you noted consumption. I recognize that something bad is happening when publish is stopped because of a full output queue, but I'm really trying to understand how the queue got full in the first place.

@igor-komar

Yes, consumption is on the same connection as the production.

"but am really trying to understand how the queue got full in the first place"

The problem was specific to this affected instance. Other instances of the same service were consuming and publishing
without problems. I guess this might be the result of a race condition.

michelwilson commented Jun 17, 2022

I also encountered this exact issue. I did not manage to capture a stack trace; somehow it got swallowed, not sure why. First, we get the queue-full error:
17-Jun-2022 03:35:39.473 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.io.IOException: java.lang.IllegalStateException: Output queue is full 5000
And right after that
17-Jun-2022 03:35:39.489 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65654 out of bounds for byte[65536]
And this last error seems to be repeated continuously, until I restart the application.

Relevant version numbers:

  • jnats 2.14.0
  • server 2.8.0

Edit: also relevant, of course: the client was publishing to an on-disk JetStream stream with this config:

val config = StreamConfiguration.builder()
            .name(stream)
            .subjects(subjects)
            .retentionPolicy(RetentionPolicy.Limits)
            .maxAge(maxAge)
            .storageType(StorageType.File)
            .replicas(3)
            .build()

Publishing was done using publishAsync.
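
For context, the publishing side looked roughly like the following (a minimal sketch in Java rather than the Kotlin above; the connection URL, subject, and payload are placeholders, not the actual application code):

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.Nats;
import io.nats.client.api.PublishAck;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class PublishSketch {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222"); // placeholder URL
        try {
            JetStream js = nc.jetStream();

            // async publish returns a future that completes once the server
            // has acknowledged that the message is stored in the stream
            CompletableFuture<PublishAck> ackFuture =
                    js.publishAsync("orders.created", "payload".getBytes(StandardCharsets.UTF_8));

            PublishAck ack = ackFuture.get();
            System.out.println("stored at stream sequence " + ack.getSeqno());
        } finally {
            nc.close();
        }
    }
}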

MauriceVanVeen commented Sep 26, 2022

@scottf, I have been looking into the code and I believe I've found what goes wrong in NatsConnectionWriter.sendMessageBatch (related to the ArrayIndexOutOfBoundsException):

  • The exception gets thrown when the sendBuffer is too small, so the current message doesn't fit.
  • In the logic for resizing the sendBuffer, there is a check whether it's the first item; if so, the buffer is resized. Otherwise the buffer is sent and sendPosition is reset.

However, I believe two things go wrong in the following piece of code from NatsConnectionWriter.sendMessageBatch:

while (msg != null) {
    long size = msg.getSizeInBytes();
    
    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // else send and go to next message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;
            msg = msg.next;
    
            if (msg == null) {
                break;
            }
        }
    }
    
    // ... use System.arraycopy, etc.

    msg = msg.next;
}
  • If msg == null we break out of the loop (which is correct), but if there is a next message, the resize check for the sendBuffer isn't run again. That causes the System.arraycopy(...) to throw, since the buffer can't accommodate the size of the new message. Changing this into a continue instead of the if + break would work; the resize check would then happen for this message and this should be fixed.

Something I find a bit curious, though, is why the msg = msg.next; happens at all. Let me explain my train of thought:

  • There is a msg and we get its size.
  • For this message, sendPosition != 0 and the buffer doesn't have enough space left.
  • We can't resize the buffer currently, since there is still data in there. So we write the buffer and record the stats.
  • The sendPosition is reset to 0, and we can now re-use the buffer.

But why does msg = msg.next happen? I believe we would never send the current message, since it isn't contained in the buffer that was just written.

Using a continue and not overwriting msg with the next message should fix both of these issues. Do you agree, or am I missing something?

while (msg != null) {
    long size = msg.getSizeInBytes();

    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // else send what we have, then re-check this same message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;

            // go back to the start of the loop, to ensure we check for resizing the buffer again
            continue;
        }
    }

    // ... use System.arraycopy, etc. (unchanged)

    msg = msg.next;
}

MauriceVanVeen commented Sep 26, 2022

Created a PR for this in advance 🙂

scottf commented Sep 27, 2022

Ok, I think I see it now. We advance msg to the next message but never send the one that would have been too big for the buffer. It probably doesn't happen much because of how often we run this code and because the default buffer is plenty big, but in cases where the buffer is small or the message is really big, this can happen a lot.
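
To make that concrete, here is a small, self-contained toy (illustrative code only, not the real jnats classes) that mimics the quoted loop with an 8-byte buffer and messages of 6, 10 and 12 bytes: the 10-byte message is silently dropped, and the 12-byte message triggers the same ArrayIndexOutOfBoundsException seen in the stack traces above.

import java.util.Arrays;
import java.util.List;

public class BatchLoopBugDemo {
    public static void main(String[] args) {
        List<byte[]> messages = Arrays.asList(new byte[6], new byte[10], new byte[12]);
        byte[] sendBuffer = new byte[8];
        int sendPosition = 0;

        int i = 0;
        while (i < messages.size()) {
            byte[] msg = messages.get(i);
            int size = msg.length;

            if (sendPosition + size > sendBuffer.length) {
                if (sendPosition == 0) { // have to resize
                    sendBuffer = new byte[(int) Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
                } else { // buggy branch: flush, then skip to the NEXT message
                    System.out.println("flushed " + sendPosition + " bytes");
                    sendPosition = 0;
                    i++; // skips past the message that did not fit; it is never copied (lost)
                    if (i >= messages.size()) {
                        break;
                    }
                    msg = messages.get(i);
                    size = msg.length;
                    // note: the resize condition is NOT re-checked for this new message
                }
            }

            // with the 12-byte message and the still 8-byte buffer, this throws
            // ArrayIndexOutOfBoundsException, just like in the reports above
            System.arraycopy(msg, 0, sendBuffer, sendPosition, size);
            sendPosition += size;
            i++;
        }
    }
}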

@MauriceVanVeen

Yeah, I believe changing this into the continue would solve it, by ensuring the resize check is run again and that the message isn't lost.

scottf commented Sep 29, 2022

@igor-komar @michelwilson There is a fix in the main branch; it should be available in 2.16.1-SNAPSHOT on Sonatype.

@MauriceVanVeen

Small update from my end: I have not been able to reproduce this error reliably (if at all) on our setup. It seems to take a really long time to get into this state, or it's just "luck".
However, we will update our jnats version to 2.16.1-SNAPSHOT soon; the issue should be fixed then.

MauriceVanVeen commented Oct 11, 2022

Another update: we updated to version 2.16.1 yesterday and are sadly still encountering this issue.

11-Oct-2022 18:07:46.413 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.415 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.416 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.417 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.418 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.420 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.420 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.422 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]

We will add the errorListener option so we can inspect the full stack trace.
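
For reference, wiring that in looks roughly like this (a minimal sketch mirroring the Options snippet earlier in this issue; the server URL and listener bodies are placeholders, and the required imports are io.nats.client.Connection, io.nats.client.Consumer, io.nats.client.ErrorListener and io.nats.client.Options):

Options options = new Options.Builder()
        .server("nats://localhost:4222") // placeholder
        .errorListener(new ErrorListener() {
            @Override
            public void errorOccurred(Connection conn, String error) {
                System.err.println("NATS error: " + error);
            }

            @Override
            public void exceptionOccurred(Connection conn, Exception exp) {
                exp.printStackTrace(); // full stack trace instead of only exp.toString()
            }

            @Override
            public void slowConsumerDetected(Connection conn, Consumer consumer) {
                System.err.println("Slow consumer detected");
            }
        })
        .build();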

scottf commented Oct 11, 2022

I wonder if I should just switch to using the ByteArrayBuilder, which automatically expands and uses ByteBuffer under the covers.
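
For illustration, the auto-growing idea could look roughly like this (a hypothetical GrowableBuffer sketch, not the actual io.nats.client.support.ByteArrayBuilder API):

import java.nio.ByteBuffer;

class GrowableBuffer {
    private ByteBuffer buffer;

    GrowableBuffer(int initialCapacity) {
        buffer = ByteBuffer.allocate(initialCapacity);
    }

    void append(byte[] bytes) {
        if (buffer.remaining() < bytes.length) {
            // grow to at least double the capacity, or enough for the new bytes
            int newCapacity = Math.max(buffer.capacity() * 2, buffer.position() + bytes.length);
            ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
            buffer.flip();      // switch to reading what has been written so far
            bigger.put(buffer); // copy it into the larger buffer
            buffer = bigger;
        }
        buffer.put(bytes);
    }

    int length() {
        return buffer.position();
    }

    byte[] toByteArray() {
        byte[] out = new byte[buffer.position()];
        System.arraycopy(buffer.array(), 0, out, 0, buffer.position());
        return out;
    }
}

The appeal is that the writer would no longer have to get the size check exactly right up front; the buffer grows as needed.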

@MauriceVanVeen

@scottf Got a full stack trace this time:

java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
	at java.base/java.util.concurrent.FutureTask.report(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:381)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:584)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:542)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
	at java.base/java.lang.System.arraycopy(Native Method)
	at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:145)
	at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:186)
	... 5 more

However, before this it seems like messages can't be sent at all anymore:

java.util.concurrent.CompletionException: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.nats.client.support.NatsRequestCompletableFuture.cancelTimedOut(NatsRequestCompletableFuture.java:31)
    at io.nats.client.impl.NatsConnection.lambda$cleanResponses$6(NatsConnection.java:957)
    at java.base/java.util.concurrent.ConcurrentHashMap.forEach(Unknown Source)
    at io.nats.client.impl.NatsConnection.cleanResponses(NatsConnection.java:953)
    at io.nats.client.impl.NatsConnection$2.run(NatsConnection.java:455)
    at java.base/java.util.TimerThread.mainLoop(Unknown Source)
    at java.base/java.util.TimerThread.run(Unknown Source)
Caused by: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
... 7 common frames omitted

Not quite sure what the root cause is. It seems like something goes wrong even before the out-of-bounds error.

scottf commented Oct 12, 2022

What can I do to reproduce this?

@MauriceVanVeen

Not sure how to reproduce it yet. Currently trying to dig through the logs to get more clarity.

Looking a bit more in the logs for a timeline:

  • UTC 19:36:57, messages start piling up
  • UTC 19:37:16, first sign of a CancellationException
  • UTC 19:43:11, first sign of an ArrayIndexOutOfBoundsException

@MauriceVanVeen

Our way of "reproducing" this so far has been to run it for a day or two until it eventually gets into this state, but that isn't really a reliable way 😓

scottf commented Jan 20, 2023

@igor-komar @michelwilson This issue has been fixed in PR #820 and released in 2.16.6.
Thanks to @MauriceVanVeen. The core issue was that the lazy message size calculation was not thread-safe; the lazy calculation was replaced with calculation at construction time.
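
For illustration, the difference between the two approaches looks schematically like this (hypothetical classes, not the actual NatsMessage code):

// Lazy: the protocol bytes (and therefore the size) are built on first use.
// Without synchronization, the writer thread can read a size that does not match
// the bytes it later copies into the send buffer.
class LazyMessage {
    private final String subject;
    private final byte[] data;
    private byte[] protocolBytes; // built on demand, not volatile, not final

    LazyMessage(String subject, byte[] data) {
        this.subject = subject;
        this.data = data;
    }

    long getSizeInBytes() {
        if (protocolBytes == null) {
            protocolBytes = ("PUB " + subject + " " + data.length + "\r\n").getBytes();
        }
        return protocolBytes.length + data.length + 2; // + trailing CRLF
    }
}

// Eager: everything the size depends on is computed in the constructor and stored in
// final fields, so every thread that sees the message sees the same, consistent size.
class EagerMessage {
    private final byte[] protocolBytes;
    private final byte[] data;
    private final long sizeInBytes;

    EagerMessage(String subject, byte[] data) {
        this.protocolBytes = ("PUB " + subject + " " + data.length + "\r\n").getBytes();
        this.data = data;
        this.sizeInBytes = protocolBytes.length + data.length + 2;
    }

    long getSizeInBytes() {
        return sizeInBytes;
    }
}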

scottf closed this as completed Jan 20, 2023
bruth removed the 🐞 defect label Aug 18, 2023