
ArrayIndexOutOfBoundsException on attempt to reconnect #644

Closed
igor-komar opened this issue May 16, 2022 · 19 comments
@igor-komar

Defect

We observed one of our NATS clients in our production environment flooding the log with the following exceptions.

  1. For the first 40 minutes it threw the following exception during publish() calls (message consumption was healthy):
java.lang.IllegalStateException: Output queue is full 5000
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:124)
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:110)
	at io.nats.client.impl.NatsConnectionWriter.queue(NatsConnectionWriter.java:209)
	at io.nats.client.impl.NatsConnection.queueOutgoing(NatsConnection.java:1353)
	at io.nats.client.impl.NatsConnection.publishInternal(NatsConnection.java:765)
	at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:733)
  2. Then neither publishing nor consumption worked for another 40 minutes, until the service was terminated. During this time the log was flooded with the following exception (each time with the same last destination index and byte array length):
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:380)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:583)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:541)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
	at java.base/java.lang.System.arraycopy(Native Method)
	at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:147)
	at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:188)
	... 5 common frames omitted

Our client configuration:

        Options options = new Options.Builder()
                .servers(natsBootstrapServers.addresses)
                .errorListener(errorListener)
                .connectionTimeout(Duration.ofSeconds(5))
                .maxReconnects(-1) // -1 = Infinite
                .connectionName(getHostName())
                .pingInterval(Duration.ofSeconds(2))
                .build();

We don't have a reproduction scenario yet.

Versions of io.nats:jnats and nats-server:

nats server v 2.7.4
jnats v 2.14.1

OS/Container environment:

adoptopenjdk/openjdk11:jdk-11.0.8_10-alpine


@igor-komar

Linux kernel version 5.4.170

scottf commented Jun 8, 2022

@igor-komar You say "message consumption was healthy"; what does this mean? If the output queue is full, it seems the client is having difficulty sending those messages to the server in a timely fashion. How big are your message payloads?

scottf self-assigned this Jun 8, 2022
@igor-komar

@scottf "message consumption was healthy" means that messages receiving was NOT interrupted on the service, only publishing didn't work.
Incoming/outgoing messages have average size of ~1KB. Max size is not metered, I think it's below 50KB.

scottf commented Jun 8, 2022

Is the consumption on the same connection as the production? I'm trying to understand why you noted consumption. I recognize that something bad is happening when publish is stopped because of a full output queue, but I'm really trying to understand how the queue got full in the first place.

@igor-komar

Yes, consumption is on the same connection as the production.

"but am really trying to understand how the queue got full in the first place"

The problem was specific to this affected instance. Other instances of the same service were consuming and publishing
without problems. I guess this might be the result of a race condition.

michelwilson commented Jun 17, 2022

I also encountered this exact issue. I did not manage to capture a stack trace; somehow it got swallowed, not sure why. First, we get the queue-full error:
17-Jun-2022 03:35:39.473 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.io.IOException: java.lang.IllegalStateException: Output queue is full 5000
And right after that
17-Jun-2022 03:35:39.489 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65654 out of bounds for byte[65536]
And this last error seems to be repeated continuously, until I restart the application.

Relevant version numbers:

  • jnats 2.14.0
  • server 2.8.0

Edit: also relevant, of course: the client was publishing to an on-disk JetStream stream with this config:

val config = StreamConfiguration.builder()
            .name(stream)
            .subjects(subjects)
            .retentionPolicy(RetentionPolicy.Limits)
            .maxAge(maxAge)
            .storageType(StorageType.File)
            .replicas(3)
            .build()

Publishing was done using publishAsync.
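
For context, the publishing side looked roughly like the following (a minimal sketch in Java rather than the Kotlin above; the connection URL, subject, and payload are placeholders, not the actual application code):

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.Nats;
import io.nats.client.api.PublishAck;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class PublishSketch {
    public static void main(String[] args) throws Exception {
        Connection nc = Nats.connect("nats://localhost:4222"); // placeholder URL
        try {
            JetStream js = nc.jetStream();

            // async publish returns a future that completes once the server
            // has acknowledged that the message is stored in the stream
            CompletableFuture<PublishAck> ackFuture =
                    js.publishAsync("orders.created", "payload".getBytes(StandardCharsets.UTF_8));

            PublishAck ack = ackFuture.get();
            System.out.println("stored at stream sequence " + ack.getSeqno());
        } finally {
            nc.close();
        }
    }
}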

MauriceVanVeen commented Sep 26, 2022

@scottf, I have been looking into the code and I believe I've found what goes wrong in NatsConnectionWriter.sendMessageBatch (related to the ArrayIndexOutOfBoundsException):

  • The exception gets thrown when the sendBuffer is too small, so the current message doesn't fit.
  • In the logic for resizing the sendBuffer, there is a check whether it's the first item; if so, the buffer is resized. Otherwise the buffer is sent and sendPosition is reset.

However, I believe two things go wrong in the following piece of code from NatsConnectionWriter.sendMessageBatch:

while (msg != null) {
    long size = msg.getSizeInBytes();
    
    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // else send and go to next message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;
            msg = msg.next;
    
            if (msg == null) {
                break;
            }
        }
    }
    
    // ... use System.arraycopy, etc.

    msg = msg.next;
}
  • If msg == null we break out of the loop (which is correct), but if there is a next message, the resize check for the sendBuffer isn't run again. That causes the System.arraycopy(...) to throw, since the buffer can't accommodate the size of the new message. Changing this into a continue instead of the if + break would work; the resize check would then happen for this message and this should be fixed.

Something I find a bit curious, though, is why the msg = msg.next; happens at all. Let me explain my train of thought:

  • There is a msg and we get its size.
  • For this message, sendPosition != 0 and the buffer doesn't have enough space left.
  • We can't resize the buffer currently, since there is still data in there. So we write the buffer and record the stats.
  • The sendPosition is reset to 0, and we can now re-use the buffer.

But why does msg = msg.next happen? I believe we would never send the current message, since it isn't contained in the buffer that was just written.

Using a continue and not overwriting msg with the next message should fix both of these issues. Do you agree, or am I missing something?

while (msg != null) {
    long size = msg.getSizeInBytes();

    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // else send what we have, then re-check this same message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;

            // go back to the start of the loop, to ensure we check for resizing the buffer again
            continue;
        }
    }

    // ... use System.arraycopy, etc. (unchanged)

    msg = msg.next;
}

MauriceVanVeen commented Sep 26, 2022

Created a PR for this in advance 🙂

scottf commented Sep 27, 2022

Ok, I think I see it now. We advance msg to the next message but never send the one that would have been too big for the buffer. It probably doesn't happen much because of how often we run this code and because the default buffer is plenty big, but in cases where the buffer is small or the message is really big, this can happen a lot.
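
To make that concrete, here is a small, self-contained toy (illustrative code only, not the real jnats classes) that mimics the quoted loop with an 8-byte buffer and messages of 6, 10 and 12 bytes: the 10-byte message is silently dropped, and the 12-byte message triggers the same ArrayIndexOutOfBoundsException seen in the stack traces above.

import java.util.Arrays;
import java.util.List;

public class BatchLoopBugDemo {
    public static void main(String[] args) {
        List<byte[]> messages = Arrays.asList(new byte[6], new byte[10], new byte[12]);
        byte[] sendBuffer = new byte[8];
        int sendPosition = 0;

        int i = 0;
        while (i < messages.size()) {
            byte[] msg = messages.get(i);
            int size = msg.length;

            if (sendPosition + size > sendBuffer.length) {
                if (sendPosition == 0) { // have to resize
                    sendBuffer = new byte[(int) Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
                } else { // buggy branch: flush, then skip to the NEXT message
                    System.out.println("flushed " + sendPosition + " bytes");
                    sendPosition = 0;
                    i++; // skips past the message that did not fit; it is never copied (lost)
                    if (i >= messages.size()) {
                        break;
                    }
                    msg = messages.get(i);
                    size = msg.length;
                    // note: the resize condition is NOT re-checked for this new message
                }
            }

            // with the 12-byte message and the still 8-byte buffer, this throws
            // ArrayIndexOutOfBoundsException, just like in the reports above
            System.arraycopy(msg, 0, sendBuffer, sendPosition, size);
            sendPosition += size;
            i++;
        }
    }
}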

@MauriceVanVeen

Yeah, I believe changing this into the continue would solve it, by ensuring the resize check is run again and that the message isn't lost.

scottf commented Sep 29, 2022

@igor-komar @michelwilson There is a fix in the main branch; it should be available in 2.16.1-SNAPSHOT on Sonatype.

@MauriceVanVeen

Small update from my end: I have not been able to reproduce this error reliably (if at all) on our setup. It seems to take a really long time to get into this state, or it's just "luck".
However, we will update our jnats version to 2.16.1-SNAPSHOT soon; the issue should be fixed then.

MauriceVanVeen commented Oct 11, 2022

Another update: we updated to version 2.16.1 yesterday and are sadly still encountering this issue.

11-Oct-2022 18:07:46.413 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.415 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.416 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.417 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.418 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.420 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.420 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.422 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]

We will add the errorListener option so we can inspect the full stack trace.
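
For reference, wiring that in looks roughly like this (a minimal sketch mirroring the Options snippet earlier in this issue; the server URL and listener bodies are placeholders, and the required imports are io.nats.client.Connection, io.nats.client.Consumer, io.nats.client.ErrorListener and io.nats.client.Options):

Options options = new Options.Builder()
        .server("nats://localhost:4222") // placeholder
        .errorListener(new ErrorListener() {
            @Override
            public void errorOccurred(Connection conn, String error) {
                System.err.println("NATS error: " + error);
            }

            @Override
            public void exceptionOccurred(Connection conn, Exception exp) {
                exp.printStackTrace(); // full stack trace instead of only exp.toString()
            }

            @Override
            public void slowConsumerDetected(Connection conn, Consumer consumer) {
                System.err.println("Slow consumer detected");
            }
        })
        .build();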

scottf commented Oct 11, 2022

I wonder if I should just switch to using the ByteArrayBuilder, which automatically expands and uses ByteBuffer under the covers.
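
For illustration, the auto-growing idea could look roughly like this (a hypothetical GrowableBuffer sketch, not the actual io.nats.client.support.ByteArrayBuilder API):

import java.nio.ByteBuffer;

class GrowableBuffer {
    private ByteBuffer buffer;

    GrowableBuffer(int initialCapacity) {
        buffer = ByteBuffer.allocate(initialCapacity);
    }

    void append(byte[] bytes) {
        if (buffer.remaining() < bytes.length) {
            // grow to at least double the capacity, or enough for the new bytes
            int newCapacity = Math.max(buffer.capacity() * 2, buffer.position() + bytes.length);
            ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
            buffer.flip();      // switch to reading what has been written so far
            bigger.put(buffer); // copy it into the larger buffer
            buffer = bigger;
        }
        buffer.put(bytes);
    }

    int length() {
        return buffer.position();
    }

    byte[] toByteArray() {
        byte[] out = new byte[buffer.position()];
        System.arraycopy(buffer.array(), 0, out, 0, buffer.position());
        return out;
    }
}

The appeal is that the writer would no longer have to get the size check exactly right up front; the buffer grows as needed.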

@MauriceVanVeen

@scottf Got a full stack trace this time:

java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
	at java.base/java.util.concurrent.FutureTask.report(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:381)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:584)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:542)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
	at java.base/java.lang.System.arraycopy(Native Method)
	at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:145)
	at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:186)
	... 5 more

However, before this it seems like messages can't be sent at all anymore:

java.util.concurrent.CompletionException: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.nats.client.support.NatsRequestCompletableFuture.cancelTimedOut(NatsRequestCompletableFuture.java:31)
    at io.nats.client.impl.NatsConnection.lambda$cleanResponses$6(NatsConnection.java:957)
    at java.base/java.util.concurrent.ConcurrentHashMap.forEach(Unknown Source)
    at io.nats.client.impl.NatsConnection.cleanResponses(NatsConnection.java:953)
    at io.nats.client.impl.NatsConnection$2.run(NatsConnection.java:455)
    at java.base/java.util.TimerThread.mainLoop(Unknown Source)
    at java.base/java.util.TimerThread.run(Unknown Source)
Caused by: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
... 7 common frames omitted

Not quite sure what the root cause is. It seems like something goes wrong even before the out-of-bounds error.

scottf commented Oct 12, 2022

What can I do to reproduce this?

@MauriceVanVeen

Not sure how to reproduce it yet. Currently trying to dig through the logs to get more clarity.

Looking a bit more in the logs for a timeline:

  • UTC 19:36:57, messages start piling up
  • UTC 19:37:16, first sign of a CancellationException
  • UTC 19:43:11, first sign of an ArrayIndexOutOfBoundsException

@MauriceVanVeen

Our way of "reproducing" this so far has been to run it for a day or two until it eventually gets into this state, but that isn't really a reliable way 😓

scottf commented Jan 20, 2023

@igor-komar @michelwilson This issue has been fixed in PR #820 and released in 2.16.6.
Thanks to @MauriceVanVeen. The core issue was that the lazy message size calculation was not thread-safe; the lazy calculation was replaced with calculation at construction time.
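
For illustration, the difference between the two approaches looks schematically like this (hypothetical classes, not the actual NatsMessage code):

// Lazy: the protocol bytes (and therefore the size) are built on first use.
// Without synchronization, the writer thread can read a size that does not match
// the bytes it later copies into the send buffer.
class LazyMessage {
    private final String subject;
    private final byte[] data;
    private byte[] protocolBytes; // built on demand, not volatile, not final

    LazyMessage(String subject, byte[] data) {
        this.subject = subject;
        this.data = data;
    }

    long getSizeInBytes() {
        if (protocolBytes == null) {
            protocolBytes = ("PUB " + subject + " " + data.length + "\r\n").getBytes();
        }
        return protocolBytes.length + data.length + 2; // + trailing CRLF
    }
}

// Eager: everything the size depends on is computed in the constructor and stored in
// final fields, so every thread that sees the message sees the same, consistent size.
class EagerMessage {
    private final byte[] protocolBytes;
    private final byte[] data;
    private final long sizeInBytes;

    EagerMessage(String subject, byte[] data) {
        this.protocolBytes = ("PUB " + subject + " " + data.length + "\r\n").getBytes();
        this.data = data;
        this.sizeInBytes = protocolBytes.length + data.length + 2;
    }

    long getSizeInBytes() {
        return sizeInBytes;
    }
}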

scottf closed this as completed Jan 20, 2023
bruth removed the 🐞 defect label Aug 18, 2023