
Some of the queues are closed #669

Closed
zzl221000 opened this issue Apr 28, 2024 · 19 comments · Fixed by #671 or #864

@zzl221000

lavinmq version: 1.2.5

The status of some queues in the service has become closed.
I couldn't find any documentation about queues being closed. Is there a way to change them back to a running state?

@viktorerlingsson
Member

Hi @zzl221000,

Could you provide some more information about when this happens?

What does the log say?

@zzl221000
Author

What does the log say?

2024-04-26 10:12:31 +00:00 [ERROR] queue[vhost=/ name=icp_inc_queue] - Queue closed due to error
2024-04-26 10:12:31 +00:00 [ERROR] client[vhost=/ address=10.42.0.160:42596] - Unexpected error, while reading: Channel is closed (LavinMQ::Queue::MessageStore::ClosedError)
  from /usr/share/crystal/src/hash.cr:1080:49 in 'delete'
  from /usr/share/crystal/src/mutex.cr:136:7 in 'delete_message'
  from /tmp/src/lavinmq/queue/queue.cr:750:7 in 'ack'
  from /tmp/src/lavinmq/client/channel.cr:483:9 in '??'
  from /tmp/src/lavinmq/client/channel.cr:470:11 in 'basic_ack'
  from /tmp/src/lavinmq/client/client.cr:367:31 in 'process_frame'
  from /tmp/src/lavinmq/client/client.cr:125:13 in 'read_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???
2024-04-26 10:12:31 +00:00 [INFO] client[vhost=/ address=10.42.0.160:42596] - Closing, INTERNAL_ERROR - Unexpected error, please report

Something seems to have gone wrong.
The service was restarted once before the queue was closed.
@viktorerlingsson

@viktorerlingsson
Member

Thanks!
Could you provide any more information about how to reproduce the issue? Like what is your client trying to do when this happens?

For getting the queues back up I would suggest trying to restart LavinMQ.

@viktorerlingsson
Member

Could you try upgrading LavinMQ to the latest version (v1.2.11)? There have been some bug fixes since v1.2.5.

@zzl221000
Author

I restarted the service while many clients were consuming (about 200 connections), and a day later some of the queues appeared to be closed.
I checked the consumer code; everything uses the Python aio-pika library with manual acks.
Since there are tons of services using LavinMQ, restarting it means a lot of checking. So far, after deleting the broken queue and recreating it, everything is fine.
If there are no breaking changes, I'll try to upgrade.

@viktorerlingsson
Member

Okay, thanks for the info, and great that you could get the queues up again!
There should have been no breaking changes since 1.2.5, mostly bug fixes.

We'll try to look more into this during the coming week.

@zzl221000
Author

@viktorerlingsson
One piece of information I missed: the queues hold between 500,000 and 50 million messages.

@zzl221000
Author

@viktorerlingsson
Here is a newly closed queue:

path=/var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8/c8f5860bf52ffee2311cd7fbeae06af8a6514ae7/msgs.0000000406 pos=8388606 size=8388608 (LavinMQ::Queue::MessageStore::Error)
  from /tmp/src/lavinmq/queue/message_store.cr:377 in 'shift?'
  from /tmp/src/lavinmq/queue/queue.cr:674:45 in 'deliver_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???
Caused by: Index out of bounds (IndexError)
  from /tmp/lib/amq-protocol/src/amq/protocol/properties.cr:61:65 in 'from_bytes'
  from /tmp/src/lavinmq/queue/message_store.cr:125:17 in 'shift?'
  from /tmp/src/lavinmq/queue/queue.cr:674:45 in 'deliver_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???
2024-04-28T14:41:27.392404Z ERROR client[vhost=/ address=10.42.0.163:44330] Unexpected error, while reading: Channel is closed (LavinMQ::Queue::MessageStore::ClosedError)
  from /usr/share/crystal/src/hash.cr:1080:49 in 'delete'
  from /usr/share/crystal/src/mutex.cr:136:7 in 'delete_message'
  from /tmp/src/lavinmq/queue/queue.cr:750:7 in 'ack'
  from /tmp/src/lavinmq/client/channel.cr:483:9 in '??'
  from /tmp/src/lavinmq/client/channel.cr:470:11 in 'basic_ack'
  from /tmp/src/lavinmq/client/client.cr:367:31 in 'process_frame'
  from /tmp/src/lavinmq/client/client.cr:125:13 in 'read_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

@viktorerlingsson
Member

Thanks.

Could you please provide some more of the log just before this happened?

How's the resource use on your server? Is it running low on disk or RAM? We've seen similar errors when disk space has run out.

@zzl221000
Author

@viktorerlingsson
Okay, here's the log context around the exception.
There doesn't seem to be anything else about the failing queue.

More logs:
2024-04-28T12:42:04.984934Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T12:51:57.984748Z  INFO MessageStore Deleting unused segment 733
2024-04-28T12:52:05.140706Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:02:05.313183Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:03:05.469220Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:04:05.578381Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:05:05.719818Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:06:05.786268Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:07:05.881215Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:08:05.941260Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:09:06.084588Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:10:06.436639Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T13:12:05.556617Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:22:05.714684Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:32:05.874022Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:42:06.035609Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:52:06.193216Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T13:55:34.880891Z  INFO MessageStore Deleting unused segment 8394
2024-04-28T14:02:06.350203Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T14:03:06.402186Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T14:04:06.695808Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T14:12:06.641586Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T14:22:06.797711Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T14:32:07.054519Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T14:41:27.388255Z ERROR queue[vhost=/ name=company_siku_renyuan_yuanshi] Queue closed due to error
path=/var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8/c8f5860bf52ffee2311cd7fbeae06af8a6514ae7/msgs.0000000406 pos=8388606 size=8388608 (LavinMQ::Queue::MessageStore::Error)
  from /tmp/src/lavinmq/queue/message_store.cr:377 in 'shift?'
  from /tmp/src/lavinmq/queue/queue.cr:674:45 in 'deliver_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???
Caused by: Index out of bounds (IndexError)
  from /tmp/lib/amq-protocol/src/amq/protocol/properties.cr:61:65 in 'from_bytes'
  from /tmp/src/lavinmq/queue/message_store.cr:125:17 in 'shift?'
  from /tmp/src/lavinmq/queue/queue.cr:674:45 in 'deliver_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

2024-04-28T14:41:27.392404Z ERROR client[vhost=/ address=10.42.0.163:44330] Unexpected error, while reading: Channel is closed (LavinMQ::Queue::MessageStore::ClosedError)
  from /usr/share/crystal/src/hash.cr:1080:49 in 'delete'
  from /usr/share/crystal/src/mutex.cr:136:7 in 'delete_message'
  from /tmp/src/lavinmq/queue/queue.cr:750:7 in 'ack'
  from /tmp/src/lavinmq/client/channel.cr:483:9 in '??'
  from /tmp/src/lavinmq/client/channel.cr:470:11 in 'basic_ack'
  from /tmp/src/lavinmq/client/client.cr:367:31 in 'process_frame'
  from /tmp/src/lavinmq/client/client.cr:125:13 in 'read_loop'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

2024-04-28T14:41:27.392406Z  INFO client[vhost=/ address=10.42.0.163:44330] Closing, INTERNAL_ERROR - Unexpected error, please report
2024-04-28T14:41:27.392428Z  INFO client[vhost=/ address=10.42.0.163:44330] Connection=10.42.0.163:44330 -> 10.42.6.243:5672 disconnected
2024-04-28T14:41:27.392985Z  INFO client[vhost=/ address=10.42.0.163:44330] Connection disconnected for user=whc
2024-04-28T14:41:27.397692Z  INFO client[vhost=/ address=10.42.0.163:45044][address: "10.42.0.163:45044"] Connection established for user=whc
2024-04-28T14:41:28.393775Z  INFO client[vhost=/ address=10.42.0.163:45044] Client disconnected: normally closed
2024-04-28T14:41:28.394016Z  INFO client[vhost=/ address=10.42.0.163:45044] Connection disconnected for user=whc
2024-04-28T14:41:32.294739Z  INFO client[vhost=/ address=10.42.0.163:48272][address: "10.42.0.163:48272"] Connection established for user=whc
2024-04-28T14:42:07.212482Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T14:43:07.263203Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T14:44:07.567906Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages
2024-04-28T14:52:07.410869Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T14:54:40.856085Z  INFO MessageStore Deleting unused segment 10284
2024-04-28T15:02:07.567198Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T15:12:07.727362Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T15:22:07.896796Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T15:32:08.152610Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T15:42:08.312356Z  INFO queue[vhost=/ name=ttl_mq_count_check] Expired 1 messages
2024-04-28T15:43:08.456344Z  INFO queue[vhost=/ name=data_dispatcher_ttl] Expired 1 messages

I can assure you that resources are adequate.
It's a machine with a 4-core CPU, 32 GB of memory, and a 500 GB SSD, running just this one service.

@viktorerlingsson
Member

viktorerlingsson commented Apr 29, 2024

Thanks!

For some reason LavinMQ is trying to read a message at position 8388606 in a file (msgs.0000000406) of size 8388608. There's only room for 2 more bytes in that file, so it obviously does not contain any more messages. Not sure yet if this is because the file has not been properly truncated or if LavinMQ has failed to switch to a new file.
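
The failure mode can be sketched as follows (a minimal Python illustration using a hypothetical length-prefixed record format; LavinMQ's actual on-disk layout differs): with only 2 bytes remaining at the read position, there is not even room for a record header, so the read raises an IndexError.

```python
SEGMENT_SIZE = 8 * 1024 * 1024  # 8388608 bytes, matching the log


def read_record(segment: bytes, pos: int) -> tuple[bytes, int]:
    """Read one length-prefixed record at pos.

    Raises IndexError when fewer bytes remain than a full record needs
    (hypothetical format, for illustration only).
    """
    header = segment[pos:pos + 4]
    if len(header) < 4:
        # At pos=8388606 only 2 bytes remain: even the length field
        # cannot be read, mirroring the "Index out of bounds" in the log.
        raise IndexError("Index out of bounds")
    length = int.from_bytes(header, "big")
    body = segment[pos + 4:pos + 4 + length]
    if len(body) < length:
        raise IndexError("Index out of bounds")
    return body, pos + 4 + length


segment = bytes(SEGMENT_SIZE)
try:
    read_record(segment, 8_388_606)  # 2 bytes left, as in the error
except IndexError as err:
    print("read failed:", err)  # prints "read failed: Index out of bounds"
```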

Could you please check if msgs.0000000406 is the last msgs file in /var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8/c8f5860bf52ffee2311cd7fbeae06af8a6514ae7/ ?
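
One quick way to check (a sketch using only the Python standard library; the directory is the one from the log above, so adjust it for other installations): since segment names are zero-padded, a lexical sort equals a numeric sort, and the last entry is the newest segment.

```python
from pathlib import Path

# Queue data directory from the log above; adjust to your installation.
qdir = Path("/var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8"
            "/c8f5860bf52ffee2311cd7fbeae06af8a6514ae7")

# Zero-padded suffixes make lexical order equal numeric order,
# so the last sorted entry is the highest-numbered segment file.
segments = sorted(qdir.glob("msgs.*"))
print(segments[-1].name if segments else "no msgs segments found")
```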

@zzl221000
Author

Could you please check if msgs.0000000406 is the last msgs file in /var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8/c8f5860bf52ffee2311cd7fbeae06af8a6514ae7/ ?

No problem, but this will take a few days.
The queue has already been recreated so that the production service can keep working.

@viktorerlingsson viktorerlingsson self-assigned this May 7, 2024
@kickster97 kickster97 changed the title Some of the queues are closed. Some of the queues are closed May 13, 2024
@zzl221000
Author

@viktorerlingsson
I'm back. After a long wait, I finally ran into an abnormally closed queue again.

path=/var/lib/lavinmq/42099b4af021e53fd8fd4e056c2568d7c2e3ffa8/980779e0c3bcf9eb0729de1bd96672272550be4e/msgs.0000000861 pos=8388602 size=8388608 (LavinMQ::Queue::MessageStore::Error)

Based on the log above, I investigated the corresponding folder.
Apparently, this is the first message file:

drwxr-xr-x   2 root root    4096 May  9 09:47 .
drwxr-xr-x 130 root root   12288 May 12 19:09 ..
-rw-r--r--   1 root root      16 May  8 15:03 .queue
-rw-r--r--   1 root root   95380 May  9 11:39 acks.0000000861
-rw-r--r--   1 root root       0 May  9 11:39 acks.0000000862
-rw-r--r--   1 root root       0 May  9 11:39 acks.0000000863
-rw-r--r--   1 root root       0 May  9 11:39 acks.0000000864
-rw-r--r--   1 root root       0 May  9 11:39 acks.0000000865

...

-rw-r--r--   1 root root       0 May  9 11:39 acks.0000000898
-rw-r--r--   1 root root 8388608 Apr 24 08:52 msgs.0000000861
-rw-r--r--   1 root root 8388608 Apr 24 08:52 msgs.0000000862
-rw-r--r--   1 root root 8388608 Apr 24 08:52 msgs.0000000863
-rw-r--r--   1 root root 8388608 Apr 24 08:52 msgs.0000000864

@viktorerlingsson
Member

@zzl221000

Thanks! I see that the message file was created on Apr 24th, before you opened this issue. Did your LavinMQ instance crash or restart around that time?
We're having a hard time reproducing this error without forcefully shutting down or crashing LavinMQ.

We are working on a fix though.

@zzl221000
Author

We have a Kubernetes cluster with expired node certificates, which left the master unable to communicate with the nodes. The services on all of those nodes were restarted when the certificates were refreshed.

@carlhoerberg
Member

carlhoerberg commented May 16, 2024 via email

@zzl221000
Author

You can just send the HUP signal to load new certificates.

The extended certificate expiration led to the eviction of all services on the node. After a prolonged eviction, the services were forcibly terminated once the node certificates were refreshed.

@viktorerlingsson
Member

This can also happen in `first?`:

Dec 03 09:34:19 lavinmq path=/var/lib/lavinmq/1a73969e9b97dc5c2cef7fd3ab81320246dd528a/f6ddffefd2fce94583467753c480f7de7d438da5/msgs.0000000003 pos=8388601 size=8388608 (LavinMQ::Queue::MessageStore::Error)
Dec 03 09:34:19 lavinmq   from usr/src/lavinmq_2.0.2/src/lavinmq/amqp/queue/message_store.cr:233 in 'first?'
Dec 03 09:34:19 lavinmq   from usr/src/lavinmq_2.0.2/src/lavinmq/amqp/queue/queue.cr:481:43 in 'message_expire_loop'
Dec 03 09:34:19 lavinmq   from usr/share/crystal/src/fiber.cr:143:11 in 'run'
Dec 03 09:34:19 lavinmq   from ???
Dec 03 09:34:19 lavinmq Caused by: Index out of bounds (IndexError)
Dec 03 09:34:19 lavinmq   from usr/src/lavinmq_2.0.2/lib/amq-protocol/src/amq/protocol/properties.cr:61:65 in 'from_bytes'
Dec 03 09:34:19 lavinmq   from usr/src/lavinmq_2.0.2/src/lavinmq/amqp/queue/message_store.cr:90:17 in 'first?'
Dec 03 09:34:19 lavinmq   from usr/src/lavinmq_2.0.2/src/lavinmq/amqp/queue/queue.cr:481:43 in 'message_expire_loop'
Dec 03 09:34:19 lavinmq   from usr/share/crystal/src/fiber.cr:143:11 in 'run'
Dec 03 09:34:19 lavinmq   from ???

@zzl221000
Author

This can also happen in first?

In my instance, it has never occurred.

v1.2.5
Uptime 221d, 16h, 12m

viktorerlingsson added a commit that referenced this issue Dec 4, 2024
If a msg file is a few bytes larger than expected, first? will throw an IndexError and close the queue. This changes that behavior to instead rescue that IndexError and try to move on to the next segment file. Same fix as for shift? in #671

fixes #669 (again 🙂)
kickster97 pushed a commit that referenced this issue Dec 16, 2024
If a msg file is a few bytes larger than expected, first? will throw an IndexError and close the queue. This changes that behavior to instead rescue that IndexError and try to move on to the next segment file. Same fix as for shift? in #671

fixes #669 (again 🙂)
kickster97 pushed a commit that referenced this issue Dec 18, 2024
If a msg file is a few bytes larger than expected, first? will throw an IndexError and close the queue. This changes that behavior to instead rescue that IndexError and try to move on to the next segment file. Same fix as for shift? in #671

fixes #669 (again 🙂)
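
The behavior change described in these commits can be sketched in Python (names are illustrative; the real fix is in LavinMQ's Crystal code): instead of letting the IndexError close the queue, rescue it and advance to the next segment file.

```python
def shift_or_first(segments, read_record):
    """Return the first readable message across segment files.

    An IndexError from a segment with a truncated or overlong tail is
    rescued, and reading continues with the next segment instead of
    closing the queue (illustrative sketch, not LavinMQ code).
    """
    for segment in segments:
        try:
            return read_record(segment)
        except IndexError:
            continue  # skip the unreadable tail bytes, try the next segment
    return None  # no readable message left in any segment


def demo_reader(seg):
    # Hypothetical reader: "bad" stands in for a segment whose tail
    # bytes cannot be parsed as message properties.
    if seg == "bad":
        raise IndexError("Index out of bounds")
    return f"msg-from-{seg}"


print(shift_or_first(["bad", "good"], demo_reader))  # prints "msg-from-good"
```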