
Possible race condition when reading retry file during housekeeping intervals #997

Closed
andreleblanc11 opened this issue Mar 24, 2024 · 9 comments
Labels: bug (Something isn't working), Priority 2 - Critical (High Criticality or High Visibility Problem), v3only (Only affects v3 branches)

Comments

@andreleblanc11
Member

I saw over the weekend that the AM server crashed due to the following error.

Traceback (most recent call last):
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/flow/__init__.py", line 586, in run
    self._runCallbackMetrics()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/flow/__init__.py", line 319, in _runCallbackMetrics
    self.metrics[module_name] = p()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/flowcb/retry.py", line 166, in metricsReport
    return {'msgs_in_download_retry': len(self.download_retry), 'msgs_in_post_retry': len(self.post_retry)}
ValueError: __len__() should return >= 0

What I found in the logs is that the retry file gets opened during a pass of the flow algorithm by the process that is about to die.

# Process 1
2024-03-24 00:57:12,029 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /home/sarra/.cache/sr3/flow/amserver/diskqueue_post_retry_001 open read
2024-03-24 00:57:12,030 [INFO] sarracenia.flowcb.log after_work flowed ok: /datalocal/am/bulletins/20240324/IS/CWVH/01/ISAB64_CWVH_240100___00001

And just before that, another process also opens the retry file for reading:

# Process 2
2024-03-24 00:57:11,470 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /home/sarra/.cache/sr3/flow/amserver/diskqueue_post_retry_001.new open read

What I'm thinking is that there is some kind of race condition where the process that dies accesses the retry file just after the first one, which causes the crash.

I think this is normally avoided with multiple instances by designating a retry file for each instance number. But with the AM server we fork from the same process, so as a consequence they all use the same retry file.

FYI, this only crashed the affected instance and it was able to recover one second later. However, we got unlucky and lost 7/8 bulletins due to that crash, as there were lots of files being transferred on the socket at the time.
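
For illustration, here is a minimal, hypothetical sketch (not the actual sarracenia DiskQueue code) of how two processes consuming the same retry file, each with its own cached message counter, can drive that counter negative and trigger the ValueError: __len__() should return >= 0 seen in the traceback above:

import os

class SharedDiskQueue:
    """Toy stand-in for an on-disk retry queue shared by two processes."""

    def __init__(self, path):
        self.path = path
        # Cached count, taken from whatever is on disk when this process starts.
        self.msg_count = self._count_on_disk()

    def _count_on_disk(self):
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            return sum(1 for _ in f)

    def get(self):
        line = None
        if os.path.exists(self.path):
            with open(self.path, 'r+') as f:
                lines = f.readlines()
                if lines:
                    line = lines[0]
                    f.seek(0)
                    f.writelines(lines[1:])
                    f.truncate()
        # The flaw being illustrated: the cached counter is decremented even when
        # another process has already drained the file, so with two processes
        # sharing one retry file it can fall below zero.
        self.msg_count -= 1
        return line

    def __len__(self):
        return self.msg_count   # len() raises ValueError once this goes negative

With per-instance retry files, only one process ever touches a given file, so the counter and the file stay in step.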

@andreleblanc11 added the bug (Something isn't working), sundew-migration, and v3only (Only affects v3 branches) labels and removed the sundew-migration label on Mar 24, 2024
@andreleblanc11 changed the title from "Race condition when reading retry file during housekeeping intervals" to "Possible race condition when reading retry file during housekeeping intervals" on Mar 24, 2024
@petersilva
Contributor

When we get a connection, the instance number is supposed to change... so the retry file would be different. It's something to do with ensuring the instance number is properly propagated on startup.

@petersilva
Contributor

But hang on... retrying a download on an AM message doesn't make any sense. We should likely turn off retry (remove the callback) for the AM receiver. The only reason to retry a download is because you lost the connection half-way... there is no recovering.
AM protocol is just a vanilla socket, there are no retries... so part of the bug here is that the retry file is created at all.

@andreleblanc11
Member Author

Retry does work, because we're downloading from the in-line contents of the sarracenia message that the AM sr3 plugin creates. I've already seen AM create a sarracenia message and then retry it for download.

While it's true that the AM socket won't resend the file contents, the sarracenia message saves them and can be appended to the retry queue if ever the download fails.
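
A rough sketch of the idea: the notification message itself carries the bulletin payload, so the "download" (and therefore a retry) only needs the message, not the original AM socket. The field names below follow the sr3 inline-content convention as I recall it, and the payload value is invented, so treat the whole thing as illustrative.

# Hypothetical sarracenia message with the bulletin body inlined by the AM plugin.
msg = {
    "pubTime": "20240324T005712.029",
    "baseUrl": "file:",
    "relPath": "/datalocal/am/bulletins/20240324/IS/CWVH/01/ISAB64_CWVH_240100___00001",
    "content": {
        "encoding": "utf-8",
        "value": "ISAB64 CWVH 240100 ...",   # invented payload
    },
}

def download_from_inline(msg, dest):
    # Writing out the inlined value is all the "download" amounts to, which is
    # why re-queueing the message in the retry queue is enough to try again later.
    with open(dest, "w") as f:
        f.write(msg["content"]["value"])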

@petersilva
Contributor

I get that it "works" but it's not really a retry... I mean you are "downloading" what you already have in the message. I don't understand how a first "download" could fail, and a retry would succeed, oh... I guess disk full might do it. Do you know why the initial downloads failed, triggering the retry logic?

@andreleblanc11
Member Author

When there's a retry with sr3's AM plugin, it's always caused by the same error. I had already noticed it in the past but ignored it, as I thought retry was doing a good enough job of resending the files as is.

2024-03-18 12:41:08,816 [DEBUG] sarracenia.moth.amqp putNewMessage Exception details:
Traceback (most recent call last):
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/moth/amqp.py", line 743, in putNewMessage
    self.channel.basic_publish(AMQP_Message, exchange, topic, timeout=pub_timeout)
  File "/usr/lib/python3/dist-packages/amqp/channel.py", line 1779, in _basic_publish
    self.connection.drain_events(timeout=0)
  File "/usr/lib/python3/dist-packages/amqp/connection.py", line 522, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/lib/python3/dist-packages/amqp/connection.py", line 527, in blocking_read
    frame = self.transport.read_frame()
  File "/usr/lib/python3/dist-packages/amqp/transport.py", line 310, in read_frame
    frame_header = read(7, True)
  File "/usr/lib/python3/dist-packages/amqp/transport.py", line 594, in _read
    raise OSError('Server unexpectedly closed connection')
OSError: Server unexpectedly closed connection

The error always happens at the beginning of the forked process, when the first message is created and we try to publish it downstream. I think it fails because the AMQP connection is the same one as the first instance's and crashes as a result... not sure.

Once a new batch of messages appears, sarracenia does the right thing and re-establishes the connection downstream:

2024-03-18 12:42:07,856 [INFO] sarracenia.flowcb.log after_work flowed ok: /datalocal/am/bulletins/20240318/FX/CYCX/09/FXCN48_CYCX_180930___00001
2024-03-18 12:42:07,878 [DEBUG] amqp _on_start Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@downstream-node.cmc.ec.gc.ca', 'copyright': 'Copyright (C) 2007-2019 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP 21.3', 'product': 'RabbitMQ', 'version': '3.7.13'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
2024-03-18 12:42:07,879 [DEBUG] amqp __init__ using channel_id: 1
2024-03-18 12:42:07,880 [DEBUG] amqp _on_open_ok Channel open
2024-03-18 12:42:07,880 [DEBUG] amqp __init__ using channel_id: 2
2024-03-18 12:42:07,881 [DEBUG] amqp _on_open_ok Channel open
2024-03-18 12:42:07,881 [DEBUG] sarracenia.moth.amqp putSetup putSetup ... 1. declaring ['xs_MSC-CA']
2024-03-18 12:42:07,881 [INFO] sarracenia.moth.amqp putSetup exchange declared: xs_MSC-CA (as: amqps://MSC-CA@downstream-node.cmc.ec.gc.ca)
2024-03-18 12:42:07,881 [DEBUG] sarracenia.moth.amqp putSetup putSetup ... Done!
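
As a hedged sketch of the failure mode described above (not the actual sarracenia code): a broker connection opened before os.fork() is shared by parent and child, so the child's first publish can land on a connection whose state the server no longer agrees with, giving "Server unexpectedly closed connection"; reconnecting in the child before the first publish avoids that. connect_to_broker() is a hypothetical helper.

import os

def connect_to_broker():
    """Open a fresh AMQP/TCP connection to the downstream broker (hypothetical)."""
    ...

conn = connect_to_broker()          # connection owned by the listener (instance 1)

pid = os.fork()
if pid == 0:
    # Child: do not reuse the inherited connection; the parent and the broker
    # still consider it theirs, so its state is undefined from here.
    conn = connect_to_broker()      # reconnect before the first publish
    # ... run the child's flow, publishing on its own connection ...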

@andreleblanc11
Member Author

@petersilva and I did some work and were able to fix the AMQP error I was encountering by moving the retry __init__ declarations to on_start. This in turn lets sarracenia run putSetup first and avoids calling it after encountering the error -> 23a648e
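
A sketch of the pattern referred to above, assuming the usual FlowCB constructor shape: defer state that depends on a fully-initialized flow from __init__ to on_start, so that setup like putSetup can run before anything is published. This illustrates the idea only; it is not the actual change in 23a648e, and the class name is hypothetical.

from sarracenia.flowcb import FlowCB

class AmRetrySetup(FlowCB):          # hypothetical plugin name
    def __init__(self, options):
        super().__init__(options)
        self.retry_queue = None      # not built yet: the flow isn't ready here

    def on_start(self):
        # By the time on_start runs, the flow (and its broker connections)
        # have been set up, so it is safe to create retry-related state.
        self.retry_queue = []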

@petersilva added the Priority 2 - Critical (High Criticality or High Visibility Problem) label on Mar 27, 2024
@petersilva
Contributor

This is pretty serious... I think it is a blocker for replacing the old MM systems.
The way the connection establishment works, it can lose messages on every connection.
Stable state is OK... but this is not really solid.

@petersilva
Contributor

petersilva commented Mar 27, 2024

To clarify... the problem wasn't when the child was being initialized; it was that all children "thought" they were instance 1, because they were inheriting their initialization from instance 1 (the listener). The ps command output showed they were instance 1, and when they chose which instance files to create, they would be the wrong ones.

The solution was to have the children do an os.execl()... that is, to re-launch the process with the correct instance number assigned, so the logs, pid files, etc. all get properly assigned. This eliminates the race condition because there is only one instance 1.
Now all the children start up, and the ps output aligns with which instance each child thinks it is.
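
A minimal sketch of that re-launch approach: after fork(), the child replaces itself with os.execl() so it truly becomes instance N, with its own argv, logs, and pid files. The module invocation and the instance flag below are illustrative, not the actual sr3 command line.

import os
import sys

def spawn_instance(instance_number):
    pid = os.fork()
    if pid == 0:
        # Child: re-exec the instance entry point with the right number,
        # instead of carrying on as a copy of instance 1 (the listener).
        os.execl(sys.executable, sys.executable,
                 "-m", "sarracenia.instance",
                 "--instance", str(instance_number))   # illustrative arguments
    return pid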

The shutdown also didn't quite work... the socket.recv() is blocking. Added timeout processing and short timeouts so the code path to exit works properly.
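
A sketch of that shutdown fix: give the AM socket a short timeout so recv() stops blocking forever and the receive loop can notice a stop request. should_stop is a hypothetical callable standing in for whatever stop flag the instance checks.

import socket

def receive_loop(sock, should_stop):
    sock.settimeout(1.0)                 # short timeout instead of a blocking recv
    while not should_stop():
        try:
            data = sock.recv(4096)
        except socket.timeout:
            continue                     # nothing arrived; re-check the stop flag
        if not data:
            break                        # peer closed the connection
        # ... hand the received bulletin bytes to the flow ...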

@andreleblanc11
Member Author

PR #1005 solves the original race condition.
