
Possible race condition when reading retry file during housekeeping intervals #997

Closed
andreleblanc11 opened this issue Mar 24, 2024 · 9 comments
Labels: bug (Something isn't working), Priority 2 - Critical (High Criticality or High Visibility Problem), v3only (Only affects v3 branches)

Comments

@andreleblanc11
Member

I saw over the weekend that the AM server crashed due to the following error.

Traceback (most recent call last):
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py", line 248, in <module>
    i.start()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/instance.py", line 239, in start
    self.running_instance.run()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/flow/__init__.py", line 586, in run
    self._runCallbackMetrics()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/flow/__init__.py", line 319, in _runCallbackMetrics
    self.metrics[module_name] = p()
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/flowcb/retry.py", line 166, in metricsReport
    return {'msgs_in_download_retry': len(self.download_retry), 'msgs_in_post_retry': len(self.post_retry)}
ValueError: __len__() should return >= 0

What I found in the logs is that the retry file gets opened during a pass of the flow algorithm by the process that is about to die.

# Process 1
2024-03-24 00:57:12,029 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /home/sarra/.cache/sr3/flow/amserver/diskqueue_post_retry_001 open read
2024-03-24 00:57:12,030 [INFO] sarracenia.flowcb.log after_work flowed ok: /datalocal/am/bulletins/20240324/IS/CWVH/01/ISAB64_CWVH_240100___00001

And just before that, another process also opens the retry file for reading:

# Process 2
2024-03-24 00:57:11,470 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /home/sarra/.cache/sr3/flow/amserver/diskqueue_post_retry_001.new open read

What I'm thinking is that there is some kind of race condition where the process that dies accesses the retry file just after the first one, which causes the crash.

I think this is normally avoided with multiple instances by designating a retry file for each instance number. But with the AM server we fork from the same process, so as a consequence they all use the same retry file.

FYI, this only crashed the affected instance and it was able to recover one second later. However, we got unlucky and lost 7/8 bulletins due to that crash, as there were lots of files being transferred on the socket at the time.
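
For illustration, here is a minimal, hypothetical sketch (not the actual sarracenia DiskQueue code) of how two processes consuming the same retry file, each with its own cached message counter, can drive that counter negative and trigger the ValueError: __len__() should return >= 0 seen in the traceback above:

import os

class SharedDiskQueue:
    """Toy stand-in for an on-disk retry queue shared by two processes."""

    def __init__(self, path):
        self.path = path
        # Cached count, taken from whatever is on disk when this process starts.
        self.msg_count = self._count_on_disk()

    def _count_on_disk(self):
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            return sum(1 for _ in f)

    def get(self):
        line = None
        if os.path.exists(self.path):
            with open(self.path, 'r+') as f:
                lines = f.readlines()
                if lines:
                    line = lines[0]
                    f.seek(0)
                    f.writelines(lines[1:])
                    f.truncate()
        # The flaw being illustrated: the cached counter is decremented even when
        # another process has already drained the file, so with two processes
        # sharing one retry file it can fall below zero.
        self.msg_count -= 1
        return line

    def __len__(self):
        return self.msg_count   # len() raises ValueError once this goes negative

With per-instance retry files, only one process ever touches a given file, so the counter and the file stay in step.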

@andreleblanc11 added the bug (Something isn't working), sundew-migration, and v3only (Only affects v3 branches) labels and removed the sundew-migration label on Mar 24, 2024
@andreleblanc11 changed the title from "Race condition when reading retry file during housekeeping intervals" to "Possible race condition when reading retry file during housekeeping intervals" on Mar 24, 2024
@petersilva
Contributor

When we get a connection, the instance number is supposed to change... so the retry file would be different. It's something to do with ensuring the instance number is properly propagated on startup.

@petersilva
Contributor

But hang on... retrying a download on an AM message doesn't make any sense. We should likely turn off retry (remove the callback) for the AM receiver. The only reason to retry a download is because you lost the connection half-way... there is no recovering.
AM protocol is just a vanilla socket, there are no retries... so part of the bug here is that the retry file is created at all.

@andreleblanc11
Member Author

Retry does work, because we're downloading from the in-line contents of the sarracenia message that the AM sr3 plugin creates. I've already seen AM create a sarracenia message and then retry it for download.

While it's true that the AM socket won't resend the file contents, the sarracenia message saves them and can be appended to the retry queue if ever the download fails.
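
A rough sketch of the idea: the notification message itself carries the bulletin payload, so the "download" (and therefore a retry) only needs the message, not the original AM socket. The field names below follow the sr3 inline-content convention as I recall it, and the payload value is invented, so treat the whole thing as illustrative.

# Hypothetical sarracenia message with the bulletin body inlined by the AM plugin.
msg = {
    "pubTime": "20240324T005712.029",
    "baseUrl": "file:",
    "relPath": "/datalocal/am/bulletins/20240324/IS/CWVH/01/ISAB64_CWVH_240100___00001",
    "content": {
        "encoding": "utf-8",
        "value": "ISAB64 CWVH 240100 ...",   # invented payload
    },
}

def download_from_inline(msg, dest):
    # Writing out the inlined value is all the "download" amounts to, which is
    # why re-queueing the message in the retry queue is enough to try again later.
    with open(dest, "w") as f:
        f.write(msg["content"]["value"])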

@petersilva
Contributor

I get that it "works" but it's not really a retry... I mean you are "downloading" what you already have in the message. I don't understand how a first "download" could fail, and a retry would succeed, oh... I guess disk full might do it. Do you know why the initial downloads failed, triggering the retry logic?

@andreleblanc11
Member Author

When there's a retry with sr3's AM plugin, it's always caused by the same error. I had already noticed it in the past but ignored it, as I thought retry was doing a good enough job of resending the files as is.

2024-03-18 12:41:08,816 [DEBUG] sarracenia.moth.amqp putNewMessage Exception details:
Traceback (most recent call last):
  File "/home/sarra/.local/lib/python3.10/site-packages/sarracenia/moth/amqp.py", line 743, in putNewMessage
    self.channel.basic_publish(AMQP_Message, exchange, topic, timeout=pub_timeout)
  File "/usr/lib/python3/dist-packages/amqp/channel.py", line 1779, in _basic_publish
    self.connection.drain_events(timeout=0)
  File "/usr/lib/python3/dist-packages/amqp/connection.py", line 522, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/lib/python3/dist-packages/amqp/connection.py", line 527, in blocking_read
    frame = self.transport.read_frame()
  File "/usr/lib/python3/dist-packages/amqp/transport.py", line 310, in read_frame
    frame_header = read(7, True)
  File "/usr/lib/python3/dist-packages/amqp/transport.py", line 594, in _read
    raise OSError('Server unexpectedly closed connection')
OSError: Server unexpectedly closed connection

The error always happens at the beginning of the forked process, when the first message is created and we try to publish it downstream. I think it fails because the AMQP connection is the same one as the first instance's and crashes as a result... not sure.

Once a new batch of messages appears, sarracenia does the right thing and re-establishes the connection downstream:

2024-03-18 12:42:07,856 [INFO] sarracenia.flowcb.log after_work flowed ok: /datalocal/am/bulletins/20240318/FX/CYCX/09/FXCN48_CYCX_180930___00001
2024-03-18 12:42:07,878 [DEBUG] amqp _on_start Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@downstream-node.cmc.ec.gc.ca', 'copyright': 'Copyright (C) 2007-2019 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP 21.3', 'product': 'RabbitMQ', 'version': '3.7.13'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
2024-03-18 12:42:07,879 [DEBUG] amqp __init__ using channel_id: 1
2024-03-18 12:42:07,880 [DEBUG] amqp _on_open_ok Channel open
2024-03-18 12:42:07,880 [DEBUG] amqp __init__ using channel_id: 2
2024-03-18 12:42:07,881 [DEBUG] amqp _on_open_ok Channel open
2024-03-18 12:42:07,881 [DEBUG] sarracenia.moth.amqp putSetup putSetup ... 1. declaring ['xs_MSC-CA']
2024-03-18 12:42:07,881 [INFO] sarracenia.moth.amqp putSetup exchange declared: xs_MSC-CA (as: amqps://MSC-CA@downstream-node.cmc.ec.gc.ca)
2024-03-18 12:42:07,881 [DEBUG] sarracenia.moth.amqp putSetup putSetup ... Done!
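
As a hedged sketch of the failure mode described above (not the actual sarracenia code): a broker connection opened before os.fork() is shared by parent and child, so the child's first publish can land on a connection whose state the server no longer agrees with, giving "Server unexpectedly closed connection"; reconnecting in the child before the first publish avoids that. connect_to_broker() is a hypothetical helper.

import os

def connect_to_broker():
    """Open a fresh AMQP/TCP connection to the downstream broker (hypothetical)."""
    ...

conn = connect_to_broker()          # connection owned by the listener (instance 1)

pid = os.fork()
if pid == 0:
    # Child: do not reuse the inherited connection; the parent and the broker
    # still consider it theirs, so its state is undefined from here.
    conn = connect_to_broker()      # reconnect before the first publish
    # ... run the child's flow, publishing on its own connection ...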

@andreleblanc11
Member Author

@petersilva and I did some work and were able to fix the AMQP error I was encountering by moving the retry __init__ declarations to on_start. This in turn lets sarracenia run putSetup first and avoids calling it after encountering the error -> 23a648e
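
A sketch of the pattern referred to above, assuming the usual FlowCB constructor shape: defer state that depends on a fully-initialized flow from __init__ to on_start, so that setup like putSetup can run before anything is published. This illustrates the idea only; it is not the actual change in 23a648e, and the class name is hypothetical.

from sarracenia.flowcb import FlowCB

class AmRetrySetup(FlowCB):          # hypothetical plugin name
    def __init__(self, options):
        super().__init__(options)
        self.retry_queue = None      # not built yet: the flow isn't ready here

    def on_start(self):
        # By the time on_start runs, the flow (and its broker connections)
        # have been set up, so it is safe to create retry-related state.
        self.retry_queue = []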

@petersilva added the Priority 2 - Critical (High Criticality or High Visibility Problem) label on Mar 27, 2024
@petersilva
Contributor

This is pretty serious... I think it is a blocker for replacing the old MM systems.
The way the connection establishment works, it can lose messages on every connection.
Stable state is OK... but this is not really solid.

@petersilva
Contributor

petersilva commented Mar 27, 2024

To clarify... the problem wasn't when the child was being initialized; it was that all children "thought" they were instance 1, because they were inheriting their initialization from instance 1 (the listener). The ps command output showed they were instance 1, and when they chose which instance files to create, they would be the wrong ones.

The solution was to have the children do an os.execl()... that is, to re-launch the process with the correct instance number assigned, so the logs, pid files, etc. all get properly assigned. This eliminates the race condition because there is only one instance 1.
Now all the children start up, and the ps output aligns with which instance each child thinks it is.
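
A minimal sketch of that re-launch approach: after fork(), the child replaces itself with os.execl() so it truly becomes instance N, with its own argv, logs, and pid files. The module invocation and the instance flag below are illustrative, not the actual sr3 command line.

import os
import sys

def spawn_instance(instance_number):
    pid = os.fork()
    if pid == 0:
        # Child: re-exec the instance entry point with the right number,
        # instead of carrying on as a copy of instance 1 (the listener).
        os.execl(sys.executable, sys.executable,
                 "-m", "sarracenia.instance",
                 "--instance", str(instance_number))   # illustrative arguments
    return pid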

The shutdown also didn't quite work... the socket.recv() is blocking. Added timeout processing and short timeouts so the code path to exit works properly.
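
A sketch of that shutdown fix: give the AM socket a short timeout so recv() stops blocking forever and the receive loop can notice a stop request. should_stop is a hypothetical callable standing in for whatever stop flag the instance checks.

import socket

def receive_loop(sock, should_stop):
    sock.settimeout(1.0)                 # short timeout instead of a blocking recv
    while not should_stop():
        try:
            data = sock.recv(4096)
        except socket.timeout:
            continue                     # nothing arrived; re-check the stop flag
        if not data:
            break                        # peer closed the connection
        # ... hand the received bulletin bytes to the flow ...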

@andreleblanc11
Member Author

PR #1005 solves the original race condition.
