Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using s6 service ready notification instead of sleep #6469

Closed

Conversation

unkcpz
Copy link
Member

@unkcpz unkcpz commented Jun 7, 2024

fixes #6448

For the container that includes the services, aiida prepare script is called as soon as the RabbitMQ startup script has
been launched, but it can take a while for the RMQ service to come up. If verdi presto is called straight away
it is possible it tries to connect to the service before that and it will configure the profile without a broker.

This PR using s6-notifyoncheck to do the health check every 500ms until the RMQ is ready. As suggested by just-containers/s6-overlay#582. From my investigation, RMQ didn't provide the pre-active readiness notification to the file descriptor by the its daemon. Thus I use readiness polling.

Copy link

codecov bot commented Jun 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.82%. Comparing base (ef60b66) to head (76c87c3).
Report is 116 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6469      +/-   ##
==========================================
+ Coverage   77.51%   77.82%   +0.31%     
==========================================
  Files         560      562       +2     
  Lines       41444    41889     +445     
==========================================
+ Hits        32120    32594     +474     
+ Misses       9324     9295      -29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sphuber
Copy link
Contributor

sphuber commented Jun 7, 2024

Thanks @unkcpz did you try to build and run locally? I could build just fine, but when I run

TAG=":latest" docker compose -f docker-compose.aiida-core-with-services.yml up

it seems to get stuck:

aiida-1  | selecting default max_connections ... 100
aiida-1  | selecting default shared_buffers ... 128MB
aiida-1  | selecting default time zone ... Europe/Zurich
aiida-1  | creating configuration files ... ok
aiida-1  | running bootstrap script ... ok
aiida-1  | performing post-bootstrap initialization ... 2024-06-07 15:06:49.743165+00:00 [warning] <0.130.0> Using RABBITMQ_ADVANCED_CONFIG_FILE: /opt/rabbitmq_server-3.10.18/etc/rabbitmq/advanced.config
aiida-1  | ok
aiida-1  | Error: unable to perform an operation on node 'rabbit@localhost'. Please see diagnostics information and suggestions below.
aiida-1  | 
aiida-1  | Most common reasons for this are:
aiida-1  | 
aiida-1  |  * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
aiida-1  |  * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
aiida-1  |  * Target node is not running
aiida-1  | 
aiida-1  | In addition to the diagnostics info below:
aiida-1  | 
aiida-1  |  * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
aiida-1  |  * Consult server logs on node rabbit@localhost
aiida-1  |  * If target node is configured to use long node names, don't forget to use --longnames with CLI tools
aiida-1  | 
aiida-1  | DIAGNOSTICS
aiida-1  | ===========
aiida-1  | 
aiida-1  | attempted to contact: [rabbit@localhost]
aiida-1  | 
aiida-1  | rabbit@localhost:
aiida-1  |   * connected to epmd (port 4369) on localhost
aiida-1  |   * epmd reports: node 'rabbit' not running at all
aiida-1  |                   no other nodes on localhost
aiida-1  |   * suggestion: start the node
aiida-1  | 
aiida-1  | Current node details:
aiida-1  |  * node name: 'rabbitmqcli-854-rabbit@localhost'
aiida-1  |  * effective user's home directory: /home/aiida
aiida-1  |  * Erlang cookie hash: bT3Sb8VwvV4uzQNrwGXRcQ==
aiida-1  | 
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | syncing data to disk ... 2024-06-07 15:06:51.599988+00:00 [info] <0.221.0> Feature flags: list of feature flags found:
aiida-1  | 2024-06-07 15:06:51.600028+00:00 [info] <0.221.0> Feature flags:   [ ] classic_mirrored_queue_version
aiida-1  | 2024-06-07 15:06:51.600045+00:00 [info] <0.221.0> Feature flags:   [ ] classic_queue_type_delivery_support
aiida-1  | 2024-06-07 15:06:51.600092+00:00 [info] <0.221.0> Feature flags:   [ ] implicit_default_bindings
aiida-1  | 2024-06-07 15:06:51.600110+00:00 [info] <0.221.0> Feature flags:   [ ] maintenance_mode_status
aiida-1  | 2024-06-07 15:06:51.600127+00:00 [info] <0.221.0> Feature flags:   [ ] quorum_queue
aiida-1  | 2024-06-07 15:06:51.600138+00:00 [info] <0.221.0> Feature flags:   [ ] stream_queue
aiida-1  | 2024-06-07 15:06:51.600198+00:00 [info] <0.221.0> Feature flags:   [ ] user_limits
aiida-1  | 2024-06-07 15:06:51.600209+00:00 [info] <0.221.0> Feature flags:   [ ] virtual_host_metadata
aiida-1  | 2024-06-07 15:06:51.600222+00:00 [info] <0.221.0> Feature flags: feature flag states written to disk: yes
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | 2024-06-07 15:06:51.976208+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped
aiida-1  | 2024-06-07 15:06:51.976270+00:00 [notice] <0.221.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
aiida-1  | 
aiida-1  |   ##  ##      RabbitMQ 3.10.18
aiida-1  |   ##  ##
aiida-1  |   ##########  Copyright (c) 2007-2023 VMware, Inc. or its affiliates.
aiida-1  |   ######  ##
aiida-1  |   ##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com
aiida-1  | 
aiida-1  |   Erlang:      24.2.1 [jit]
aiida-1  |   TLS Library: OpenSSL - OpenSSL 3.0.2 15 Mar 2022
aiida-1  |   Release series support status: supported
aiida-1  | 
aiida-1  |   Doc guides:  https://rabbitmq.com/documentation.html
aiida-1  |   Support:     https://rabbitmq.com/contact.html
aiida-1  |   Tutorials:   https://rabbitmq.com/getstarted.html
aiida-1  |   Monitoring:  https://rabbitmq.com/monitoring.html
aiida-1  | 
aiida-1  |   Logs: /home/aiida/.rabbitmq/log/rabbit@localhost.log
aiida-1  |         /home/aiida/.rabbitmq/log/rabbit@localhost_upgrade.log
aiida-1  |         <stdout>
aiida-1  | 
aiida-1  |   Config file(s): /opt/rabbitmq_server-3.10.18/etc/rabbitmq/advanced.config
aiida-1  | 
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  |   Starting broker...ok
aiida-1  | 
aiida-1  | 
aiida-1  | Success. You can now start the database server using:
aiida-1  | 
aiida-1  |     pg_ctl -D /home/aiida/.postgresql -l logfile start
aiida-1  | 
aiida-1  | initdb: warning: enabling "trust" authentication for local connections
aiida-1  | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
aiida-1  | s6-rc: info: service postgresql-init successfully started
aiida-1  | s6-rc: info: service postgresql: starting
aiida-1  | waiting for server to start.... done
aiida-1  | server started
aiida-1  | s6-rc: info: service postgresql successfully started
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | s6-rc: info: service postgresql-prepare successfully started
aiida-1  |  completed with 0 plugins.

but I am not sure if this is because I am doing something wrong

@unkcpz
Copy link
Member Author

unkcpz commented Jun 7, 2024

Yes, I tested. The reason you got stuck is that you run this image with docker compose. Docker compose firing all the containers but didn't know where you want to get into which container. You can run docker exec -i <container-id> bash to get into the container.
What I usually run is by for with-services container is by docker run --rm -it aiidateam/aiida-core-with-services:latest bash. The bash is the entry point of the container.

@sphuber
Copy link
Contributor

sphuber commented Jun 7, 2024

The reason you got stuck is that you run this image with docker compose. Docker compose firing all the containers but didn't know where you want to get into which container.

Huh, but the aiida-core-with-services.yml only defines the aiida container as the services are integrated. So why would it try to launch multiple containers as you say?

For what it is worth, I tried running:

docker run --rm -it aiidateam/aiida-core-with-services:latest bash

but it seems to get stuck with a problem in RabbitMQ at the same point

  Starting broker...Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /home/aiida/.postgresql -l logfile start

s6-rc: info: service postgresql-init successfully started
s6-rc: info: service postgresql: starting
waiting for server to start.... done
server started
s6-rc: info: service postgresql successfully started
s6-rc: info: service postgresql-prepare successfully started
 completed with 0 plugins.

@unkcpz
Copy link
Member Author

unkcpz commented Jun 7, 2024

That is weird, it is all fine in my computer, can you try to docker exec into the stuck container and run rabbitmq-diagnostics check_running from another shell?

@sphuber
Copy link
Contributor

sphuber commented Jun 11, 2024

That is weird, it is all fine in my computer, can you try to docker exec into the stuck container and run rabbitmq-diagnostics check_running from another shell?

That actually seems to work:

(base) aiida@ecd0bdcd1a87:~$ rabbitmq-diagnostics check_running
Checking if RabbitMQ is running on node rabbit@localhost ...
RabbitMQ on node rabbit@localhost is fully booted and running
(base) aiida@ecd0bdcd1a87:~$ rabbitmq-diagnostics is_running
Asking node rabbit@localhost for its status ...
RabbitMQ on node rabbit@localhost is fully booted and running

But the shell with which I ran docker run keeps hanging and never drops me into the bash prompt. Weird....

If I CTRL+C the hanging docker run I get:

^Cs6-rc: warning: received INT, aborting longrun transitions and exiting asap
s6-rc: warning: unable to start service rabbitmq: command crashed with signal 2

so it does indeed seem to have a problem with the service startup. Note that after the CTRL+C it does drop me into a bash shell, but RMQ does not seem to be running and the profile has not been created.

(base) aiida@580ba47ac564:/home/aiida$ verdi status
 ✔ version:     AiiDA v2.5.1.post0
 ✔ config:      /home/aiida/.aiida
 ⏺ profile:     no profile configured yet
Report: Run `verdi presto` to automatically setup a profile using all defaults or use `verdi profile setup` for more control.
(base) aiida@580ba47ac564:/home/aiida$ verdi presto
Report: Option `--use-postgres` not enabled: configuring the profile to use SQLite.
Report: RabbitMQ server not found: configuring the profile without a broker.
Report: Initialising the storage backend.
Report: Storage initialisation completed.
Success: Created new profile `presto`.
Success: Configured the localhost as a computer.

Final data point: when I leave the docker run blocking and hanging, and then docker exec into it in another shell, in that shell RMQ seems to be running and so running verdi presto works as expected and it can connect to RMQ.

Copy link
Member Author

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber thanks for the test and write up. It is strange and I have no idea why not working for you. I tested on my arm64 Macbook as well and I didn't get stuck.

I am curious to run with your image, maybe you can upload to docker.io and I can play with it to see what is the different? Meanwhile docker.io can show the layers so probably I can spot the different by compare layer by layer.

@danielhollas
Copy link
Collaborator

FWIW On Fedora 39 using Podman instead of docker, the following also worked for me (I got a bash interactive prompt)

podman run --rm -it ghcr.io/aiidateam/aiida-core-with-services:latest bash

@sphuber
Copy link
Contributor

sphuber commented Jun 12, 2024

FWIW On Fedora 39 using Podman instead of docker, the following also worked for me (I got a bash interactive prompt)

podman run --rm -it ghcr.io/aiidateam/aiida-core-with-services:latest bash

But that latest on ghcr.io is not the image from this PR, is it? I think the current workflow definition no longer pushes to the registry in a PR. Could you perhaps try checking out this branch and build locally and try to run it @danielhollas ?

@danielhollas
Copy link
Collaborator

I am afk and will not get to this in the near future. But the image will get pushed to ghcr.io if you push this branch to origin. In general PRs modifying the docker image should be opened from branches on this origin repo for this reason.

@unkcpz
Copy link
Member Author

unkcpz commented Jun 13, 2024

@danielhollas I think if I make the PR from repo to trigger the whole docker CI, it is not push to ghcr.io with pr-xx tag anymore, correct? Otherwise we can ask @sphuber to try the image being built from CI.
At least from the all passed tests that the image built from CI is working as expected..

@danielhollas
Copy link
Collaborator

It will. The tag will be different, it's the name of the branch. Also you can always download by digest.

@unkcpz
Copy link
Member Author

unkcpz commented Jun 13, 2024

It is not apparently, or I did something wrong? see #6475

@danielhollas
Copy link
Collaborator

danielhollas commented Jun 13, 2024 via email

@unkcpz
Copy link
Member Author

unkcpz commented Jun 13, 2024

Is the workflow running on main?

Yes, it runs in main.

@skarnet
Copy link

skarnet commented Jun 14, 2024

@sphuber It sounds like rabbitmq-server is properly starting, but the rabbitmq-diagnostics check_running program invoked from data/check does not properly detect that, or fails to exit 0, so the supervisor does not detect that rabbitmq-server is ready. Could you redirect, in data/check, the stdout and stderr of rabbitmq-diagnostics to a file, and post the contents so we can see what's going on?

@sphuber
Copy link
Contributor

sphuber commented Jun 15, 2024

@skarnet thanks for the suggestion. It basically shows the same output I already shared before:

Checking if RabbitMQ is running on node rabbit@localhost ...
Error: unable to perform an operation on node 'rabbit@localhost'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@localhost
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: [rabbit@localhost]

rabbit@localhost:
  * connected to epmd (port 4369) on localhost
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on localhost
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-371-rabbit@localhost'
 * effective user's home directory: /home/aiida
 * Erlang cookie hash: nwhUk5qM8X4W/2Jko5yVSg==

Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)

But again when I drop into the container with docker exec in another shell, rabbitmq-diagnostics check_running confirms that RMQ is up and running

@skarnet
Copy link

skarnet commented Jun 16, 2024

That is strange; there has to be some environmental difference between the failing and successful invocations of rabbitmq-diagnostics check_running.

Can you strace both, so we can take a look at the result? i.e. in data/check, as well as in your docker exec environment, replace the invocation with strace -v -s256 -- rabbitmq-diagnostics check_running and store the stderr somewhere. Caution: it will likely be very verbose.

@sphuber
Copy link
Contributor

sphuber commented Jul 11, 2024

Can you strace both, so we can take a look at the result? i.e. in data/check, as well as in your docker exec environment, replace the invocation with strace -v -s256 -- rabbitmq-diagnostics check_running and store the stderr somewhere. Caution: it will likely be very verbose.

Thanks @skarnet I finally found the time to get around to trying your suggestion. I have attached the log.
rabbitmq-diag.log
Hopefully there is something useful in there.

@sphuber
Copy link
Contributor

sphuber commented Jul 11, 2024

Actually, I tried switching rabbitmq-diagnostics check_running to rabbitmq-diagnostics is_running and that seems to be working.

@skarnet
Copy link

skarnet commented Jul 11, 2024

Actually, I tried switching rabbitmq-diagnostics check_running to rabbitmq-diagnostics is_running and that seems to be working.

I'm glad you found something that worked, because the rabbitmq-diagnostics execution is way too complex for me to find the reason in the strace. Looking at the documentation page, it sounds like rabbitmq-diagnostics is_running is indeed the correct command to use in a data/check program.

Cheers!

@sphuber
Copy link
Contributor

sphuber commented Jul 11, 2024

Actually, I tried switching rabbitmq-diagnostics check_running to rabbitmq-diagnostics is_running and that seems to be working.

I'm glad you found something that worked, because the rabbitmq-diagnostics execution is way too complex for me to find the reason in the strace. Looking at the documentation page, it sounds like rabbitmq-diagnostics is_running is indeed the correct command to use in a data/check program.

Cheers!

Thanks for your help anyway @skarnet much appreciated

@unkcpz
Copy link
Member Author

unkcpz commented Jul 11, 2024

Thanks both! I update the PR.

@sphuber
Copy link
Contributor

sphuber commented Jul 11, 2024

Superseded by #6475

@sphuber sphuber closed this Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace the sleep in aiida-prepare.sh startup script of Docker images with more clever solution
4 participants