Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worker getting removed #850

Open
FNTwin opened this issue Oct 8, 2024 · 18 comments
Open

Worker getting removed #850

FNTwin opened this issue Oct 8, 2024 · 18 comments

Comments

@FNTwin
Copy link

FNTwin commented Oct 8, 2024

I'm running some SPC with the qcportal suite and I'm using the SlurmExecutor to scale up the number of computation on slurm. The issue is that not all the workers requested from the config gets allocated (could be that my cluster is at capacity) but after a while, the workers correctly allocated get CANCELLED. Looking at the interchange.log I can see the Managers have compatible Parsl version and Python version but at some point I get:

Interchange-Command _command_server [INFO] Received HOLD_WORKER 
MainThread expire_bad_managers [WARNING] Too many heartbeats missed for manager

The workers/managers get removed from the SLURM queue and new one get allocated (in less numbers).

Some addictional info, i'm using a qcfractal manager config with update_frequency: 10, while the server config has the heartbeat_frequency:10 too. I tried to increase both the update_frequency and the heartbeat_frequency with no avail.

Edit: after a while workers are not doing any task and I had to manually kill the jobs

qcportal.client_base.PortalRequestError: Request failed: Manager slurm_manager-slclogin1-5bb9853a-0fbd-43e2-9c2a-f8840ff78b64 is not active (HTTP status 400)

or

qcportal.client_base.PortalRequestError: Request failed: Cannot update resource stats for manager slurm_manager-slclogin1-5f3b9eb4-a2e6-4822-a8ec-d3dc0f1d622f - is not active (HTTP status 400)

On further investigation I noticed that the first batch of workers/managers are killed because their Idle time is more than the max_idletime for the slurm executor in Parsl even when they are running.

parsl.jobs.strategy: Idle time has reached 120.0s for executor slurm_executor; scaling in
@bennybp
Copy link
Contributor

bennybp commented Oct 9, 2024

Are you submitting the managers themselves as jobs, or running a manager on the head node which delegates out to the SLURM queue? If it's the former, there may be a misconfiguration somewhere.

The first error here (Interchange-Command... comes from Parsl error, which means that Parsl itself is having issues spinning up workers.

Either way, is there anything in the logs for the QCFractal instance? That will usually have a little bit more info, although I'm guessing it's just managers not reporting back.

@FNTwin
Copy link
Author

FNTwin commented Oct 9, 2024

Hi @bennybp , The configuration that I setup is Server on the Head Node + Manager on the Head Node that delegates out the SLURM queues with the workers. I'll restart the server with DEBUG mode for the logging as I didn't see any useful logs from the info.

The first error here (Interchange-Command... comes from Parsl error, which means that Parsl itself is having issues spinning up workers.

Yep, for some reason it spins up correctly the workers (slurm allocate the jobs) but they get put on hold because for some reason their idletime always surpass the max_idletime from parsl.

About the config, I used the same config on another slurm cluster and it was pretty much working correctly (I also suspect that I have some I/O latency or some other issues as the SPC seems rather slow, any advice on how can I benchmark the retrieval of the jobs and the queue?)

I posted the config that I'm currently using:

cluster: slurm_manager        
loglevel: DEBUG 
logfile: qcfractal-manager.log
update_frequency: 180.0 # I updated this 

server:
  fractal_uri: xxxxx
  username: xxxxx
  password: xxxxx
  verify: False


executors:
  slurm_executor:
    type: slurm
    scratch_directory: ${HOME}/project/qcfractal
    walltime: "72:00:00"
    exclusive: false 
    partition: compute
    account: null
    workers_per_node: 20 
    max_nodes: 30
    cores_per_worker: 8         
    memory_per_worker: 8        
    queue_tags:                 
      - '*'
    environments:
      use_manager_environment: False
      conda:  
        - qcfractal-worker 
    worker_init: 
      - source worker-init.sh  # abs path for the init for the worker

with a pretty standard worker-init.sh:

#!/bin/bash

source ~/.bashrc
conda init 

ulimit -s unlimited
mkdir -p "${HOME}/project/qcfractal"
cd "${HOME}/project/qcfractal"
source ${HOME}/conda/bin/activate qcfractal_manager

I know that parsl would require full exclusivity of the node, could it be a big issue currently that I'm not enabling it?

Edit: no info from the logs of the server from DEBUG. Even when the workers are "killed" , the manager is still on

@bennybp
Copy link
Contributor

bennybp commented Oct 11, 2024

Ok I've been thinking about this a bit more. There could be a couple issues, although it's not entirely clear. Below is some semi-random thoughts.

One part is that the slurm jobs might be getting killed out from under parsl. One thing to check is the raw outputs from slurm. Parsl keeps these in the parsl_run_dir (this is configurable in the manager config, but the default is parsl_run_dir, and I believe typically exists in the same directory as the manager config). This should hopefully give hints as to why the slurm jobs are being killed.

The more mysterious part is that the qcfractal server is determining your manager is dead, when it is very much alive still. An update frequency of 10 is a little low, but not so bad (especially if your tasks are small). A heartbeat frequency (on the server config) of 10 is probably a bit low - 60-180 seconds is typically what I run with, with a heartbeat_max_missed of 5. This would allow for networking issues or other random errors lasting 5-10 minutes.

Side note: I realized the docs are a bit wrong. The update frequency is only the time between a qcfractal manager requesting new tasks. It is not related to the heartbeat mechanism at all, which is handled automatically and not configurable on the manager side.

I also suspect that I have some I/O latency or some other issues as the SPC seems rather slow

You can look at the qcfractal server log - with log level DEBUG it will print response times for the accesses as they come in. It's possible some request takes so long that the manager chokes on it (although the heartbeat mechanism and the job fetching mechanism run in different threads).

Two other random ideas:

  1. Are you running your database with its data directory on network storage? I've seen problems with this, although typically only under load. The postgres log may have some complaints
  2. Unlikely, but do you have a multiple job runner processes or servers behind a load balancer or proxy? I made mistake once and had separate job runners with different heartbeat times, so one was always marking managers dead when they weren't.

If you feel comfortable with sharing any logs (either here or by email), I could have a look.

@FNTwin
Copy link
Author

FNTwin commented Oct 12, 2024

Ehy @bennybp , first of all, thank you for all the help that you are giving me.

One thing to check is the raw outputs from slurm.

Yes, I checked the output from both the worker (no issues in the worker logs or manager nor interchange) and the slurm files. The slurm shell file for the compute allocation is correct and (for the worker that get cancelled) the stdout just presents the usual

No action taken.
Found cores : xxx 
Launching worker: 1

While the stderr has a slurmstepd: error: *** job id timestamp ***
Interchange has the request of the scancel of the jobid and sacct agrees that the job was cancelled by my user.

60-180 seconds is typically what I run with, with a heartbeat_max_missed of 5. This would allow for networking issues or other random errors lasting 5-10 minutes.

I'll change them to better values but even increasing the server config heartbeat to around 180 seconds didn't change much (and also the update_frequency of the manager)

You can look at the qcfractal server log - with log level DEBUG it will print response times for the accesses as they come in. It's possible some request takes so long that the manager chokes on it (although the heartbeat mechanism and the job fetching mechanism run in different threads).

On my attempt on checking the DEBUG logs of the manager, I didn't see any response times logged beside the idletime from parsl but maybe I'm looking at the wrong things. I can share partial of the DEBUG logs (or even the full of it but it is 1Gb right now, better if by email).

database with its data directory on network storage?

Database is on the cloud and after some debugging and benchmarking, I found that retrieving task and making post request is basically the bulk of the time and the bottleneck (in comparison of the SPC calculation that is rather fast for now) from the ManagerClient (for a low amount of workers a single post request in return_finished is 3.5k ms for a body of 52 bytes... I will experiment with the data directory on the network in the meantime to remove this issue while I find some way to improve this)

do you have a multiple job runner processes or servers behind a load balancer or proxy?

Nope

One last question, because I'm doing some massive data generation and for the type of data that I'm handling (1 molecule is basically always 1 new entry), I fear that the querying will start to become a bottleneck as the size of the database and the tables will actually grow a lot. Do you think that is something I need to take in account?

@bennybp
Copy link
Contributor

bennybp commented Oct 12, 2024

Interchange has the request of the scancel of the jobid and sacct agrees that the job was cancelled by my user.

I didn't see any response times logged beside the idletime from parsl but maybe I'm looking at the wrong things

I meant the fractal server logs (not the manager log). There's so many log files :)

The server logs should contain something like this, with the response time at the end

[2024-10-12 00:11:58 UTC] (MainProcess     )    DEBUG: fractal_flask_app: GET api: 0 -> 1871780 [605.3ms]
[2024-10-12 00:12:00 UTC] (MainProcess     )    DEBUG: fractal_flask_app: GET api: 0 -> 60 [1.3ms]
[2024-10-12 00:12:00 UTC] (MainProcess     )    DEBUG: fractal_flask_app: POST api: 98 -> 76 [2.2ms]
[2024-10-12 00:12:00 UTC] (MainProcess     )    DEBUG: fractal_flask_app: POST api: 69 -> 23076 [6.5ms]

is 3.5k ms for a body of 52 bytes

That's definitely seems excessive, although 3.5s shouldn't really cause things to break (that tends to happen around 60 seconds). I guess the question is how much slower it is for a 'real' payload. I know there can be a bit of slowness when returning tasks to a busy server, which is something I need to look into.

I can share partial of the DEBUG logs

Whatever you are willing to send. They typically compress nicely, too :)

One last question, because I'm doing some massive data generation and for the type of data that I'm handling

Depends on how massive. We have >100M records and 4-5TB of data in one server, and it works well. There are certain operations that I need to improve (mostly dataset operations related to modifying existing records), so you might hit some rough edges there. We typically keep individual datasets to around 100k entries for that reason.

@FNTwin
Copy link
Author

FNTwin commented Oct 12, 2024

I meant the fractal server logs (not the manager log). There's so many log files :)

Strange thing, my current log of the server in DEBUG don't have that kind of logging. I'll get a view of the db logs tomorrow too.
I'll send the logs but usually it is something like this (didn't find any API response times...the only log from the fractal_flask_app is the successful login):

[2024-10-09 12:37:58 MDT] (MainProcess     )     INFO: alembic.runtime.migration: Context impl PostgresqlImpl.
[2024-10-09 12:37:58 MDT] (MainProcess     )     INFO: alembic.runtime.migration: Will assume transactional DDL.
[2024-10-09 12:37:58 MDT] (MainProcess     )    DEBUG: qcfractal.components.internal_jobs.socket: Job with name iterate_services already found. Not adding
[2024-10-09 12:37:59 MDT] (MainProcess     )    DEBUG: qcfractal.components.internal_jobs.socket: Job with name check_manager_heartbeats already found. Not adding
[2024-10-09 12:37:59 MDT] (Process-2       )    DEBUG: internal_job_runner:46576ba0-b718-4d06-ac13-4d506e78f6d9: checking for jobs
[2024-10-09 12:37:59 MDT] (Process-2       )    DEBUG: internal_job_runner:46576ba0-b718-4d06-ac13-4d506e78f6d9: checking for jobs before date 2024-10-09 18:37:59.394858+00:00
[2024-10-09 12:37:59 MDT] (Process-2       )     INFO: internal_job_runner:46576ba0-b718-4d06-ac13-4d506e78f6d9: running job iterate_services id=4112999 scheduled_date=2024-10-09 18:37:58.078250+00:00
...
[2024-10-10 07:16:04 MDT] (Process-2       )    DEBUG: internal_job_runner:ba5317c4-f867-4c5e-a0fb-4cb3c3c1a8dd: checking for jobs
[2024-10-10 07:16:04 MDT] (Process-2       )    DEBUG: internal_job_runner:ba5317c4-f867-4c5e-a0fb-4cb3c3c1a8dd: checking for jobs before date 2024-10-10 13:16:04.824062+00:00
[2024-10-10 07:16:04 MDT] (Process-2       )    DEBUG: internal_job_runner:ba5317c4-f867-4c5e-a0fb-4cb3c3c1a8dd: no jobs found
[2024-10-10 07:16:04 MDT] (Process-2       )    DEBUG: internal_job_runner:ba5317c4-f867-4c5e-a0fb-4cb3c3c1a8dd: found future job scheduled for 2024-10-10 13:17:04.770762+00:00, waiting up to 59.80 seconds
[2024-10-10 07:16:05 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (0 successful, 10 failed, 0 rejected).
[2024-10-10 07:16:05 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-6d40cf08-3cea-40fa-af2d-938c1550d389.
[2024-10-10 07:16:05 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 414440 414441 414442 414443 414452 414453 414454 414455 414456 414457
[2024-10-10 07:16:07 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (0 successful, 10 failed, 0 rejected).

Edit: ok I think I know why I don't have that logging. I need to use log_access : True

Depends on how massive. We have >100M records and 4-5TB of data in one server, and it works well.

I'm currently splitting my datasets into multiple 75k-ish records datasets but the scaling will go around 1B of entries

@FNTwin
Copy link
Author

FNTwin commented Oct 14, 2024

Small log with log_access : True . Post request for the compute blueprint are quite slow. I don't see any addictional info about why the workers are getting cancelled.

[2024-10-14 08:16:17 MDT] (MainProcess     )     INFO: alembic.runtime.migration: Context impl PostgresqlImpl.
[2024-10-14 08:16:17 MDT] (MainProcess     )     INFO: alembic.runtime.migration: Will assume transactional DDL.
[2024-10-14 08:16:17 MDT] (MainProcess     )    DEBUG: qcfractal.components.internal_jobs.socket: Job with name geolocate_accesses already found. Not adding
[2024-10-14 08:16:17 MDT] (MainProcess     )    DEBUG: qcfractal.components.internal_jobs.socket: Job with name iterate_services already found. Not adding
[2024-10-14 08:16:17 MDT] (MainProcess     )    DEBUG: qcfractal.components.internal_jobs.socket: Job with name check_manager_heartbeats already found. Not adding
[2024-10-14 08:16:17 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:16:17 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:16:17.972834+00:00
[2024-10-14 08:16:18 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:16:18 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:16:29.661807+00:00, waiting up to 11.32 seconds
[2024-10-14 08:16:29 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: not waiting, found possible job to run
[2024-10-14 08:16:29 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:16:29 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:16:29.789636+00:00
[2024-10-14 08:16:29 MDT] (Process-2       )     INFO: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: running job iterate_services id=4120011 scheduled_date=2024-10-14 14:16:29.661807+00:00
[2024-10-14 08:16:29 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: Iterating on services
[2024-10-14 08:16:30 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: After iteration, now 0 running services. Max is 20
[2024-10-14 08:16:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:16:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:16:30.660041+00:00
[2024-10-14 08:16:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:16:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:17:30.606958+00:00, waiting up to 59.80 seconds
[2024-10-14 08:16:46 MDT] (Process-1       )     INFO: fractal_flask_app: Successful login for user manager
[2024-10-14 08:16:46 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST auth: 48 -> 1126 [433.0ms]
[2024-10-14 08:16:47 MDT] (Process-1       )    DEBUG: fractal_flask_app: GET api: 0 -> 569 [49.8ms]
[2024-10-14 08:16:51 MDT] (Process-1       )    DEBUG: fractal_flask_app: GET api: 0 -> 569 [1.1ms]
[2024-10-14 08:16:51 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 366 -> 3 [252.4ms]
[2024-10-14 08:17:14 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Manager slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95 has claimed 200 new tasks
[2024-10-14 08:17:14 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 331 -> 563608 [21395.6ms]
[2024-10-14 08:17:16 MDT] (Process-1       )    DEBUG: fractal_flask_app: PATCH compute: 104 -> 4 [101.6ms]
[2024-10-14 08:17:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: not waiting, found possible job to run
[2024-10-14 08:17:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:17:30 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:17:30.778560+00:00
[2024-10-14 08:17:30 MDT] (Process-2       )     INFO: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: running job iterate_services id=4120021 scheduled_date=2024-10-14 14:17:30.606958+00:00
[2024-10-14 08:17:30 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: Iterating on services
[2024-10-14 08:17:31 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: After iteration, now 0 running services. Max is 20
[2024-10-14 08:17:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:17:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:17:31.115107+00:00
[2024-10-14 08:17:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:17:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:18:13.322619+00:00, waiting up to 42.07 seconds
[2024-10-14 08:17:45 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:17:45 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160154 1160155 1160156 1160157 1160158 1160159 1160160 1160161 1160162 1160163
[2024-10-14 08:17:50 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:17:50 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 122846 -> 156 [4328.5ms]
[2024-10-14 08:17:50 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:17:50 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160164 1160165 1160166 1160167 1160168 1160169 1160170 1160171 1160172 1160173
[2024-10-14 08:17:54 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:17:54 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 112646 -> 156 [3766.4ms]
[2024-10-14 08:17:54 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:17:54 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160174 1160175 1160176 1160177 1160178 1160179 1160180 1160181 1160182 1160183
[2024-10-14 08:17:57 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:17:57 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 102582 -> 156 [3799.5ms]
[2024-10-14 08:17:57 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:17:57 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160184 1160185 1160346 1160347 1160348 1160349 1160350 1160351 1160352 1160353
[2024-10-14 08:18:01 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:18:01 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 130374 -> 156 [3706.7ms]
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: not waiting, found possible job to run
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:18:13.480425+00:00
[2024-10-14 08:18:13 MDT] (Process-2       )     INFO: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: running job geolocate_accesses id=4120012 scheduled_date=2024-10-14 14:18:13.322619+00:00
[2024-10-14 08:18:13 MDT] (Process-2       )  WARNING: qcfractal.components.serverinfo.socket: GeoIP2 database file not found. Cannot add location data to accesses. May need to wait for the updater job to run
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:18:13.733018+00:00
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:18:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:18:31.062968+00:00, waiting up to 17.19 seconds
[2024-10-14 08:18:23 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Manager slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95 has claimed 200 new tasks
[2024-10-14 08:18:23 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 331 -> 561740 [21801.1ms]
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: not waiting, found possible job to run
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:18:31.198271+00:00
[2024-10-14 08:18:31 MDT] (Process-2       )     INFO: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: running job iterate_services id=4120022 scheduled_date=2024-10-14 14:18:31.062968+00:00
[2024-10-14 08:18:31 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: Iterating on services
[2024-10-14 08:18:31 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: After iteration, now 0 running services. Max is 20
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:18:31.537827+00:00
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:18:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:19:31.485456+00:00, waiting up to 59.80 seconds
[2024-10-14 08:18:55 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:18:55 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160186 1160187 1160188 1160189 1160190 1160191 1160192 1160193 1160194 1160195
[2024-10-14 08:18:59 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:18:59 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 103246 -> 156 [3941.7ms]
[2024-10-14 08:18:59 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:18:59 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160196 1160197 1160198 1160199 1160200 1160201 1160202 1160203 1160204 1160205
[2024-10-14 08:19:03 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:03 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 117642 -> 156 [3817.4ms]
[2024-10-14 08:19:03 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:03 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160206 1160207 1160208 1160209 1160210 1160211 1160212 1160213 1160214 1160215
[2024-10-14 08:19:07 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:07 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 133510 -> 156 [3783.0ms]
[2024-10-14 08:19:07 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:07 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160216 1160217 1160218 1160219 1160220 1160221 1160222 1160223 1160224 1160225
[2024-10-14 08:19:10 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:10 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 130290 -> 156 [3726.1ms]
[2024-10-14 08:19:10 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:10 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160226 1160227 1160228 1160229 1160230 1160231 1160232 1160233 1160234 1160235
[2024-10-14 08:19:14 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:14 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 134674 -> 156 [3750.5ms]
[2024-10-14 08:19:14 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:14 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160236 1160237 1160238 1160239 1160240 1160241 1160242 1160243 1160244 1160245
[2024-10-14 08:19:18 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:18 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 139926 -> 156 [3696.3ms]
[2024-10-14 08:19:18 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:18 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160246 1160247 1160248 1160249 1160250 1160251 1160252 1160253 1160254 1160255
[2024-10-14 08:19:22 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:22 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 128814 -> 156 [3672.2ms]
[2024-10-14 08:19:22 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:22 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160256 1160257 1160258 1160259 1160260 1160261 1160262 1160263 1160264 1160265
[2024-10-14 08:19:25 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:26 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 136070 -> 156 [3676.2ms]
[2024-10-14 08:19:26 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:26 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160266 1160267 1160268 1160269 1160270 1160271 1160272 1160273 1160274 1160275
[2024-10-14 08:19:29 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:29 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 138334 -> 156 [3674.5ms]
[2024-10-14 08:19:29 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:29 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160276 1160277 1160278 1160279 1160280 1160281 1160282 1160283 1160284 1160285
[2024-10-14 08:19:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: not waiting, found possible job to run
[2024-10-14 08:19:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:19:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:19:31.657060+00:00
[2024-10-14 08:19:31 MDT] (Process-2       )     INFO: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: running job iterate_services id=4120024 scheduled_date=2024-10-14 14:19:31.485456+00:00
[2024-10-14 08:19:31 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: Iterating on services
[2024-10-14 08:19:31 MDT] (Process-2       )     INFO: qcfractal.components.services.socket: After iteration, now 0 running services. Max is 20
[2024-10-14 08:19:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:19:31 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:19:31.995994+00:00
[2024-10-14 08:19:32 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:19:32 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:20:13.656056+00:00, waiting up to 41.52 seconds
[2024-10-14 08:19:33 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:33 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 136154 -> 156 [3670.8ms]
[2024-10-14 08:19:33 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:33 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160286 1160287 1160288 1160289 1160290 1160291 1160292 1160293 1160294 1160295
[2024-10-14 08:19:37 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:37 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 133750 -> 156 [3758.8ms]
[2024-10-14 08:19:37 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:37 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160296 1160297 1160298 1160299 1160300 1160301 1160302 1160303 1160304 1160305
[2024-10-14 08:19:41 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:41 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 130562 -> 156 [3711.1ms]
[2024-10-14 08:19:41 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:41 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160306 1160307 1160308 1160309 1160310 1160311 1160312 1160313 1160314 1160315
[2024-10-14 08:19:44 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:45 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 138090 -> 156 [3700.6ms]
[2024-10-14 08:19:45 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:45 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160316 1160317 1160318 1160319 1160320 1160321 1160322 1160323 1160324 1160325
[2024-10-14 08:19:48 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:48 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 127962 -> 156 [3670.8ms]
[2024-10-14 08:19:48 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:48 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160326 1160327 1160328 1160329 1160330 1160331 1160332 1160333 1160334 1160335
[2024-10-14 08:19:52 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:52 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 129322 -> 156 [3707.9ms]
[2024-10-14 08:19:52 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:52 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160336 1160337 1160338 1160339 1160340 1160341 1160342 1160343 1160344 1160345
[2024-10-14 08:19:56 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:19:56 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 138518 -> 156 [3677.2ms]
[2024-10-14 08:19:56 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:19:56 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160354 1160360 1160361 1160365 1160366 1160367 1160376 1160377 1160378 1160379
[2024-10-14 08:20:00 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:20:00 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 135006 -> 156 [3723.1ms]
[2024-10-14 08:20:00 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:20:00 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160385 1160386 1160387 1160388 1160389 1160393 1160394 1160406 1160408 1160410
[2024-10-14 08:20:03 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:20:03 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 139082 -> 156 [3764.5ms]
[2024-10-14 08:20:03 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:20:03 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160413 1160422 1160423 1160426 1160428 1160431 1160437 1160440 1160444 1160446
[2024-10-14 08:20:07 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:20:08 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 102554 -> 156 [3986.0ms]
[2024-10-14 08:20:08 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:20:08 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160447 1160450 1160453 1160467 1160469 1160470 1160475 1160476 1160481 1160492
[2024-10-14 08:20:11 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 10 returned tasks (10 successful, 0 failed, 0 rejected).
[2024-10-14 08:20:11 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 134014 -> 156 [3769.6ms]
[2024-10-14 08:20:11 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Received completed tasks from slurm_manager-slclogin1-22379f94-5caf-4ef2-85f3-31482d18ac95.
[2024-10-14 08:20:11 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket:     Task ids: 1160496 1160497 1160498 1160499 1160500 1160501 1160504 1160505
[2024-10-14 08:20:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: not waiting, found possible job to run
[2024-10-14 08:20:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:20:13 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:20:13.817907+00:00
[2024-10-14 08:20:13 MDT] (Process-2       )     INFO: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: running job geolocate_accesses id=4120023 scheduled_date=2024-10-14 14:20:13.656056+00:00
[2024-10-14 08:20:13 MDT] (Process-2       )  WARNING: qcfractal.components.serverinfo.socket: GeoIP2 database file not found. Cannot add location data to accesses. May need to wait for the updater job to run
[2024-10-14 08:20:14 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs
[2024-10-14 08:20:14 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: checking for jobs before date 2024-10-14 14:20:14.069162+00:00
[2024-10-14 08:20:14 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: no jobs found
[2024-10-14 08:20:14 MDT] (Process-2       )    DEBUG: internal_job_runner:b50e9178-6d62-453e-8455-9567227af5d8: found future job scheduled for 2024-10-14 14:20:31.944013+00:00, waiting up to 17.73 seconds
[2024-10-14 08:20:14 MDT] (Process-1       )     INFO: qcfractal.components.tasks.socket: Processed 8 returned tasks (8 successful, 0 failed, 0 rejected).
[2024-10-14 08:20:15 MDT] (Process-1       )    DEBUG: fractal_flask_app: POST compute: 121540 -> 138 [3027.3ms]

@bennybp
Copy link
Contributor

bennybp commented Oct 16, 2024

Thanks for the logs. I have a bit of a hypothesis now.

The manager code is fairly serial. Updates (returning finished tasks and claiming new tasks) happen on a regular schedule (controlled by update frequency). If these updates take too long, the manager may actually miss sending its heartbeat and the server will assume it is dead.

Here's what I see: For some reason, the time it takes to claim or return tasks is a bit long (which is something I need to look into more).

[2024-10-14 08:19:41 MDT] (Process-1 ) DEBUG: fractal_flask_app: POST compute: 130562 -> 156 [3711.1ms]

That's 3.7 seconds to return 10 tasks. The size of the data sent to the server is 130,562 bytes. This is pretty long for so little data.

So putting it together, it looks like you have lots and lots of small, fast jobs. You are limited to claiming 200 tasks in a single request to the server, and returning 10 in a single request (these are configurable - see manager_tasks_claim and manager_tasks_return)). These limits are per-request, and the manager will automatically make multiple requests (see here) serially until everything is returned. But this blocks the heartbeat from being sent!

Returning tasks or claiming tasks did not count as a heartbeat, but I have a PR #851 that does that. So that should help. But there's still an underlying issue of returning & claiming being kind of slow.

This is still somewhat a hypothesis, but seems logical from what I am seeing in the logs. Or it could be something completely different :)

@FNTwin
Copy link
Author

FNTwin commented Oct 16, 2024

I agree with everything you pointed out (and yes, in this logs I'm calculating some xtb semi empirical values so it is indeed quite fast).

You are limited to claiming 200 tasks in a single request to the server, and returning 10 in a single request (these are configurable - see manager_tasks_claim and manager_tasks_return)).

I already tried beforehand to modify the manager claim and return and I noticed that the API request increase quite a bit. As now, I m claiming 400 taks and returning 20, I'll update soon in the issue the time increase.

Returning tasks or claiming tasks did not count as a heartbeat, but I have a PR #851 that does that. So that should help. But there's still an underlying issue of returning & claiming being kind of slow.

Thank you for pointing this out. I'm now using this branch and update with the results on this.

I'm also testing things out on the parsl side and switching type of launchers and modifying the DataFlowKernel a bit. I'll soon update here with some extra infos and results.

Side note, I should probably start helping a bit with this project 😅

@FNTwin
Copy link
Author

FNTwin commented Oct 17, 2024

I overrided min_blocks in the executor config and switched to a Srun launcher and for now this seems to have a positive impact (no worker getting randomly removed).

@bennybp
Copy link
Contributor

bennybp commented Oct 17, 2024

Sounds like that might help with the Parsl issues. Let me know how it goes!

I've started looking at the task claiming/returning code to see if I can speed it up. I definitely see a way to reduce the time to claim tasks (hopefully by an order of magnitude or more), but that requires a database change. Task returning code might be more difficult, but should also be doable.

@FNTwin
Copy link
Author

FNTwin commented Oct 17, 2024

Sounds like that might help with the Parsl issues. Let me know how it goes!

So far it is looking good. Some slight issues on preemption automatic requeue but the worker are not getting killed.

I've started looking at the task claiming/returning code to see if I can speed it up. I definitely see a way to reduce the time to claim tasks (hopefully by an order of magnitude or more), but that requires a database change. Task returning code might be more difficult, but should also be doable.

Is there any way I can help with it? I am also wondering if trying AlloyDB is worth it instead of a normal PostGresQL.
Also these problems are not really a blocker for real and expensive DFT calculations where the single point wall time > query retrival or submission I guess

@bennybp
Copy link
Contributor

bennybp commented Oct 18, 2024

Any sort of benchmarking would be helpful. I made a quick attempt at getting parts of the server to run with cProfile, but it doesn't run well since waitress seems to cause it to swallow the profiling info. I didn't look too much at it

Attempt here: https://github.com/MolSSI/QCFractal/commits/bench/

For claiming, the issue is that the query is just not optimal (mostly due to the filtering by waiting on the base record table). But I have a fix in mind, which won't take long. Would be nice if you could test it when it's ready (by Monday or so).

Also these problems are not really a blocker for real and expensive DFT calculations where the single point wall time > query retrival or submission I guess

Yes exactly, although it's always nice to see people push the boundaries so that it can be improved!

AlloyDB is worth it instead of a normal PostGresQL

I don't have any experience with that. My guess is that it might not help (because it's the query that's just inefficient) but would be nice to see if QCFractal works on that. It's fairly restricted to things compatible with Postgresql, so AlloyDB might work.

@FNTwin
Copy link
Author

FNTwin commented Oct 18, 2024

For claiming, the issue is that the query is just not optimal (mostly due to the filtering by waiting on the base record table). But I have a fix in mind, which won't take long. Would be nice if you could test it when it's ready (by Monday or so).

Yes that will be done instantly. I'm already running the qcfractal PR for the hearthbeat with some slight modifications.

Any sort of benchmarking would be helpful. I made a quick attempt at getting parts of the server to run with cProfile, but it doesn't run well since waitress seems to cause it to swallow the profiling info. I didn't look too much at it
Attempt here: https://github.com/MolSSI/QCFractal/commits/bench/

In my profiling I used py-spy and it worked ok-ish. I am still trying to have a complete profile and will keep you updated.
Complete as profiling both the qcfractal operations, the qcfractalcompute and all the other more important operations.

Yes exactly, although it's always nice to see people push the boundaries so that it can be improved!

With some time i will also push the number of entries to >1B

@bennybp
Copy link
Contributor

bennybp commented Oct 22, 2024

I've improved the claim behavior in #852 if you would like to try it out and see how much it improves that side!

Next is improving when managers return tasks, which is going to be even a bit difficult. But judging from the times above, also pretty important.

@FNTwin
Copy link
Author

FNTwin commented Oct 22, 2024

Thank you a lot @bennybp ! I will migrate the db and start some testing + profiling with the new branch

Edit: upgraded the db and now submitting some extra jobs

@FNTwin
Copy link
Author

FNTwin commented Oct 23, 2024

With the new branch I'm getting timed out while submitting new calculations (chunked around 35k entries and I keep getting timedout after a bunch of those submissions):

File "/mnt/ps/home/CORP/cristian.gabellini/conda/envs/qmodd/lib/python3.11/site-packages/qcportal/client_base.py", line 339, in _request
    r = self._req_session.send(prep_req, verify=self._verify, timeout=self.timeout)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/ps/home/CORP/cristian.gabellini/conda/envs/qmodd/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/ps/home/CORP/cristian.gabellini/conda/envs/qmodd/lib/python3.11/site-packages/requests/adapters.py", line 713, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='xxxx', port=7777): Read timed out. (read timeout=300)

I'll downgrade the db to check that it is not related to the branch but from some small tests there is a def a speedup in the claiming of the tasks. I'll follow up with proper benchmarking to report the time gain.

@bennybp
Copy link
Contributor

bennybp commented Dec 10, 2024

Are you still getting those timeouts? It's almost certainly unrelated to my recent changes.

I have added some internal batching before, but maybe I missed somewhere that could use it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants