This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

Make the 5 minute timeout configurable #121

Open
jsteel44 opened this issue Jan 10, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@jsteel44
Contributor

There is a hardcoded timeout of 5 minutes:

timer := time.AfterFunc(time.Minute*5, func() {

We hit this timeout occasionally so it would be nice to give it a bit more time.
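For illustration, a minimal sketch of one way this could be made configurable; the DAC_SSH_TIMEOUT_MINUTES environment variable and fallback handling below are assumptions for the sake of the example, not an existing data-acc option:

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

// sshTimeout returns the timeout to use for remote ssh runs,
// falling back to the current 5 minute default when the
// (hypothetical) DAC_SSH_TIMEOUT_MINUTES variable is unset or invalid.
func sshTimeout() time.Duration {
	if v := os.Getenv("DAC_SSH_TIMEOUT_MINUTES"); v != "" {
		if mins, err := strconv.Atoi(v); err == nil && mins > 0 {
			return time.Duration(mins) * time.Minute
		}
		log.Printf("ignoring invalid DAC_SSH_TIMEOUT_MINUTES=%q", v)
	}
	return 5 * time.Minute
}

func main() {
	timeout := sshTimeout()
	timer := time.AfterFunc(timeout, func() {
		log.Printf("Time up, waited more than %s to complete.", timeout)
		// ... kill the remote command here, as the existing code does ...
	})
	defer timer.Stop()
	// ... run the remote ssh command ...
}
```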

Thanks

jsteel44 added the enhancement label on Jan 10, 2020
JohnGarbutt added a commit that referenced this issue Jan 15, 2020
The mount timeout was only 1 min, even though the log message said
it was 5 mins. Fixes: #122.

It seems 5 mins isn't long enough for some ansible runs, even when
you increase the ansible forking to equal the number of DAC nodes.
For now, bump this to 10 mins, but leave bug #121 open for the config
idea.
@ocfmatt

ocfmatt commented Jan 20, 2021

Hello,

I appear to be hitting this issue on a larger stage-in.

dacd: Time up, waited more than 5 mins to complete.
dacd: Error in remote ssh run: 'bash -c "export DW_JOB_STRIPED='/mnt/dac/206543_job/global' && sudo -g '#1476600005' -u '#1476600005' rsync -r -ospgu --stats /path/to/app/stagein/ \$DW_JOB_STRIPED/"' error: signal: killed

In my Slurm config I increased the StageInTimeout and StageOutTimeout values but I am assuming these have no impact?

scontrol show burst
Name=datawarp DefaultPool=default Granularity=1500GiB TotalSpace=180000GiB FreeSpace=180000GiB UsedSpace=0
  Flags=EnablePersistent,PrivateData
  StageInTimeout=3600 StageOutTimeout=3600 ValidateTimeout=1200 OtherTimeout=1200
  GetSysState=/usr/local/bin/dacctl
  GetSysStatus=/usr/local/bin/dacctl

A 5 minute timeout doesn't fit my use case, making the DAC unfit for purpose. Is there any way to set the timeout values in the dacd or Slurm burst buffer config?

Regards,
Matt.

Edit: I installed from the data-acc-v2.6.tgz release where the timeout should be 10 minutes - has there been a regression on this commit?

@JohnGarbutt
Collaborator

Thanks for your feedback.

Totally makes sense to make this configurable. I am more than happy to review patches to help with that. We don't have anyone funding further development of this right now, otherwise I would look into that patch myself.

The Slurm config you have there sounds correct. Certainly Slurm can decide to give up waiting for the dacctl call independently of the DAC timing out, which is currently hardcoded.

In v2.6 the ansible timeout was increased to 10 mins, but the SSH command timeout is still 5 mins:

timer := time.AfterFunc(time.Minute*5, func() {

I agree the best approach is to make the above configurable. Moreover, I think only the copy command will want a longer timeout, as the other commands using this code really want a shorter one (with a separate configuration setting).
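As a rough sketch of that shape, assuming a hypothetical per-command timeout parameter on the ssh runner rather than the current data-acc API; the hostnames, commands, and values are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runRemote runs a command on a remote host via ssh and kills it once
// the given timeout expires.
func runRemote(hostname, command string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "ssh", hostname, "bash", "-c", command)
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("error in remote ssh run: %q error: %s", command, err)
	}
	return nil
}

func main() {
	// The copy gets a longer, configurable timeout; quick commands such as
	// mounts keep a short one. Both values would come from dacd configuration.
	copyTimeout := 60 * time.Minute
	otherTimeout := 5 * time.Minute

	_ = runRemote("dac1", `rsync -r -ospgu --stats /source/ "$DW_JOB_STRIPED/"`, copyTimeout)
	_ = runRemote("dac1", "mount ...", otherTimeout)
}
```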

The test users we worked with made almost no use of the copy functionality; I believe they generally wanted more control, so they did the copy work inside their job scripts instead. It is nice to hear about people using the copy feature. It is worth knowing this uses only a basic single-node "rsync" copy, and doesn't attempt a more aggressive parallel copy, with the idea that the DAC shouldn't apply too much pressure on the typically slower filesystem it will be copying from.

@ocfmatt

ocfmatt commented Jan 20, 2021

Hello,

Thanks for the quick reply.

It sounds like a tactical workaround is to create a buffer pool with a small amount of data and then make the first step of the job copy the data in, rather than having all the data transfer go through the buffer API. I will suggest my client use that as a quick fix; however, it is a little clunky given the capability of the burst buffer.
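As a rough sketch of that workaround, assuming DataWarp-style #DW directives and reusing the rsync flags from the log above; the capacity, time limit, and paths are placeholders, not a tested script:

```bash
#!/bin/bash
#SBATCH --time=04:00:00
# Request the burst buffer without a stage_in directive; capacity is a placeholder.
#DW jobdw type=scratch access_mode=striped capacity=1500GiB

# Do the stage-in as the first job step, so the copy runs under the job's
# own time limit instead of dacd's hardcoded ssh timeout.
rsync -r -ospgu --stats /path/to/app/stagein/ "$DW_JOB_STRIPED/"

# ... rest of the job ...
```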

The NFS storage is indeed far slower than a parallel file system, having been purposely under-specced with the expectation that the DAC will run all the high-speed parallel transactions on compute nodes during job runtime.

I'll be keen to see how this develops into a longer-term strategic fix with configurable variables in dacd.conf.

Regards,
Matt.
