This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

Make the 5 minute timeout configurable #121

Open
jsteel44 opened this issue Jan 10, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@jsteel44
Contributor

There is a hardcoded timeout of 5 minutes:

timer := time.AfterFunc(time.Minute*5, func() {

We hit this timeout occasionally so it would be nice to give it a bit more time.
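For illustration, a minimal sketch of one way this could be made configurable; the DAC_SSH_TIMEOUT_MINUTES environment variable and fallback handling below are assumptions for the sake of the example, not an existing data-acc option:

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

// sshTimeout returns the timeout to use for remote ssh runs,
// falling back to the current 5 minute default when the
// (hypothetical) DAC_SSH_TIMEOUT_MINUTES variable is unset or invalid.
func sshTimeout() time.Duration {
	if v := os.Getenv("DAC_SSH_TIMEOUT_MINUTES"); v != "" {
		if mins, err := strconv.Atoi(v); err == nil && mins > 0 {
			return time.Duration(mins) * time.Minute
		}
		log.Printf("ignoring invalid DAC_SSH_TIMEOUT_MINUTES=%q", v)
	}
	return 5 * time.Minute
}

func main() {
	timeout := sshTimeout()
	timer := time.AfterFunc(timeout, func() {
		log.Printf("Time up, waited more than %s to complete.", timeout)
		// ... kill the remote command here, as the existing code does ...
	})
	defer timer.Stop()
	// ... run the remote ssh command ...
}
```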

Thanks

jsteel44 added the enhancement label on Jan 10, 2020
JohnGarbutt added a commit that referenced this issue Jan 15, 2020
The mount timeout was only 1 min, even though the log message said
it was 5 mins. Fixes: #122.

It seems 5 mins isn't long enough for some ansible runs, even when
you increase the ansible forking to equal the number of DAC nodes.
For now, bump this to 10 mins, but leave bug #121 open for the config
idea.
@ocfmatt

ocfmatt commented Jan 20, 2021

Hello,

I appear to be hitting this issue on a larger stage-in.

dacd: Time up, waited more than 5 mins to complete.
dacd: Error in remote ssh run: 'bash -c "export DW_JOB_STRIPED='/mnt/dac/206543_job/global' && sudo -g '#1476600005' -u '#1476600005' rsync -r -ospgu --stats /path/to/app/stagein/ \$DW_JOB_STRIPED/"' error: signal: killed

In my Slurm config I increased the StageInTimeout and StageOutTimeout values but I am assuming these have no impact?

scontrol show burst
Name=datawarp DefaultPool=default Granularity=1500GiB TotalSpace=180000GiB FreeSpace=180000GiB UsedSpace=0
  Flags=EnablePersistent,PrivateData
  StageInTimeout=3600 StageOutTimeout=3600 ValidateTimeout=1200 OtherTimeout=1200
  GetSysState=/usr/local/bin/dacctl
  GetSysStatus=/usr/local/bin/dacctl

A 5 minute timeout doesn't fit my use case, making the DAC unfit for purpose. Is there any way to set the timeout values in the dacd or Slurm burst buffer config?

Regards,
Matt.

Edit: I installed from the data-acc-v2.6.tgz release where the timeout should be 10 minutes - has there been a regression on this commit?

@JohnGarbutt
Collaborator

Thanks for your feedback.

Totally makes sense to make this configurable. I am more than happy to review patches to help with that. We don't have anyone funding further development of this right now, otherwise I would look into that patch myself.

The Slurm config you have there sounds correct. Certainly Slurm can decide to give up waiting for the dacctl call independently of the DAC timing out, which is currently hardcoded.

In v2.6 the ansible timeout was increased to 10 mins, but the SSH command timeout is still 5 mins:

timer := time.AfterFunc(time.Minute*5, func() {

I agree the best approach is to make the above configurable. Moreover, I think only the copy command will want a longer timeout, as the other commands using this code really want a shorter one (with a separate configuration setting).
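As a rough sketch of that shape, assuming a hypothetical per-command timeout parameter on the ssh runner rather than the current data-acc API; the hostnames, commands, and values are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runRemote runs a command on a remote host via ssh and kills it once
// the given timeout expires.
func runRemote(hostname, command string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "ssh", hostname, "bash", "-c", command)
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("error in remote ssh run: %q error: %s", command, err)
	}
	return nil
}

func main() {
	// The copy gets a longer, configurable timeout; quick commands such as
	// mounts keep a short one. Both values would come from dacd configuration.
	copyTimeout := 60 * time.Minute
	otherTimeout := 5 * time.Minute

	_ = runRemote("dac1", `rsync -r -ospgu --stats /source/ "$DW_JOB_STRIPED/"`, copyTimeout)
	_ = runRemote("dac1", "mount ...", otherTimeout)
}
```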

The test users we worked with made almost no use of the copy functionality; I believe they generally wanted more control, so they did the copy work inside their job scripts instead. It is nice to hear about people using the copy feature. It is worth knowing this uses only a basic single-node "rsync" copy, and doesn't attempt a more aggressive parallel copy, with the idea that the DAC shouldn't apply too much pressure on the typically slower filesystem it will be copying from.

@ocfmatt

ocfmatt commented Jan 20, 2021

Hello,

Thanks for the quick reply.

It sounds like a tactical workaround is to create a buffer pool with a small amount of data and then make the first step of the job copy the data in, rather than having all the data transfer go through the buffer API. I will suggest my client use that as a quick fix; however, it is a little clunky given the capability of the burst buffer.
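As a rough sketch of that workaround, assuming DataWarp-style #DW directives and reusing the rsync flags from the log above; the capacity, time limit, and paths are placeholders, not a tested script:

```bash
#!/bin/bash
#SBATCH --time=04:00:00
# Request the burst buffer without a stage_in directive; capacity is a placeholder.
#DW jobdw type=scratch access_mode=striped capacity=1500GiB

# Do the stage-in as the first job step, so the copy runs under the job's
# own time limit instead of dacd's hardcoded ssh timeout.
rsync -r -ospgu --stats /path/to/app/stagein/ "$DW_JOB_STRIPED/"

# ... rest of the job ...
```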

The NFS storage is indeed far slower than a parallel file system, having been purposely under-specced with the expectation that the DAC will run all the high-speed parallel transactions on compute nodes during job runtime.

I'll be keen to see how this develops into a longer-term strategic fix with configurable variables in dacd.conf.

Regards,
Matt.
