Make the 5 minute timeout configurable #121
The mount timeout was only 1 min, even though the log message said it was 5 mins. Fixes: #122. It seems 5 mins isn't long enough for some ansible runs, even when you increase the ansible forking to equal the number of DAC nodes. For now, bump this to 10 mins, but leave bug #121 open for the config idea.
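As a rough illustration of that config idea, the hardcoded duration could instead be read from the environment with a sensible default. This is only a sketch; DAC_ANSIBLE_TIMEOUT is a hypothetical variable name, not an existing dacd option:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// timeoutFromEnv returns the duration stored in the named environment
// variable, falling back to def when it is unset or unparsable.
func timeoutFromEnv(name string, def time.Duration) time.Duration {
	if raw, ok := os.LookupEnv(name); ok {
		if d, err := time.ParseDuration(raw); err == nil {
			return d
		}
	}
	return def
}

func main() {
	// Keep 10 mins as the default, but allow e.g. DAC_ANSIBLE_TIMEOUT=30m
	// to override it without a rebuild.
	timeout := timeoutFromEnv("DAC_ANSIBLE_TIMEOUT", 10*time.Minute)
	fmt.Println("ansible timeout:", timeout)
}
```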
Hello, I appear to be hitting this issue on a larger stage-in.
In my Slurm config I increased the StageInTimeout and StageOutTimeout values, but I am assuming these have no impact?
A 5 minute timeout doesn't fit my use case, making DACC unfit for purpose. Is there any way to set the timeout values in the dacd or Slurm burst buffer config? Regards,
Edit: I installed from the data-acc-v2.6.tgz release, where the timeout should be 10 minutes - has there been a regression on this commit?
Thanks for your feedback. Totally makes sense to make this configurable. I am more than happy to review patches to help with that. We don't have anyone funding further development of this right now, otherwise I would look into that patch myself. The Slurm config you have there sounds correct. Certainly Slurm can decide to give up waiting for the dacctl call independently of the DAC timing out, which is currently hardcoded. In v2.6 the ansible timeout has increased to 10 mins, but the SSH command timeout is still 5 mins.
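For context, timing out an external command in Go usually looks something like the sketch below. This is an illustration of the pattern only, not the project's actual ansible.go code; the 5 minute value stands in for the hardcoded constant:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs an external command (for example the ssh call used
// during mounts) and kills it once the timeout expires.
func runWithTimeout(timeout time.Duration, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return out, fmt.Errorf("%s timed out after %s", name, timeout)
	}
	return out, err
}

func main() {
	// A hardcoded 5 minute limit; making this value come from
	// configuration is exactly what this issue asks for.
	out, err := runWithTimeout(5*time.Minute, "sleep", "1")
	fmt.Println(string(out), err)
}
```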
I agree the best approach is to make the above configurable. Moreover, I think only the copy command will want to increase the timeout, as the other commands using this code really want a shorter timeout (with a separate configuration). The test users we worked with made almost no use of the copy functionality; I believe they generally wanted more control, so they did the copy work inside their job scripts instead. It is nice to hear about people using the copy feature. It is worth knowing this uses only a basic single node "rsync" copy, and doesn't attempt a more aggressive parallel copy, the idea being that the DAC shouldn't apply too much pressure on the typically slower filesystem it will be copying from.
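One way to express that split, purely as a sketch (the struct and field names are invented, not part of dacd today): give the copy/rsync path its own configurable timeout and keep a separate, shorter one for mount/format style commands.

```go
package main

import (
	"fmt"
	"time"
)

// CommandTimeouts is a hypothetical configuration split: data copies get
// a long, independently tunable limit, everything else stays short.
type CommandTimeouts struct {
	Copy  time.Duration // stage_in / stage_out rsync runs
	Other time.Duration // mount, umount, format, etc.
}

// timeoutFor picks the limit to apply for a given command class.
func (t CommandTimeouts) timeoutFor(isCopy bool) time.Duration {
	if isCopy {
		return t.Copy
	}
	return t.Other
}

func main() {
	// Illustrative defaults only; the real values would come from dacd
	// configuration once such options exist.
	t := CommandTimeouts{Copy: 60 * time.Minute, Other: 5 * time.Minute}
	fmt.Println("copy:", t.timeoutFor(true), "other:", t.timeoutFor(false))
}
```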
Hello, Thanks for the quick reply. It sounds like a tactical workaround is to create a buffer pool with a small amount of data and then have the first step of the job copy the data in, rather than doing all the data transfer through the buffer API. I will suggest my client use that as a quick fix; however, it is a little clunky given the capability of the burst buffer. The NFS storage is indeed far slower than a parallel file system, having been purposely under-specced with the expectation that the DAC will run all the high speed parallel transactions on compute nodes during job runtime. I'll be keen to see how this develops towards a longer term strategic fix with configurable variables in dacd.conf. Regards,
There is a hardcoded timeout of 5 minutes:
data-acc/internal/pkg/filesystem_impl/ansible.go, line 243 in fcf9efe
We hit this timeout occasionally so it would be nice to give it a bit more time.
Thanks