A transfer hangs when two processes use Data Movement Library #290

Open
KamilKalitaRel opened this issue May 5, 2022 · 4 comments
Comments

KamilKalitaRel commented May 5, 2022

Which service (blob, file) does this issue concern?

Blob (not tested for file)

Which version of the SDK was used?

2.0.4

Which platform were you using? (.NET Framework version or .NET Core version, and OS version)

.NET Core (not tested on .NET Framework)

How can the problem be reproduced? It would help if the code that caused the problem could be shared.

  1. Create a console app uploading/downloading a big dataset (mine: >20k files, 15GB in total) with the recommended number of ParallelOperations. Sample code: DmLibThinClient.zip
  2. Run two instances of this app simultaneously.

What problem was encountered?

  1. After some time, one of the transfers completes, while the other hangs forever without any status updates. It gets stuck on different files.
  2. Restarting the app and running it in isolation from other usages of DMLib lets the transfer finish.
  3. Debugging the Data Movement Library showed that tasks are waiting in the while loop of FlatDirectoryTransfer.CheckAndPauseEnumeration(), where the value of outstandingTasks never changes and stays above MaxTransferConcurrency.

Have you found a mitigation/solution?

No. Restarting the transfer is also not a viable solution for us, because we use DMLib in automated workflows for long-running transfers.

@avichay-kardash

Linux image on which the test was performed: Standard D4s v3 (4 vcpus, 16 GiB memory)

@avichay-kardash

On Windows, a developer laptop with the following specifications was used:

Processor: Intel(R) Xeon(R) E-2276M CPU @ 2.80GHz 2.81 GHz
Installed RAM: 64.0 GB (63.7 GB usable)
System type: 64-bit operating system, x64-based processor
Edition: Windows 10 Enterprise
Version: 21H2
Installed on: 10/14/2021
OS build: 19044.1645
Experience: Windows Feature Experience Pack 120.2212.4170.0

@mmkowal-rel

The issue is also reproducible with a single process. It happens only occasionally, seemingly at random, but it does happen.

@KamilKalitaRel
Author

We were able to implement a workaround in our fork by adding a timeout to the wait in CheckAndPauseEnumeration(). When the timeout expires, the method throws an exception, which our app catches and handles by retrying the transfer.
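For illustration, a minimal sketch of such a bounded wait, not the actual change in the fork. `EnumerationThrottle`, `WaitForCapacity`, the 50 ms poll interval, and the exact loop condition are all hypothetical stand-ins; only the idea of giving up on a stuck `outstandingTasks` counter comes from the description above. The sketch resets its clock whenever the counter changes, which is an assumption: a plain fixed deadline would work similarly.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical stand-in for the throttling done during enumeration.
// This is NOT DMLib source code.
public sealed class EnumerationThrottle
{
    private readonly int maxOutstanding;
    private readonly TimeSpan stallTimeout;
    private int outstandingTasks;

    public EnumerationThrottle(int maxOutstanding, TimeSpan stallTimeout)
    {
        this.maxOutstanding = maxOutstanding;
        this.stallTimeout = stallTimeout;
    }

    public void TaskStarted() => Interlocked.Increment(ref this.outstandingTasks);

    public void TaskFinished() => Interlocked.Decrement(ref this.outstandingTasks);

    // Waits while too many tasks are outstanding, but throws TimeoutException
    // once the counter has not changed for longer than stallTimeout.
    public void WaitForCapacity(CancellationToken cancellationToken)
    {
        int lastSeen = Volatile.Read(ref this.outstandingTasks);
        Stopwatch sinceLastChange = Stopwatch.StartNew();

        while (true)
        {
            int current = Volatile.Read(ref this.outstandingTasks);
            if (current <= this.maxOutstanding)
            {
                return; // capacity available, enumeration may continue
            }

            cancellationToken.ThrowIfCancellationRequested();

            if (current != lastSeen)
            {
                // Progress was made: remember the new value and restart the clock.
                lastSeen = current;
                sinceLastChange.Restart();
            }
            else if (sinceLastChange.Elapsed > this.stallTimeout)
            {
                throw new TimeoutException(
                    $"outstandingTasks stayed at {current} for more than {this.stallTimeout}.");
            }

            Thread.Sleep(50); // assumed poll interval
        }
    }
}
```

The caller can catch `TimeoutException` and treat it as the signal to abandon the stuck transfer and retry it.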

But to really unblock the transfer inside the same process, we needed to do two more things after our timeout occurs:

  1. cancel the transfer, so that all hanging tasks can finish (otherwise they keep occupying the process and prevent it from generating new tasks),
  2. because the bug prevents tasks from releasing the buffers stored in MemoryManager, we changed the allocation so that buffers are allocated for a specific transfer (instead of the original implementation, where all buffers were stored in a single dictionary inside the MemoryManager class). When the timeout occurs, all buffers belonging to the stuck transfer are released.
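To make the second point concrete, here is a rough sketch of per-transfer buffer ownership. It is not the fork's code and does not mirror the real MemoryManager: `PerTransferBufferPool`, its method names, and the use of a `Guid` as the transfer key are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool that remembers which transfer rented each buffer.
// This is NOT DMLib's MemoryManager.
public sealed class PerTransferBufferPool
{
    private readonly object gate = new object();
    private readonly Stack<byte[]> free = new Stack<byte[]>();
    private readonly Dictionary<Guid, HashSet<byte[]>> rentedByTransfer =
        new Dictionary<Guid, HashSet<byte[]>>();

    public PerTransferBufferPool(int bufferCount, int bufferSize)
    {
        for (int i = 0; i < bufferCount; i++)
        {
            this.free.Push(new byte[bufferSize]);
        }
    }

    public int FreeCount
    {
        get { lock (this.gate) { return this.free.Count; } }
    }

    // Hands out a free buffer and records it under the renting transfer.
    public bool TryRent(Guid transferId, out byte[] buffer)
    {
        lock (this.gate)
        {
            if (this.free.Count == 0)
            {
                buffer = null;
                return false;
            }

            buffer = this.free.Pop();
            if (!this.rentedByTransfer.TryGetValue(transferId, out HashSet<byte[]> owned))
            {
                owned = new HashSet<byte[]>(); // arrays compare by reference
                this.rentedByTransfer[transferId] = owned;
            }

            owned.Add(buffer);
            return true;
        }
    }

    // Normal path: a task gives back a buffer it rented.
    // A return for a buffer the transfer no longer owns is ignored.
    public void Return(Guid transferId, byte[] buffer)
    {
        lock (this.gate)
        {
            if (this.rentedByTransfer.TryGetValue(transferId, out HashSet<byte[]> owned)
                && owned.Remove(buffer))
            {
                this.free.Push(buffer);
            }
        }
    }

    // Timeout path: reclaim every buffer still held by one transfer.
    public int ReleaseAll(Guid transferId)
    {
        lock (this.gate)
        {
            if (!this.rentedByTransfer.Remove(transferId, out HashSet<byte[]> owned))
            {
                return 0;
            }

            foreach (byte[] buffer in owned)
            {
                this.free.Push(buffer);
            }

            return owned.Count;
        }
    }
}
```

In this sketch `ReleaseAll` is meant to run only after the transfer has been cancelled (step 1), since a task that is still running could otherwise write into a buffer that has already been handed to another transfer. A late `Return` from a cancelled task is ignored, so a buffer cannot end up in the free list twice.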

Our tests show that it works, but please let us know if you see any risks in the above solution, because the changes turned out to be more invasive than we initially expected and than what was agreed in discussions with the Microsoft team.


3 participants