Describe the bug
The Hive worker creates an AppDomain for each job and stores the job's assemblies in the folder Temp/PluginTemp/{jobGuid}.
When the job is stopped (e.g. for a snapshot), the AppDomain is disposed and the folder with the assemblies is cleared. However, this does not work for native DLLs: they cannot be unloaded, so the Hive worker process keeps the native DLL locked. Trying to delete the DLL and the folder raises an exception, which is caught by the Hive worker.
The problem arises when the same job is resumed on the same worker. After downloading the job from the server, the worker tries to create the folder for the job and write the assemblies. Since the folder and the file still exist, another exception is raised (again caught by the Hive worker). As a result, the job cannot be resumed and is marked as failed on the Hive server.
To Reproduce
Steps to reproduce the behavior:
Create a GP SymReg job and set Evaluator to "Parameter Optimization Evaluator" (this uses the hl-native-interpreter plugin)
Configure the GP run so that it takes several minutes (about 10)
Run in Hive but select only a single worker
Open job manager, wait for the job to be "running" and then pause the job.
Wait for the job to be paused and resume the job
The job will be stopped with state "Failed". The error message will show a problem with "hl-native-interpreter.dll"
Proposed fix
In the Hive worker, check whether the folder for the jobGuid already exists and, if so, reuse it. Additionally, check whether the plugin files already exist in the folder and do not overwrite them. Since it is the same job, the old files can be reused.
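The proposed fix could look roughly like the sketch below. It is written in Python only to illustrate the logic and is not the actual Hive worker code; the function name `prepare_job_folder`, its parameters, and the file name `Plugin.dll` used in the comments are hypothetical. It assumes, as stated above, that the plugin files of a resumed job are identical to those already on disk.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def prepare_job_folder(base_dir: Path, job_guid: str,
                       plugin_files: Dict[str, bytes]) -> List[str]:
    """Create or reuse the per-job folder and write only the missing files.

    Returns the names of the files that were actually written.
    (Hypothetical helper; names and signature are assumptions.)
    """
    job_dir = base_dir / job_guid
    # Reuse the folder if an earlier run of the same job left it behind.
    job_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name, content in plugin_files.items():
        target = job_dir / name
        if target.exists():
            # Same job, same files: keep the existing file untouched.
            # This avoids writing to a native DLL that may still be locked.
            continue
        target.write_bytes(content)
        written.append(name)
    return written
```

On the first call for a job, all files are written. If cleanup later removes only the managed assemblies (e.g. `Plugin.dll`) but leaves `hl-native-interpreter.dll` behind, a second call for the same jobGuid writes only the missing files and never touches the leftover native DLL.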
gkronber changed the title from "Hive worker cannot process jobs which use native dlls and take longer than 18hours" to "Hive worker cannot resume jobs which use native dlls after pausing (and after automatic snapshots every 18 hours)" on Jun 29, 2022