Right now, we are using db-operator with the ClusterAdmin role, which is overkill.
We need to go through the code, work out exactly which roles are actually required, and update binding.yaml accordingly.
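For illustration, a minimal sketch of what a narrower binding.yaml could look like, assuming the operator only needs to manage Pods, ConfigMaps, PVCs, Secrets and batch Jobs in its own namespace; the resource list, namespace and service account names are assumptions, not the result of the code audit this issue asks for:

```yaml
# Sketch only: resources/verbs below are assumptions and must be confirmed
# by auditing the operator code before replacing ClusterAdmin.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: db-operator
  namespace: ocean-compute            # assumed namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps", "persistentvolumeclaims", "secrets"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: db-operator
  namespace: ocean-compute            # assumed namespace
subjects:
  - kind: ServiceAccount
    name: db-operator                 # assumed service account name
    namespace: ocean-compute
roleRef:
  kind: Role
  name: db-operator
  apiGroup: rbac.authorization.k8s.io
```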
Enforce validUntil if it does not exist
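A minimal sketch of the intended behaviour, assuming validUntil arrives as a Unix timestamp in the compute job request; the field access and the DEFAULT_JOB_TTL_SECONDS setting are illustrative assumptions, not the actual operator-engine code:

```python
import time

# Assumed default TTL for illustration; the real name/value would come from config.
DEFAULT_JOB_TTL_SECONDS = 24 * 60 * 60


def enforce_valid_until(job_request: dict) -> dict:
    """Fill in validUntil with a default expiry when the request omits it."""
    if not job_request.get("validUntil"):
        job_request["validUntil"] = int(time.time()) + DEFAULT_JOB_TTL_SECONDS
    return job_request
```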
An OOMKilled algorithm job is stranded forever and blocks further new jobs from the same wallet
Describe the bug
When an algorithm job is OOMKilled because the allocated memory was not enough for the whole process, the algorithm job is left stranded in the namespace.
The associated ConfigMap and other Kubernetes objects are not removed either.
In the database, the jobs table's status column stays at 40: Running algorithm forever.
As a result, whenever operator-engine calls the db function announce_and_get_sql_pending_jobs, it keeps returning the OOMKilled algorithm job, and no new job from this wallet can be started.
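For illustration, a minimal sketch of how the stuck state could be detected, assuming the kubernetes Python client is in use and that a hypothetical mark_job_failed helper updates the jobs table; neither is taken from the operator-engine code:

```python
from kubernetes import client


def algorithm_pod_was_oom_killed(namespace: str, job_name: str) -> bool:
    """Return True if any container of the job's pods terminated with reason OOMKilled."""
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=f"job-name={job_name}")
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            terminated = status.state.terminated or (
                status.last_state.terminated if status.last_state else None
            )
            if terminated and terminated.reason == "OOMKilled":
                return True
    return False


# Hypothetical usage in the reconciliation loop:
# if algorithm_pod_was_oom_killed(namespace, job_name):
#     mark_job_failed(job_id, reason="OOMKilled")  # assumed helper; unblocks the wallet's queue
```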
To Reproduce
Steps to reproduce the behavior:
Set up operator-engine with the env vars configured as nCPU: 1 and ramGB: 1
Publish an algorithm and dataset whose job runs for more than 10 minutes and progressively uses extra memory (a sketch of such an algorithm follows this list)
Order and start the compute job
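For step 2, a minimal sketch of an algorithm that reproduces the condition by gradually allocating memory until the 1 GB limit is exceeded; the allocation rate and duration are illustrative assumptions:

```python
import time

# Gradually grow an in-memory buffer so the container eventually exceeds
# its 1 GB limit and gets OOMKilled partway through the run.
hog = []
for minute in range(15):                       # run longer than 10 minutes
    hog.append(bytearray(100 * 1024 * 1024))   # hold ~100 MB more each minute
    print(f"minute {minute}: holding ~{len(hog) * 100} MB", flush=True)
    time.sleep(60)
```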
Expected behavior
The job's pod is killed gracefully and the next subsequent job is able to run.
Taints selector for pods
Zip outputs folder
Right now, pod-publishing can upload files in /data/outputs/, but not files in sub-folders (e.g. /data/outputs/results/1.jpg).
Also, if you have a lot of files, it's cumbersome to download them one by one.
Let's zip the entire /data/outputs/ folder and upload it to storage, so the user only has to download one file.
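A minimal sketch of the zip step, assuming the outputs live under /data/outputs/ as described; upload_to_storage is a placeholder for whatever upload call pod-publishing already uses, not an existing function:

```python
import shutil

OUTPUTS_DIR = "/data/outputs"

# Recursively archive /data/outputs/ (including sub-folders like results/) into a single file.
archive_path = shutil.make_archive("/tmp/outputs", "zip", root_dir=OUTPUTS_DIR)

# upload_to_storage(archive_path)  # placeholder for the existing upload mechanism
```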