C2D.2: things to consider from old issues #429

Open
mihaisc opened this issue May 14, 2024 · 0 comments
Labels
Type: Enhancement New feature or request

mihaisc commented May 14, 2024

  • Check pod queue in case of lack of resources
  • How to deal with ErrImagePull for algo
  • Kubernetes privileges

Right now, we are running db-operator with the ClusterAdmin role, which is overkill.
We need to go through the code, determine exactly which permissions it needs, and update binding.yaml accordingly.
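
Once the code review yields a list of the API calls actually made, a namespaced Role can be generated from it instead of binding to ClusterAdmin. The sketch below is only an illustration in Python: the role name, the namespace, and the resource/verb pairs are hypothetical placeholders, not the audited set of permissions the operator really needs.

```python
import json


def build_role(name, namespace, rules):
    """Build a namespaced RBAC Role manifest as a plain dict.

    `rules` is a list of (api_group, resources, verbs) tuples collected
    during the code review; "" denotes the core API group.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [group],
                "resources": sorted(resources),
                "verbs": sorted(verbs),
            }
            for group, resources, verbs in rules
        ],
    }


# Hypothetical name, namespace, and permissions, for illustration only.
example_role = build_role(
    "operator-engine-role",
    "ocean-compute",
    [
        ("", ["pods", "configmaps"], ["create", "get", "list", "delete"]),
        ("batch", ["jobs"], ["create", "get", "list", "delete"]),
    ],
)

# JSON is valid YAML, so this output can be pasted into a manifest file.
print(json.dumps(example_role, indent=2))
```

Keeping the rules as data makes it easy to diff the requested permissions against what the code review found before touching binding.yaml.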

  • Enforce validUntil if it does not exist
  • An OOMKilled algorithm job is left abandoned forever and blocks any new job from the same wallet

Describe the bug
When an algorithm job is OOMKilled because the allocated memory was not enough for the whole process, the job is left abandoned in the namespace.
The associated configmap and other Kubernetes objects are not removed either.
In the database table jobs, the status column stays at 40: Running algorithm forever.
As a result, whenever operator-engine calls the db function announce_and_get_sql_pending_jobs, it always returns the OOMKilled algorithm job, and no new job from this wallet can be started.

To Reproduce
Steps to reproduce the behavior:

Set up operator-engine with the env vars configured as nCPU: 1 and ramGB: 1
Publish an algorithm and a dataset that will run for more than 10 minutes and progressively use more memory
Order and start the compute job

Expected behavior
The job's pod is killed gracefully and the next job is able to run.
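
One possible way to stop such jobs from blocking the queue is to inspect the pod status and recognise terminal failures. The Python sketch below works on the JSON form of a Pod object and checks the standard status fields for OOMKilled, for the ErrImagePull case, and for pods stuck in Pending because they cannot be scheduled, which covers three of the items listed earlier. The function name and the returned labels are assumptions made for illustration; which status code to write back to the jobs table and how to clean up the configmap are left out, since they are not specified here.

```python
def classify_pod_failure(pod):
    """Return a short failure label for a Pod dict, or None if none is found.

    `pod` is the JSON form of a Kubernetes Pod object, for example the
    parsed output of `kubectl get pod <name> -o json`.
    """
    status = pod.get("status", {})

    for container in status.get("containerStatuses", []):
        state = container.get("state", {})

        # The container was killed for exceeding its memory limit.
        terminated = state.get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            return "OOMKilled"

        # The image could not be pulled (first failure or back-off).
        waiting = state.get("waiting") or {}
        if waiting.get("reason") in ("ErrImagePull", "ImagePullBackOff"):
            return "ImagePullFailed"

    # The scheduler could not place the pod, e.g. for lack of resources.
    if status.get("phase") == "Pending":
        for condition in status.get("conditions", []):
            if (
                condition.get("type") == "PodScheduled"
                and condition.get("status") == "False"
                and condition.get("reason") == "Unschedulable"
            ):
                return "Unschedulable"

    return None


# Hand-written example status, not taken from a real cluster.
oom_pod = {
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {"state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
        ],
    }
}
print(classify_pod_failure(oom_pod))
```

A non-None result would be the signal to move the job out of status 40 and delete its Kubernetes objects, so that announce_and_get_sql_pending_jobs stops returning it.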

  • Taints selector for pods
  • Zip outputs folder

Right now, pod-publishing can upload files in /data/outputs/, but not files in sub-folders (e.g. /data/outputs/results/1.jpg).

Also, if you have a lot of files, it's cumbersome to download them one by one.

Let's zip the entire /data/outputs/ folder and upload it to storage, so the user only has to download one file.
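
A minimal sketch of the zipping step, written in Python with only the standard library. It walks the outputs folder recursively so files in sub-folders are included, and stores paths relative to the folder. The function name and the archive location are assumptions; the upload to storage is not shown.

```python
import zipfile
from pathlib import Path


def zip_outputs(outputs_dir, archive_path):
    """Zip every file under outputs_dir, recursively, into archive_path.

    Names inside the archive are relative to outputs_dir, so a file at
    results/1.jpg keeps that path. Returns the list of archived names.
    """
    root = Path(outputs_dir)
    archive = Path(archive_path).resolve()
    names = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and the archive itself if it lives under root.
            if path.is_file() and path.resolve() != archive:
                name = path.relative_to(root).as_posix()
                zf.write(path, arcname=name)
                names.append(name)
    return names
```

For the case described above, the call could look like `zip_outputs("/data/outputs", "/data/outputs.zip")`, where the archive path is a hypothetical choice.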

@mihaisc mihaisc added the Type: Enhancement New feature or request label May 14, 2024
@alexcos20 alexcos20 added this to the New C2D milestone Jul 5, 2024