Different use case #49
I envisaged that for heavy compute power stuff, i.e. needing distributed compute, or a bigger server/GPU for certain models or certain validations or something, the python thread running kotsu should "dispatch" a request/trigger to something else. We do this in other projects, e.g. where we use airflow workers for running DAGs: the tasks dispatch to some big cloud compute/cluster to churn through the stuff, and then the airflow worker receives the response back from the server. The only bit which doesn't work for this is storing artifacts during model/validation runs. I guess if we got models that beasty, then we'd probs be using a cloud bucket for storing stuff anyways. Would those ideas solve the use case, you reckon? Would be sick to try it on such a use case if you have one already to see how it goes - obvs I can help out.
Yeah, this sounds like a possibility. I guess e.g. the inner block of the run function is something that can be distributed, though the results df handling might also become tricky? Would there be a way to do it that required minimal additional code from the user? Would it involve a 'preferred' interface for communicating with the cloud server? In my personal usage it's actually harder, in that I don't think I have a way to programmatically distribute work on the compute cluster I use (we have a job scheduling system which prevents jobs from triggering other jobs; I think this could potentially be resolved by having some remote server maintain a queue which individual jobs could then query, but that sounds probably out of scope).
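To make the "distribute the inner block of the run function" idea concrete, here's a rough sketch. None of this is kotsu's actual API: `run_validation_model` is a hypothetical stand-in for the work done per validation-model pair, and the `ThreadPoolExecutor` is just a placeholder for whatever cluster-backed executor would do the real dispatching.

```python
"""Sketch: fan out the inner block of the run loop over an executor,
then gather per-run results back for the results df handling."""
from concurrent.futures import ThreadPoolExecutor  # stand-in for a cluster-backed executor


def run_validation_model(validation_id: str, model_id: str) -> dict:
    # Hypothetical stand-in for running one validation-model pair;
    # in kotsu this would be the work inside the run loop.
    return {"validation_id": validation_id, "model_id": model_id, "metric": 0.0}


def run_distributed(validation_ids, model_ids, max_workers=4):
    # Submit every validation-model combination, then block to gather
    # the results that the results-df build would normally consume.
    pairs = [(v, m) for v in validation_ids for m in model_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_validation_model, v, m) for v, m in pairs]
        return [f.result() for f in futures]


results = run_distributed(["val_a"], ["model_1", "model_2"])
```

The tricky part flagged above still applies: gathering via `f.result()` means holding connections open until everything finishes, which is exactly what doesn't work on schedulers that forbid jobs triggering jobs.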
Been having a think about this. A few thoughts:
Is this at all what you were thinking of? Or do your models need distributed compute even just to train a single one at a time?
Oh interesting, I think there might be cases where you want to distribute everything. E.g. imagine you have a model which takes a day to train and evaluate (i.e. to run a single 'validation', even if using a single train/val split, no folds). And say you want to run 10 different hyperparameter settings of that model on the same 'validation'.
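For that case, each hyperparameter setting really wants to be its own job. A small hypothetical helper (not part of kotsu; all names are mine) that expands a grid into independent job specs, one per setting, which could then be dispatched separately rather than looped over in one process:

```python
# Hypothetical helper: one independent job spec per hyperparameter
# setting, so each day-long run can be submitted as its own job.
from itertools import product


def hyperparam_jobs(validation_id, model_name, grid):
    keys = sorted(grid)
    jobs = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model_id = model_name + "-" + "-".join(f"{k}={v}" for k, v in params.items())
        jobs.append(
            {"validation_id": validation_id, "model_id": model_id, "params": params}
        )
    return jobs


jobs = hyperparam_jobs("val_a", "slow_model", {"lr": [0.1, 0.01], "depth": [2, 4, 8]})
# 6 independent jobs, one per (lr, depth) combination
```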
Yea for reals, that's something I was thinking about after writing that message also. Having thought about it some more, I think the following could be a simple(ish?) way of achieving this - that "this" being: parallel/distributed running of the validation-model combinations. Solution: implement, directly into … Remarks: …
Agree maintaining a connection and waiting to receive the result back is probably not viable. Is the reason this is required because we assume validations return metrics in memory? Guess there might be a slight modification that could support it still: e.g. suppose each validation instead just needs to write a single-line metrics csv, as well as any other outputs, to some results directory. Then we provide some other …

Could this then fit into the kind of solution you're suggesting?

Just to try to picture what this would involve for the user: would the idea, in the kind of situation you're imagining, be that they'd need to write a … I guess what that makes me think is that if kotsu provided a command line interface (e.g. …

I'm not sure at all whether any of this translates to other distributed compute settings (or if I've explained how I'm understanding it properly).

Also, FYI I came across a repository with a quite interesting-looking implementation of this kind of workflow, designed for cluster environments, where they abstract away the job submission via a Submitter class. In this repo each job is run by calling …
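The single-line-metrics-csv idea could look something like this sketch (all function names and the file-naming convention are hypothetical, not existing kotsu behaviour): each distributed run writes its own metrics csv into a shared results directory, and a separate collect step stitches them together after all jobs finish, replacing the in-memory results handling.

```python
"""Sketch: per-run metrics csvs plus a collect step (names hypothetical)."""
import csv
import glob
import os
import tempfile


def write_metrics(results_dir, run_id, metrics):
    # Called at the end of each remote run, instead of returning metrics in memory.
    path = os.path.join(results_dir, f"{run_id}_metrics.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["run_id", *metrics])
        writer.writeheader()
        writer.writerow({"run_id": run_id, **metrics})


def collect_results(results_dir):
    # Run once after all jobs finish; stitches the per-run csvs into one table.
    rows = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*_metrics.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


results_dir = tempfile.mkdtemp()
write_metrics(results_dir, "val_a-model_1", {"rmse": 0.3})
write_metrics(results_dir, "val_a-model_2", {"rmse": 0.5})
rows = collect_results(results_dir)  # two rows, ready to load into a df
```

A nice property is that no job ever needs to talk to another job, so it sidesteps the "jobs can't trigger jobs" scheduler restriction; a shared filesystem or cloud bucket is the only coordination point.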
Was thinking less about it as required, and more just as it could be a way of implementing dispatching and receiving results (but one we both think is not so good to try). And yea, it seemed a natural first thought, as the rest of kotsu works by receiving results directly back from the return value of the …
Yea was also thinking this. Could well be a simple(ish/enough?) solution for it.
Yes, I was imagining that the user would need to write the dispatch code within their validation. I think I'm imagining this because the premise of …
I guess a CLI is always just a …
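To picture the CLI idea floated above: a thin argparse wrapper around the python API, which cluster job scripts could then invoke directly. This is purely hypothetical - kotsu doesn't currently ship a CLI, and the `kotsu run` subcommand and its flags are made up for illustration.

```python
# Hypothetical `kotsu run` CLI: each cluster job would invoke one
# validation-model pair via the command line instead of python code.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="kotsu")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="run a single validation-model pair")
    run.add_argument("--validation", required=True)
    run.add_argument("--model", required=True)
    run.add_argument("--results-dir", default="./results")
    return parser


# e.g. a job script would execute: kotsu run --validation val_a --model model_1
args = build_parser().parse_args(["run", "--validation", "val_a", "--model", "model_1"])
```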
Nice this looks interesting.
From a glance at this, it looks nice. So ye, I guess we could tie kotsu (with an abstract class beforehand) to one/some of https://slurm.schedmd.com/overview.html etc.? Or could one submit the …
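The abstract-class-plus-backends shape could be sketched like this (names are mine, not from that repo or kotsu): a `Submitter` decides *how* a job command gets launched, so e.g. a Slurm backend could wrap the command in an sbatch script, while a local backend just runs it in-process for testing.

```python
# Sketch of a hypothetical Submitter abstraction over job-submission backends.
import abc
import subprocess
import sys


class Submitter(abc.ABC):
    @abc.abstractmethod
    def submit(self, command):
        """Launch one job command, e.g. a `kotsu run ...` invocation."""


class LocalSubmitter(Submitter):
    # Runs the command in-process; a SlurmSubmitter would instead write
    # the command into a batch script and hand it to the scheduler.
    def submit(self, command):
        return subprocess.run(command, check=True)


proc = LocalSubmitter().submit([sys.executable, "-c", "print('job ran')"])
```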
Currently kotsu is tailored towards benchmarking in cases where 'validations' (which may include fitting of models) are relatively inexpensive, and can be run serially in a single process.
In a lot of ML settings validations may be expensive and computation may need to be distributed
(e.g. running a set of 'validations' on a single model in a single 'job', distributing jobs across a cluster of CPUs/GPUs).
I don't know whether this setting is in scope but just had the thought so wanted to open this up in case it's something we want to think about.