
Different use case #49

Open
alex-hh opened this issue Jun 26, 2022 · 7 comments

alex-hh (Contributor) commented Jun 26, 2022

Currently kotsu is tailored towards benchmarking in cases where 'validations' (which may include fitting of models) are relatively inexpensive, and can be run serially in a single process.

In a lot of ML settings validations may be expensive and computation may need to be distributed
(e.g. running a set of 'validations' on a single model in a single 'job', distributing jobs across a cluster of CPUs/GPUs).

I don't know whether this setting is in scope but just had the thought so wanted to open this up in case it's something we want to think about.

DBCerigo (Contributor) commented

I envisaged that for heavy-compute stuff, i.e. needing distributed compute, or a bigger server/GPU for certain models or certain validations or something, the python thread running kotsu should "dispatch" a request/trigger to something else. We do this in other projects, e.g. where we use airflow workers for running DAGs, but the tasks dispatch to some big cloud compute/cluster to churn through the work, and the airflow worker then receives the response back from the server.
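For concreteness, a minimal sketch of the dispatch pattern I mean, written as a kotsu validation - `remote_client`, `submit_job`, `get_status` and `get_results` are all hypothetical stand-ins for whatever client the big compute service actually exposes:

```python
import time

def validation(model, remote_client):
    """Hypothetical kotsu validation that dispatches the heavy work elsewhere.

    The thread running kotsu never crunches anything itself; it just triggers
    the remote run and waits for the response, airflow-worker style.
    """
    job_id = remote_client.submit_job(model=model)  # trigger the remote run
    while remote_client.get_status(job_id) == "RUNNING":  # poll until done
        time.sleep(30)
    return remote_client.get_results(job_id)  # metrics dict, returned as usual
```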

The only bit which doesn't work for this is storing artifacts during model/validation runs. I guess if we got models that beasty, then we'd probs be using a cloud bucket for storing stuff anyway, and I think kotsu being able to take tentaclio-style URLs (s3://bucket/somefile.jazz) was already in the thinking as a future feature.
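If I'm remembering the tentaclio API right, artifact storage would then be a one-liner, since `tentaclio.open` dispatches on the URL scheme (the path and the artifact here are just illustrative):

```python
import pickle

import tentaclio

artifact = {"weights": [0.1, 0.2]}  # stand-in for a real fitted model
# The same call works for local paths and s3:// URLs alike.
with tentaclio.open("s3://bucket/somefile.jazz", mode="wb") as f:
    pickle.dump(artifact, f)
```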

Would those ideas solve the use case, you reckon? Would be sick to try it on such a use case if you have one already, to see how it goes - obvs I can help out.

alex-hh (Contributor, Author) commented Jun 30, 2022

Yeah this sounds like a possibility. I guess e.g. the inner block of the run function is something that can be distributed. Though the results df handling might also become tricky? Would there be a way to do it that required minimal additional code from the user? Would it involve a 'preferred' interface for communicating with the cloud server?

In my personal usage it's actually harder, in that I don't think I have a way to programmatically distribute work on the compute cluster I use (we have a job scheduling system which prevents jobs from triggering other jobs; I think this could potentially be resolved by having some remote server maintain a queue which individual jobs could then query, but that sounds probably out of scope).

DBCerigo (Contributor) commented Jul 5, 2022

Been having a think about this. A few thoughts:

  • On having the actual kotsu loop/code do distribution
    • I kinda think this is a no-go, as I can't imagine a way of implementing it that wouldn't put a significant constraint on how validation code worked.
    • Possibly something like run taking a dispatch_func param which, when not None, would be run once for each val-model combo.
  • I wasn't actually thinking about parallelising the run of val-model combos by run; I was actually thinking about parallelising the running of training and prediction on folds (assuming the val uses folds), which would be a lot easier, as one could implement whatever they want within their validation func - see the sketch below. So each val-model combo is still run sequentially, but each val-model is a lot faster by using distributed compute or something.
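To illustrate that second bullet, a minimal sketch of a fold-parallel validation - assuming a picklable, sklearn-style model where fit returns the fitted model and score returns a float:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def _fit_and_score(model, fold):
    """Fit and score on one train/test split (top-level so it pickles)."""
    train, test = fold
    return model.fit(train).score(test)  # assumed model interface

def validation(model, folds):
    """Hypothetical kotsu validation: the val-model combo still runs inside
    kotsu's sequential loop, but its folds are crunched in parallel -
    entirely the user's choice, with no change to kotsu itself."""
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(partial(_fit_and_score, model), folds))
    return {"mean_score": sum(scores) / len(scores)}
```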

Is this at all what you were thinking of? Or do your models need distributed compute even just to train a single one at a time?

alex-hh (Contributor, Author) commented Jul 6, 2022

Oh interesting - I think there might be cases where you want to distribute everything, e.g. imagine you have a model which takes a day to train and evaluate (i.e. to run a single 'validation', even using a single train/val split with no folds), and say you want to run 10 different hyperparameter settings of that model on the same 'validation'.

DBCerigo (Contributor) commented

Yea for reals, that's something I was thinking about after writing my message too. Having thought about it some more, I think the following could be a simple(ish?) way of achieving this.

That "this" is: parallel/distributed running of the validation-model combinations.
(Note, this is distinct from the parallel/distributed running of the fold-model combinations that we were talking about above.)

Solution: implement directly into kotsu, by extending run.run to take a param n_threads (or something similar); run.run would then map the validation-model combos onto those threads. It would then be up to the kotsu user to implement dispatching a job to some other compute thing to run that validation-model combo and waiting to receive the result back. Note that this would use multithreading, as it is assumed each thread is dispatching jobs, not crunching jobs itself.
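Something like this is what I'm picturing - very rough, and how the registries are iterated and how results get collected is hand-waved:

```python
from concurrent.futures import ThreadPoolExecutor

def run(model_registry, validation_registry, n_threads=1):
    """Hypothetical extension of run.run: fan the validation-model combos
    out over threads. Threads (not processes) because each one just
    dispatches a job and waits; the crunching happens elsewhere."""
    combos = [
        (validation, model)
        for validation in validation_registry.all()  # assumed iteration API
        for model in model_registry.all()
    ]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each validation here is assumed to be the dispatch-and-wait kind.
        results = pool.map(lambda vm: vm[0](vm[1]), combos)
    return list(results)
```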

Remarks:

  • For long-running validation-model combos (which is pretty much presumed for the use case we are talking about), this idea seems pretty bait, as each thread is going to have to keep some kind of connection to the compute thing the entire time the job is running, and then we're into needing some kind of connection-loss handling, or recording of which validation-model jobs have been started. Doesn't sound like it would work great as is.
  • Could be that this is all indicating that kotsu is a lightweight framework for lightweight modelling only? I'm defs not certain on that, nor do I (so far) want it to go in that direction; just thought it worth raising the idea explicitly.
  • I think we already discussed implementing something to enable multiprocessing (not threading) for when running run.run with small-ish models, or on big machines when wanting to make use of lots of CPUs etc.

alex-hh (Contributor, Author) commented Jul 25, 2022

Agree maintaining a connection and waiting to receive the result back is probably not viable. Is the reason this is required because we assume validations return metrics in memory?

Guess there might be a slight modification that could support it still, e.g. suppose each validation instead just needs to write a single-line metrics csv, as well as any other outputs, to some results directory. Then we provide some other aggregate_results util function which needs to be run manually when all the jobs are completed to get an in-memory df.
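For concreteness, something like the following is what I'm picturing (all names hypothetical):

```python
import glob
import os

import pandas as pd

def write_metrics(results_dir, model_id, validation_id, metrics):
    """Called at the end of each dispatched job: write one single-line csv."""
    out_dir = os.path.join(results_dir, f"{validation_id}_{model_id}")
    os.makedirs(out_dir, exist_ok=True)
    row = pd.DataFrame(
        [{"model_id": model_id, "validation_id": validation_id, **metrics}]
    )
    row.to_csv(os.path.join(out_dir, "metrics.csv"), index=False)

def aggregate_results(results_dir):
    """Run manually once all jobs have finished: collect rows into one df."""
    paths = glob.glob(os.path.join(results_dir, "*", "metrics.csv"))
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```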

Could this then fit into the kind of solution you're suggesting? Just to try to picture what this would involve for the user: in the kind of situation you're imagining, would the idea be that they'd need to write a validation which was in fact a dispatcher? I guess in the cluster setting the thing that seems quite tricky is that you'd typically be submitting a bash script containing a list of commands (e.g. activate your environment, run some script). So I think (?) the user would typically need to write a separate script which ran the validation for a particular model, validation combo.

I guess what that makes me think is that if kotsu provided a command line interface (e.g. kotsu update <benchmark_name> <model_id> <validation_id>) then maybe this could be simpler to achieve. (In my case I can't even programmatically submit jobs, but if I had this kind of command line interface it'd be easy to run a pair of registries - i.e. a benchmark - by converting the benchmark into a text file containing the lists of model and validation ids, and then submitting a job which read a single model, validation pair from this file and executed the corresponding command. This would also require some convention for defining benchmarks - poss similar to the sktime one - in modules in users' repositories for kotsu to read, which may or may not be a good idea.)
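To sketch the entry point such a CLI would need - the `kotsu update` name, the convention that a benchmark module exposes `model_registry`/`validation_registry`, and the gym-style `make(<id>)` call are all just illustrative:

```python
import argparse
import importlib

def main():
    """Hypothetical `kotsu update <benchmark> <model_id> <validation_id>`:
    run exactly one model-validation pair, so a cluster job script can
    activate an environment and invoke this one command."""
    parser = argparse.ArgumentParser()
    parser.add_argument("benchmark")  # e.g. "my_benchmarks.tabular"
    parser.add_argument("model_id")
    parser.add_argument("validation_id")
    args = parser.parse_args()

    benchmark = importlib.import_module(args.benchmark)  # assumed convention
    model = benchmark.model_registry.make(args.model_id)
    validation = benchmark.validation_registry.make(args.validation_id)
    print(validation(model))

if __name__ == "__main__":
    main()
```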

I'm not sure at all whether any of this translates to other distributed compute settings (or whether I've explained how I'm understanding it properly).

Also, FYI I came across a repository with a quite interesting-looking implementation of this kind of workflow, designed for cluster environments, where they abstract away the job submission via a Submitter class. In this repo each job is run by calling evcouplings_runcfg <config_file>, so this is the command that all the submitters are responsible for dispatching.

DBCerigo (Contributor) commented

> Agree maintaining a connection and waiting to receive the result back is probably not viable. Is the reason this is required because we assume validations return metrics in memory?

Was thinking of it less as required and more just that it could be a way of implementing dispatching and receiving results (but one we both think is not so good to try). And yea, it seemed a natural first thought, as the rest of kotsu works on receiving results directly back from the return value of the validation function.

> Guess there might be a slight modification that could support it still, e.g. suppose each validation instead just needs to write a single-line metrics csv, as well as any other outputs, to some results directory. Then we provide some other aggregate_results util function which needs to be run manually when all the jobs are completed to get an in-memory df.

Yea was also thinking this. Could well be a simple(ish/enough?) solution for it.

> Could this then fit into the kind of solution you're suggesting? Just to try to picture what this would involve for the user: in the kind of situation you're imagining, would the idea be that they'd need to write a validation which was in fact a dispatcher? I guess in the cluster setting the thing that seems quite tricky is that you'd typically be submitting a bash script containing a list of commands (e.g. activate your environment, run some script). So I think (?) the user would typically need to write a separate script which ran the validation for a particular model, validation combo.

Yes, I was imagining that the user would need to write the dispatch code within their validation. I think I'm imagining this because the premise of kotsu was that it was just a python package and didn't constrain users with regard to compute systems/anything else - it could be that this is a mistake, and that it could still be good and useful if we tied it to (some) certain cluster compute systems but still left the validation and model entities completely unrestricted. My feeling is that the lightweightness and generality is what makes me like kotsu. @ali-tny you got any thoughts on this?

> I guess what that makes me think is that if kotsu provided a command line interface (e.g. kotsu update <benchmark_name> <model_id> <validation_id>) then maybe this could be simpler to achieve. (In my case I can't even programmatically submit jobs, but if I had this kind of command line interface it'd be easy to run a pair of registries - i.e. a benchmark - by converting the benchmark into a text file containing the lists of model and validation ids, and then submitting a job which read a single model, validation pair from this file and executed the corresponding command. This would also require some convention for defining benchmarks - poss similar to the sktime one - in modules in users' repositories for kotsu to read, which may or may not be a good idea.)

I guess a CLI is always just a __main__ or import click away or something, so I don't think that should be a big discerning difference in functionality.
I can't quite figure out how you're picturing this dispatch version working. The main missing question I have is: how does the code get onto the server to actually run the validation-model pair? (That question applies to my own earlier ideas on this too.)

> I'm not sure at all whether any of this translates to other distributed compute settings (or whether I've explained how I'm understanding it properly).

> Also, FYI I came across a repository with a quite interesting-looking implementation of this kind of workflow, designed for cluster environments, where they abstract away the job submission via a Submitter class. In this repo each job is run by calling evcouplings_runcfg <config_file>, so this is the command that all the submitters are responsible for dispatching.

Nice this looks interesting.

* [their 'run' function](https://github.com/debbiemarkslab/EVcouplings/blob/5de7bab3b5202848ace2e16ff2b1cda5c8edfda6/evcouplings/utils/app.py#L453)

* [base cluster submitter class](https://github.com/debbiemarkslab/EVcouplings/blob/5de7bab3b5202848ace2e16ff2b1cda5c8edfda6/evcouplings/utils/batch.py#L227)

From a glance this looks nice. So ye, I guess we could tie kotsu (with an abstract class beforehand) to one/some of https://slurm.schedmd.com/overview.html etc.? Or could one submit the kotsu.run(...) call itself as a job onto the cluster and leverage all the cluster compute that way? i.e. have it reading and writing results from a cloud file.
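i.e. the submitted job script could be as small as this sketch - the benchmark module is hypothetical, and pointing results_path at a cloud URL assumes the tentaclio-style URL support mooted above, not something kotsu does today:

```python
# submit_benchmark.py - sbatch'd (or equivalent) onto the cluster, so the
# whole sequential kotsu loop runs as one long job on cluster hardware.
import kotsu

# Hypothetical benchmark module exposing the two registries.
from my_benchmarks.tabular import model_registry, validation_registry

kotsu.run(
    model_registry,
    validation_registry,
    results_path="s3://bucket/benchmark/results.csv",  # assumes cloud-URL feature
)
```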
