Working example with Keras #2333
My experience was that TensorFlow graphs didn't like being created in one thread and then executed in another. I'm not sure though. I think that @bnaul may have some experience here.
Anyone coming here should look at this really nice example that I found super helpful.
cc @AlbertDeFusco. I think that example doesn't use a distributed cluster yet.
Correct. I have not gotten it working with distributed.
@AlbertDeFusco Can you give some intuition about when such a strategy might be useful? I'm on a Slurm cluster and I'm seeing that my GPU utilization is not 100%, but the CPU utilization (single core) is 100%. My thought is that by creating a multi-core dask generator I can feed the Keras queue faster. I'm slightly concerned that the amount of overhead involved might swamp out any performance gains. I'd be interested in wild guesses as to whether this feels like a reasonable use case.
@bw4sz did you get it working with SLURM? I'm also on an HPC (LSF). Can you share the code you used to make it work?
Hi @bw4sz, I was using my generator in the case where I wanted to train a model with data that was larger than available memory.

Much like the Dask-ML Incremental wrapper, my DaskGenerator provides no parallelization. It is an out-of-memory streaming technique compatible with Keras .fit(). While .fit() can do multiprocessing, I was not able to get it to work.

If your dataset is on a Distributed cluster, there are some things that may help performance: 1) persist the transformed data that goes into the model, and 2) set the chunksize/partitions to a size that will fit into memory on the client (GPU). I might recommend using the largest possible chunk sizes, but I have no evidence to back this up.

I was not using the Keras .fit() multiprocessing because it caused errors. Here's a helpful quote from a good article on Keras fit_generator (https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly): "Note that our implementation enables the use of the multiprocessing argument of fit_generator, where the number of threads specified in n_workers are those that generate batches in parallel. A high enough number of workers assures that CPU computations are efficiently managed, i.e. that the bottleneck is indeed the neural network's forward and backward operations on the GPU (and not data generation)."
If your goal is to predict in parallel with an already trained model, there may be a way to utilize Distributed, but it might require some initialization per worker. This Stack Overflow reply may give you some inspiration to develop a procedure similar to the way dask-xgboost works: https://stackoverflow.com/a/49133682
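Since the DaskGenerator itself isn't shown in this thread, here is a minimal sketch of the same out-of-memory streaming idea, assuming TensorFlow 2.x Keras and dask arrays chunked identically along the sample axis; the array shapes, the tiny model, and the DaskSequence name are illustrative only, not AlbertDeFusco's actual code.

```python
import dask.array as da
from tensorflow import keras

class DaskSequence(keras.utils.Sequence):
    """Stream one dask chunk per Keras batch; nothing here is parallelized."""

    def __init__(self, X, y):
        super().__init__()
        # Assumes X and y are dask arrays chunked identically along axis 0.
        self.X_blocks = X.to_delayed().ravel()
        self.y_blocks = y.to_delayed().ravel()

    def __len__(self):
        return len(self.X_blocks)

    def __getitem__(self, idx):
        # Only this chunk is materialized; the rest of the data stays lazy.
        return self.X_blocks[idx].compute(), self.y_blocks[idx].compute()

# Illustrative data: 100k rows streamed in chunks of 10k.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(DaskSequence(X, y), epochs=1)
```

Each __getitem__ call computes a single chunk, so only one chunk of X and y is ever held in client memory at a time.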
Awesome. Still working on this; I'm also experimenting with tf.data and TFRecords instead of fit_generator. Either way, dask may be useful here.
I have a working prediction example with Keras for those who come back here.
This fails:
This succeeds:
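Since the failing and succeeding code blocks aren't shown here, below is a hedged sketch of the general pattern this comment and the follow-ups point to: load the model inside the task on the worker rather than shipping a Keras model object from the client. The names (predict_batch, model_path, the batch shapes) are hypothetical, not the original notebook's code.

```python
import numpy as np
from dask.distributed import Client

def predict_batch(batch, model_path):
    # Import and load inside the task so no Keras model object has to be
    # serialized from the client to the workers.
    from tensorflow import keras
    model = keras.models.load_model(model_path)
    return model.predict(batch)

if __name__ == "__main__":
    client = Client()                      # local cluster, for illustration
    model_path = "trained_model.h5"        # hypothetical saved model
    batches = [np.random.random((32, 28, 28, 1)) for _ in range(8)]

    futures = client.map(predict_batch, batches, model_path=model_path)
    predictions = client.gather(futures)
```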
I also wanted to add here that if you are repeatedly loading a model on a worker to perform prediction (I could not get Keras model serialization to work on GPU), make sure to clear the TensorFlow backend each time, or else you will see a steady, scary growth in memory until it spills.
Calling gc.collect() had no effect; you must clear the session.
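A minimal sketch of that advice, assuming the same load-predict-release loop on a worker (the function and path names are placeholders):

```python
from tensorflow import keras

def predict_and_release(batch, model_path):
    model = keras.models.load_model(model_path)
    preds = model.predict(batch)
    # gc.collect() alone does not free the TensorFlow graph between calls;
    # clearing the Keras session is what keeps worker memory flat.
    keras.backend.clear_session()
    return preds
```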
What is batch_array? You don't define it in your code. Where is x_test used?
I saw this too; it was cut out from an IPython notebook: https://github.com/weecology/NEON_crown_maps/blob/master/dask_keras_example.ipynb
Edited above. I was reviewing this and it still needs more thought. Yes, it runs, but the predict function would need to be pretty slow to make dask useful here.
Thanks very much for confirming! I guessed that that's what it was, and I got it working as well. Even if there isn't a speed-up, it helps offload memory usage from an API we deployed with a 2GB memory restriction.
I have issues running Keras models with Dask when using multiple workers.
Is there any minimal working example?
I tried this code:
which gives this error message:
If I run the scheduler from the command line:
and replace in the code
then I get this:
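For reference, a hedged minimal sketch of the setup described here: a scheduler and worker started from the command line, with the script connecting by address instead of creating its own cluster. The address is a placeholder, and the prediction function itself can follow the load-the-model-inside-the-task pattern sketched earlier in the thread.

```python
# Started beforehand in separate terminals:
#   dask-scheduler                        # serves on tcp://<scheduler-ip>:8786
#   dask-worker tcp://<scheduler-ip>:8786
from dask.distributed import Client

if __name__ == "__main__":
    # In the script, swap Client() / LocalCluster() for the scheduler address.
    client = Client("tcp://<scheduler-ip>:8786")
    print(client)  # confirms the connection and the number of workers
```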