Performance of the Java API with GPU support #4
Comments
I figured out the answer – this is probably a TensorFlow issue. After I compiled the TensorFlow Java bindings locally with GPU support, easy-bert is able to see the XLA_GPU device, and the whole computation graph is placed on the GPU. So I'm closing this, and I'll consider opening a ticket in the TensorFlow repository instead.
What kind of performance should one expect when running on CPU? I get around ~0.5 sec per sentence on the JVM 😭 @somerandomguyontheweb
@tovbinm I get similar performance, ~0.3 seconds per sentence, with 4 vCPU cores. I've been sending batches of several dozen sentences to easy-bert; in this setup, I believe more CPU cores should yield proportionally better latency.
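For reference, a minimal sketch of the batched setup described above, assuming easy-bert's Java API (com.robrua.nlp.bert.Bert with Bert.load and embedSequences, as shown in its README); the resource path and sentences are only illustrative:

```java
import com.robrua.nlp.bert.Bert;

public class BatchEmbedExample {
    public static void main(String[] args) throws Exception {
        // Illustrative pre-trained model resource path; adjust to the model you actually use.
        try (Bert bert = Bert.load("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12")) {
            // Embed a batch of sentences in a single call instead of one call per sentence.
            float[][] embeddings = bert.embedSequences(
                "First sentence.",
                "Second sentence.",
                "Third sentence.");
            System.out.println("Got " + embeddings.length + " embeddings of size " + embeddings[0].length);
        }
    }
}
```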
OK, I will try using batches.
I just compared BERT on a single record vs. batch execution, and the performance is roughly the same. Single record: ~0.51 sec per record (+/- 0.04 sec), measured on 2,000 records.
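A rough sketch of how such a comparison could be timed, assuming an already-loaded easy-bert Bert instance and the embedSequence/embedSequences method names from easy-bert's README (the helper class and its name are illustrative, and no warm-up handling is included):

```java
import java.util.List;
import com.robrua.nlp.bert.Bert;

final class EmbedTimer {
    // Returns {secondsPerRecordSingle, secondsPerRecordBatch} for a list of input records.
    static double[] time(Bert bert, List<String> records) {
        long start = System.nanoTime();
        for (String record : records) {
            bert.embedSequence(record);                        // one call per record
        }
        double single = (System.nanoTime() - start) / 1e9 / records.size();

        start = System.nanoTime();
        bert.embedSequences(records.toArray(new String[0]));   // one call for the whole batch
        double batch = (System.nanoTime() - start) / 1e9 / records.size();

        return new double[] {single, batch};
    }
}
```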
@somerandomguyontheweb According to the TensorFlow docs (https://www.tensorflow.org/install/lang_java), in order to get GPU support without recompiling you can pull in the following dependencies:

```xml
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow</artifactId>
  <version>1.14.0</version>
</dependency>
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow_jni_gpu</artifactId>
  <version>1.14.0</version>
</dependency>
```

On a related performance note, I'll eventually be adding a dynamic max sequence length optimization per #2, which will hopefully help a lot with runtime performance in most cases.

@tovbinm Looking at the BERT benchmarks from this project, it seems like you should be seeing performance improvements at that batch size, as long as the compute device supports that level of concurrency. I didn't really make any changes to the model & configuration from the TF Hub models Google provides when creating the Java models, so I'd be interested to see whether this is a limitation of the available parallelism from your CPU cores or whether there's some configuration I should change. Has either of you run a small benchmark test with a GPU?
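The dynamic max sequence length optimization mentioned above (per #2) isn't spelled out in this thread; a conceptual sketch of one way it could work, not easy-bert's actual implementation: pad each batch only to the length of its longest sequence (capped at the model's maximum) instead of always padding to the fixed maximum, so batches of short sentences run through far fewer timesteps. The method and parameter names below are illustrative:

```java
final class DynamicPadding {
    // Conceptual sketch only: pad already-tokenized sequences to the batch's longest
    // length (capped at modelMaxLen) rather than to the model's fixed maximum.
    static int[][] padToDynamicLength(int[][] tokenIds, int modelMaxLen, int padId) {
        int batchMax = 0;
        for (int[] sequence : tokenIds) {
            batchMax = Math.max(batchMax, Math.min(sequence.length, modelMaxLen));
        }
        int[][] padded = new int[tokenIds.length][batchMax];
        for (int i = 0; i < tokenIds.length; i++) {
            int copyLen = Math.min(tokenIds[i].length, batchMax);
            System.arraycopy(tokenIds[i], 0, padded[i], 0, copyLen);
            for (int j = copyLen; j < batchMax; j++) {
                padded[i][j] = padId;   // pad only up to the batch's longest sequence
            }
        }
        return padded;
    }
}
```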
Oh, actually, it occurs to me that this might just be a case of the tokenizer (which is a pure CPU implementation) bottlenecking the processing. I'll look into this and optimize the tokenization if that ends up being the case.
I haven't yet investigated exactly which part of the code is the bottleneck, but I doubt it's the CPU. During my tests CPU utilization was roughly around 60%, but most of the memory was occupied: 12 GB out of 16 GB. Here is the integration I did using easy-bert with TransmogrifAI: salesforce/TransmogrifAI#355. We have yet to test with GPUs. Hopefully @kinfaikan and @clin-projects can help 😉
@robrua Thanks for your suggestions! Actually I tried including the

Meanwhile I did some profiling with VisualVM – when running on GPU, this safety check in TensorFlow (invoked e.g. here) is very costly, ~30% of all execution time; locally, I removed it by recompiling

Regarding the dynamic sequence length, I've prototyped a quick-and-dirty solution myself – let me know if you'd like me to share it.

@tovbinm When running on CPU, what I observe is close to 100% utilization of all CPU cores (4 in my case). I expected this behavior to scale up. Maybe I'm missing something – if you have multiple CPU cores, would you mind checking (e.g. with
Hi,

Thanks for creating easy-bert! I'm experimenting with it on a GPU-equipped AWS machine (p2.xlarge with a pre-installed deep learning AMI), and I've noticed some differences in performance between the Python and Java versions. Just wanted to let you know – maybe you could advise what I'm doing wrong.

This is my sample code in Python:

Assuming the model has already been downloaded from TF Hub and cached, the script would claim all available GPU memory (when the Bert object is initialized), and then GPU utilization would quickly go up to 100% (while the sequences are processed). All in all, the script completes in ~30 seconds.

Upon saving the model with bert.save (let's say the path is /path/to/model), I load it inside a Java app – here is the relevant code snippet:

Like the Python version, this would take all of the GPU memory, but actual GPU utilization would be very low – no more than occasional spikes. Most of the computation is done on the CPU cores, and the app takes forever to complete. After I rebuilt easy-bert locally, adding a ConfigProto with .setLogDevicePlacement(true) in Bert.load, the log indicates that most nodes in the computation graph are indeed placed on the CPU.

Do you have any ideas why this could happen? Is it a TensorFlow issue, is something wrong with my installation, or could we tweak Bert.load in such a way that the GPU would be utilized properly? I would greatly appreciate any help. Thanks in advance!
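For context, a minimal sketch of loading a saved easy-bert model and embedding sequences from Java, assuming the com.robrua.nlp.bert.Bert class with a Bert.load overload that accepts a filesystem path and the embedSequences method from easy-bert's README; the path and sentences are illustrative and may differ from the snippet actually used in the report:

```java
import java.nio.file.Paths;
import com.robrua.nlp.bert.Bert;

public class LoadSavedModelExample {
    public static void main(String[] args) throws Exception {
        // Load the model previously exported from Python with bert.save("/path/to/model").
        try (Bert bert = Bert.load(Paths.get("/path/to/model"))) {
            float[][] embeddings = bert.embedSequences(
                "The first test sentence.",
                "The second test sentence.");
            System.out.println("Embedding dimension: " + embeddings[0].length);
        }
    }
}
```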
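And a sketch of the device-placement diagnostic described above, done directly against the TensorFlow 1.x Java API rather than inside Bert.load: loading the SavedModel with a ConfigProto that has log_device_placement enabled makes the native runtime print where each op is placed. This assumes the org.tensorflow:libtensorflow and org.tensorflow:proto artifacts and the standard "serve" tag; it only loads the bundle and does not run easy-bert's actual feed/fetch graph:

```java
import org.tensorflow.SavedModelBundle;
import org.tensorflow.framework.ConfigProto;

public class DevicePlacementCheck {
    public static void main(String[] args) {
        // Enable device placement logging; placements are printed by the native
        // TensorFlow runtime as graph nodes are assigned to devices.
        byte[] config = ConfigProto.newBuilder()
                .setLogDevicePlacement(true)
                .build()
                .toByteArray();

        try (SavedModelBundle bundle = SavedModelBundle.loader("/path/to/model")
                .withTags("serve")
                .withConfigProto(config)
                .load()) {
            System.out.println("SavedModel loaded; check the TensorFlow log for device placements.");
        }
    }
}
```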