
Performance of the Java API with GPU support #4

Closed
somerandomguyontheweb opened this issue Jul 4, 2019 · 9 comments

@somerandomguyontheweb

Hi,

Thanks for creating easy-bert! I'm experimenting with it on a GPU-enabled AWS machine (p2.xlarge with a pre-installed deep learning AMI), and I've noticed some differences in performance between the Python and Java versions. Just wanted to let you know – maybe you can advise on what I'm doing wrong.

This is my sample code in Python:

from easybert import Bert

sequences = ['First do it', 'then do it right', 'then do it better'] * 20
bert = Bert("https://tfhub.dev/google/bert_multi_cased_L-12_H-768_A-12/1")
with bert:
    for i in range(50):
        _ = bert.embed(sequences, per_token=True)

Assuming the model has already been downloaded from TFHub and cached, the script would claim all available GPU memory (when the Bert object is initialized), and then GPU utilization would quickly go up to 100% (when the sequences are processed). All in all, the script completes in ~30 seconds.

Upon saving the model with bert.save (let's say the path is /path/to/model), I load it inside a Java app – here is the relevant code snippet:

import java.nio.file.Paths;
import com.robrua.nlp.bert.Bert;

// Build a batch of 60 sequences by repeating the 3 originals 20 times
String[] sequencesOriginal = {"First do it", "then do it right", "then do it better"};
String[] sequences = new String[60];
for (int i = 0; i < sequences.length; i += sequencesOriginal.length) {
    System.arraycopy(sequencesOriginal, 0, sequences, i, 3);
}

String pathToModel = "/path/to/model";
try (Bert bert = Bert.load(Paths.get(pathToModel))) {
    for (int i = 0; i < 50; i++) {
        float[][][] output = bert.embedTokens(sequences);
    }
}

Like the Python version, this would take all of the GPU memory, but actual GPU utilization would be very low – no more than occasional spikes. Most of the computation is done on the CPU cores, and the app takes forever to complete. After I rebuilt easy-bert locally, adding a ConfigProto with .setLogDevicePlacement(true) in Bert.load, the log indicates that most nodes in the computation graph are indeed placed on CPU.
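For reference, here is a minimal sketch of how such a session config could be passed when loading the SavedModel directly with the TensorFlow Java API – easy-bert's Bert.load doesn't currently expose this, hence the local rebuild mentioned above. The "serve" tag, the model path, and the org.tensorflow.framework proto classes (from the org.tensorflow:proto artifact) are assumptions on my part:

import org.tensorflow.SavedModelBundle;
import org.tensorflow.framework.ConfigProto;
import org.tensorflow.framework.GPUOptions;

public class DevicePlacementCheck {
    public static void main(String[] args) {
        // Serialized session config: log where each node is placed, and let
        // TensorFlow grow GPU memory on demand instead of claiming it all upfront
        byte[] config = ConfigProto.newBuilder()
                .setLogDevicePlacement(true)
                .setGpuOptions(GPUOptions.newBuilder().setAllowGrowth(true))
                .build()
                .toByteArray();

        try (SavedModelBundle bundle = SavedModelBundle.loader("/path/to/model")
                .withTags("serve")
                .withConfigProto(config)
                .load()) {
            // Device assignments are printed to stderr while the session is set up
            System.out.println("Loaded MetaGraphDef of " + bundle.metaGraphDef().length + " bytes");
        }
    }
}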

Do you have any idea why this could happen? Is it a TensorFlow issue, is something wrong with my installation, or could we tweak Bert.load in such a way that the GPU is utilized properly?

I would greatly appreciate any help. Thanks in advance!

@somerandomguyontheweb
Author

I figured out the answer – this is probably a TensorFlow issue. After I compiled the TensorFlow Java bindings locally with GPU support, easy-bert is able to see the XLA_GPU device, and the whole computation graph is placed on the GPU. So I'm closing this, and I'll consider opening a ticket in the TensorFlow repository instead.

@tovbinm

tovbinm commented Jul 10, 2019

What kind of performance should one expect when running on CPU? I get around ~0.5 sec per sentence with the JVM 😭 @somerandomguyontheweb

@somerandomguyontheweb
Author

@tovbinm I get similar performance, ~0.3 seconds per sentence, with 4 vCPU cores. I've been sending batches of several dozen sentences to easy-bert; in this setup, I believe with more CPU cores it should be possible to get proportionally better latency.

@tovbinm

tovbinm commented Jul 10, 2019

OK, I will try using embedSequences(final String... sequences) and see how it goes...
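For reference, a minimal usage sketch of the batch call (same Bert class and model path as in the snippet above; I'm assuming embedSequences returns one pooled vector per input sequence, whereas embedTokens returns per-token vectors):

try (Bert bert = Bert.load(Paths.get("/path/to/model"))) {
    // One session run over the whole batch
    float[][] embeddings = bert.embedSequences("First do it", "then do it right", "then do it better");
    System.out.println(embeddings.length + " sequences, " + embeddings[0].length + " dimensions each");
}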

@tovbinm

tovbinm commented Jul 10, 2019

I just compared BERT on a single record vs batch execution and the performance is roughly the same.

Single record: ~0.51 sec per record (+/- 0.04 sec), measured on 2000 records
Batch of 200 records: ~109 sec per batch (+/- 2 sec), measured on 10 batches – i.e. ~0.545 sec per record
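For what it's worth, a rough sketch of how that comparison could be reproduced with easy-bert's API (the com.robrua.nlp.bert package name and the model path are assumptions; timings will obviously vary with hardware):

import java.nio.file.Paths;
import java.util.Arrays;

import com.robrua.nlp.bert.Bert;

public class EmbedBenchmark {
    public static void main(String[] args) throws Exception {
        String[] batch = new String[200];
        Arrays.fill(batch, "then do it better");

        try (Bert bert = Bert.load(Paths.get("/path/to/model"))) {
            bert.embedSequences(batch); // warm-up, so one-time graph setup isn't counted

            // 200 records in a single call
            long start = System.nanoTime();
            bert.embedSequences(batch);
            System.out.printf("batch of %d: %.2f s%n", batch.length, (System.nanoTime() - start) / 1e9);

            // the same 200 records, one call at a time
            start = System.nanoTime();
            for (String s : batch) {
                bert.embedSequences(s);
            }
            System.out.printf("one at a time: %.2f s%n", (System.nanoTime() - start) / 1e9);
        }
    }
}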

@robrua
Owner

robrua commented Jul 11, 2019

@somerandomguyontheweb According to the TensorFlow docs (https://www.tensorflow.org/install/lang_java), in order to get GPU support without recompiling you can pull in the following dependencies:

<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow</artifactId>
  <version>1.14.0</version>
</dependency>
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow_jni_gpu</artifactId>
  <version>1.14.0</version>
</dependency>

On a related performance note, I'll eventually be adding a dynamic max sequence length optimization per #2, which will hopefully help a lot with runtime performance in most cases.

@tovbinm Looking at the BERT benchmarks from this project, it seems like you should be seeing performance improvements at that batch size, as long as the compute device supports that level of concurrency. When creating the Java models I didn't really change the model & configuration from the TF Hub models that Google provides, so I'd be interested to see whether this is a limitation of the available parallelism from your CPU cores or whether there's some configuration I should change. Has either of you run a small benchmark test with a GPU?

@robrua
Owner

robrua commented Jul 11, 2019

Oh, actually, it occurs to me this might just be a case of the tokenizer (which is a pure CPU implementation) bottlenecking the processing. I'll look into this and improve/optimize the tokenization if that ends up being the case.

@tovbinm

tovbinm commented Jul 11, 2019

I haven't yet investigated which part of the code exactly is the bottleneck, but I doubt it's the CPU. During my tests the CPU utilization was roughly 60%, but most of the memory was occupied (12GB out of 16GB).
So I suspect the process only used a single core of my CPU, and I'm not sure which knob in TensorFlow to turn in order to parallelize the inference (quite frankly, I expected it to parallelize by default).
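For what it's worth, the threading knobs live in the TensorFlow session config; here is a minimal sketch (the org.tensorflow.framework proto classes are an assumption, and easy-bert's Bert.load would need to be patched to pass the serialized config through to the SavedModel loader, as with the device-placement logging sketch earlier in the thread):

import org.tensorflow.framework.ConfigProto;

public class ThreadingConfig {
    public static void main(String[] args) {
        // Serialized session config to hand to the SavedModelBundle loader
        byte[] config = ConfigProto.newBuilder()
                .setIntraOpParallelismThreads(4) // threads used inside a single op, e.g. one matmul
                .setInterOpParallelismThreads(2) // independent ops that may run concurrently
                .build()
                .toByteArray();
        System.out.println("Serialized ConfigProto: " + config.length + " bytes");
    }
}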

Here is the integration I did using easy-bert with TransmogrifAI: salesforce/TransmogrifAI#355

We have yet to test with GPUs. Hopefully @kinfaikan and @clin-projects will help 😉

@somerandomguyontheweb
Author

@robrua Thanks for your suggestions! Actually, I tried including the libtensorflow_jni_gpu Maven dependency (1.13.1 rather than 1.14.0, not sure if that matters) in the pom.xml of my example app, and also recompiling easy-bert with -P gpu, but that didn't help. It looks like the native library in the libtensorflow_jni_gpu artifact has been built without XLA support, which prevents the BERT TensorFlow model from being placed on the GPU. Maybe something is wrong with my installation – let's see if other people report this issue.

Meanwhile, I did some profiling with VisualVM. When running on the GPU, this safety check in TensorFlow (invoked e.g. here) is very costly – about 30% of all execution time; I removed it locally by recompiling libtensorflow.jar. Most of the remaining time is spent in native calls. When running on the CPU, native call execution time vastly dominates everything else. In my experiments, tokenization doesn't seem to be a bottleneck.

Regarding the dynamic sequence length, I've prototyped a quick-and-dirty solution myself – let me know if you'd like me to share it.
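In the meantime, a simplified sketch of the general idea – pad each batch only to its own longest sequence instead of the model's fixed maximum (the token IDs and the pad ID below are placeholders, not easy-bert internals):

import java.util.Arrays;

public class DynamicPadding {
    // Pad a batch of token-ID sequences to the longest sequence in the batch,
    // capped at the model's configured maximum, instead of always padding to that maximum.
    static int[][] padToBatchMax(int[][] tokenIds, int padId, int modelMaxLength) {
        int batchMax = 0;
        for (int[] seq : tokenIds) {
            batchMax = Math.max(batchMax, seq.length);
        }
        batchMax = Math.min(batchMax, modelMaxLength);

        int[][] padded = new int[tokenIds.length][batchMax];
        for (int i = 0; i < tokenIds.length; i++) {
            int copyLength = Math.min(tokenIds[i].length, batchMax);
            System.arraycopy(tokenIds[i], 0, padded[i], 0, copyLength);
            Arrays.fill(padded[i], copyLength, batchMax, padId);
        }
        return padded;
    }

    public static void main(String[] args) {
        int[][] batch = { {101, 7592, 102}, {101, 2059, 2079, 2009, 2488, 102} };
        System.out.println(Arrays.deepToString(padToBatchMax(batch, 0, 128)));
    }
}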

@tovbinm When running on CPU, what I observe is close to 100% utilization of all CPU cores (4 in my case). I expected this behavior to scale up. Maybe I'm missing something – if you have multiple CPU cores, would you mind checking (e.g. with htop) if indeed only a single core is utilized?
