
Performance of the Java API with GPU support #4

Closed
somerandomguyontheweb opened this issue Jul 4, 2019 · 9 comments

@somerandomguyontheweb

Hi,

Thanks for creating easy-bert! I'm experimenting with it on a GPU-enabled AWS machine (p2.xlarge with a pre-installed deep learning AMI), and I've noticed some differences in performance between the Python and Java versions. Just wanted to let you know – maybe you can advise on what I'm doing wrong.

This is my sample code in Python:

from easybert import Bert

sequences = ['First do it', 'then do it right', 'then do it better'] * 20
bert = Bert("https://tfhub.dev/google/bert_multi_cased_L-12_H-768_A-12/1")
with bert:
    for i in range(50):
        _ = bert.embed(sequences, per_token=True)

Assuming the model has already been downloaded from TFHub and cached, the script would claim all available GPU memory (when the Bert object is initialized), and then GPU utilization would quickly go up to 100% (when the sequences are processed). All in all, the script completes in ~30 seconds.

Upon saving the model with bert.save (let's say the path is /path/to/model), I load it inside a Java app – here is the relevant code snippet:

import java.nio.file.Paths;
import com.robrua.nlp.bert.Bert;

// Build a batch of 60 sequences by repeating the 3 originals 20 times
String[] sequencesOriginal = {"First do it", "then do it right", "then do it better"};
String[] sequences = new String[60];
for (int i = 0; i < sequences.length; i += sequencesOriginal.length) {
    System.arraycopy(sequencesOriginal, 0, sequences, i, 3);
}

String pathToModel = "/path/to/model";
try (Bert bert = Bert.load(Paths.get(pathToModel))) {
    for (int i = 0; i < 50; i++) {
        float[][][] output = bert.embedTokens(sequences);
    }
}

Like the Python version, this would take all of the GPU memory, but actual GPU utilization would be very low – no more than occasional spikes. Most of the computation is done on the CPU cores, and the app takes forever to complete. After I rebuilt easy-bert locally, adding a ConfigProto with .setLogDevicePlacement(true) in Bert.load, the log indicates that most nodes in the computation graph are indeed placed on CPU.
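For reference, here is a minimal sketch of how such a session config could be passed when loading the SavedModel directly with the TensorFlow Java API – easy-bert's Bert.load doesn't currently expose this, hence the local rebuild mentioned above. The "serve" tag, the model path, and the org.tensorflow.framework proto classes (from the org.tensorflow:proto artifact) are assumptions on my part:

import org.tensorflow.SavedModelBundle;
import org.tensorflow.framework.ConfigProto;
import org.tensorflow.framework.GPUOptions;

public class DevicePlacementCheck {
    public static void main(String[] args) {
        // Serialized session config: log where each node is placed, and let
        // TensorFlow grow GPU memory on demand instead of claiming it all upfront
        byte[] config = ConfigProto.newBuilder()
                .setLogDevicePlacement(true)
                .setGpuOptions(GPUOptions.newBuilder().setAllowGrowth(true))
                .build()
                .toByteArray();

        try (SavedModelBundle bundle = SavedModelBundle.loader("/path/to/model")
                .withTags("serve")
                .withConfigProto(config)
                .load()) {
            // Device assignments are printed to stderr while the session is set up
            System.out.println("Loaded MetaGraphDef of " + bundle.metaGraphDef().length + " bytes");
        }
    }
}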

Do you have any idea why this could happen? Is it a TensorFlow issue, is something wrong with my installation, or could we tweak Bert.load in such a way that the GPU is utilized properly?

I would greatly appreciate any help. Thanks in advance!

@somerandomguyontheweb
Author

I figured out the answer – this is probably a TensorFlow issue. After I compiled the TensorFlow Java bindings locally with GPU support, easy-bert is able to see the XLA_GPU device, and the whole computation graph is placed on the GPU. So I'm closing this, and I'll consider opening a ticket in the TensorFlow repository instead.

@tovbinm

tovbinm commented Jul 10, 2019

What kind of performance should one expect when running on CPU? I get around ~0.5 sec per sentence with the JVM 😭 @somerandomguyontheweb

@somerandomguyontheweb
Author

@tovbinm I get similar performance, ~0.3 seconds per sentence, with 4 vCPU cores. I've been sending batches of several dozen sentences to easy-bert; in this setup, I believe with more CPU cores it should be possible to get proportionally better latency.

@tovbinm

tovbinm commented Jul 10, 2019

OK, I will try using embedSequences(final String... sequences) and see how it goes...
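For reference, a minimal usage sketch of the batch call (same Bert class and model path as in the snippet above; I'm assuming embedSequences returns one pooled vector per input sequence, whereas embedTokens returns per-token vectors):

try (Bert bert = Bert.load(Paths.get("/path/to/model"))) {
    // One session run over the whole batch
    float[][] embeddings = bert.embedSequences("First do it", "then do it right", "then do it better");
    System.out.println(embeddings.length + " sequences, " + embeddings[0].length + " dimensions each");
}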

@tovbinm

tovbinm commented Jul 10, 2019

I just compared BERT on a single record vs batch execution and the performance is roughly the same.

Single record: ~0.51 sec per record (+/- 0.04 sec), measured on 2000 records
Batch of 200 records: ~109 sec per batch (+/- 2 sec), measured on 10 batches – i.e. ~0.545 sec per record
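For what it's worth, a rough sketch of how that comparison could be reproduced with easy-bert's API (the com.robrua.nlp.bert package name and the model path are assumptions; timings will obviously vary with hardware):

import java.nio.file.Paths;
import java.util.Arrays;

import com.robrua.nlp.bert.Bert;

public class EmbedBenchmark {
    public static void main(String[] args) throws Exception {
        String[] batch = new String[200];
        Arrays.fill(batch, "then do it better");

        try (Bert bert = Bert.load(Paths.get("/path/to/model"))) {
            bert.embedSequences(batch); // warm-up, so one-time graph setup isn't counted

            // 200 records in a single call
            long start = System.nanoTime();
            bert.embedSequences(batch);
            System.out.printf("batch of %d: %.2f s%n", batch.length, (System.nanoTime() - start) / 1e9);

            // the same 200 records, one call at a time
            start = System.nanoTime();
            for (String s : batch) {
                bert.embedSequences(s);
            }
            System.out.printf("one at a time: %.2f s%n", (System.nanoTime() - start) / 1e9);
        }
    }
}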

@robrua
Owner

robrua commented Jul 11, 2019

@somerandomguyontheweb According to the TensorFlow docs (https://www.tensorflow.org/install/lang_java), in order to get GPU support without recompiling you can pull in the following dependencies:

<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow</artifactId>
  <version>1.14.0</version>
</dependency>
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow_jni_gpu</artifactId>
  <version>1.14.0</version>
</dependency>

On a related performance note, I'll eventually be adding a dynamic max sequence length optimization per #2, which will hopefully help a lot with runtime performance in most cases.

@tovbinm Looking at the BERT benchmarks from this project, it seems like you should be seeing performance improvements at that batch size, as long as the compute device supports that level of concurrency. When creating the Java models I didn't really change the model & configuration from the TF Hub models that Google provides, so I'd be interested to see whether this is a limitation of the available parallelism from your CPU cores or whether there's some configuration I should change. Has either of you run a small benchmark test with a GPU?

@robrua
Owner

robrua commented Jul 11, 2019

Oh, actually, it occurs to me this might just be a case of the tokenizer (which is a pure CPU implementation) bottlenecking the processing. I'll look into this and improve/optimize the tokenization if that ends up being the case.

@tovbinm

tovbinm commented Jul 11, 2019

I haven't yet investigated which part of the code exactly is the bottleneck, but I doubt it's the CPU. During my tests the CPU utilization was roughly 60%, but most of the memory was occupied (12GB out of 16GB).
So I suspect the process only used a single core of my CPU, and I'm not sure which knob in TensorFlow to turn in order to parallelize the inference (quite frankly, I expected it to parallelize by default).
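For what it's worth, the threading knobs live in the TensorFlow session config; here is a minimal sketch (the org.tensorflow.framework proto classes are an assumption, and easy-bert's Bert.load would need to be patched to pass the serialized config through to the SavedModel loader, as with the device-placement logging sketch earlier in the thread):

import org.tensorflow.framework.ConfigProto;

public class ThreadingConfig {
    public static void main(String[] args) {
        // Serialized session config to hand to the SavedModelBundle loader
        byte[] config = ConfigProto.newBuilder()
                .setIntraOpParallelismThreads(4) // threads used inside a single op, e.g. one matmul
                .setInterOpParallelismThreads(2) // independent ops that may run concurrently
                .build()
                .toByteArray();
        System.out.println("Serialized ConfigProto: " + config.length + " bytes");
    }
}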

Here is the integration I did using easy-bert with TransmogrifAI: salesforce/TransmogrifAI#355

We have yet to test with GPUs. Hopefully @kinfaikan and @clin-projects will help 😉

@somerandomguyontheweb
Author

@robrua Thanks for your suggestions! Actually, I tried including the libtensorflow_jni_gpu Maven dependency (1.13.1 rather than 1.14.0, not sure if that matters) in the pom.xml of my example app, and also recompiling easy-bert with -P gpu, but that didn't help. It looks like the native library in the libtensorflow_jni_gpu artifact has been built without XLA support, which prevents the BERT TensorFlow model from being placed on the GPU. Maybe something is wrong with my installation – let's see if other people report this issue.

Meanwhile, I did some profiling with VisualVM. When running on the GPU, this safety check in TensorFlow (invoked e.g. here) is very costly – about 30% of all execution time; I removed it locally by recompiling libtensorflow.jar. Most of the remaining time is spent in native calls. When running on the CPU, native call execution time vastly dominates everything else. In my experiments, tokenization doesn't seem to be a bottleneck.

Regarding the dynamic sequence length, I've prototyped a quick-and-dirty solution myself – let me know if you'd like me to share it.
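In the meantime, a simplified sketch of the general idea – pad each batch only to its own longest sequence instead of the model's fixed maximum (the token IDs and the pad ID below are placeholders, not easy-bert internals):

import java.util.Arrays;

public class DynamicPadding {
    // Pad a batch of token-ID sequences to the longest sequence in the batch,
    // capped at the model's configured maximum, instead of always padding to that maximum.
    static int[][] padToBatchMax(int[][] tokenIds, int padId, int modelMaxLength) {
        int batchMax = 0;
        for (int[] seq : tokenIds) {
            batchMax = Math.max(batchMax, seq.length);
        }
        batchMax = Math.min(batchMax, modelMaxLength);

        int[][] padded = new int[tokenIds.length][batchMax];
        for (int i = 0; i < tokenIds.length; i++) {
            int copyLength = Math.min(tokenIds[i].length, batchMax);
            System.arraycopy(tokenIds[i], 0, padded[i], 0, copyLength);
            Arrays.fill(padded[i], copyLength, batchMax, padId);
        }
        return padded;
    }

    public static void main(String[] args) {
        int[][] batch = { {101, 7592, 102}, {101, 2059, 2079, 2009, 2488, 102} };
        System.out.println(Arrays.deepToString(padToBatchMax(batch, 0, 128)));
    }
}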

@tovbinm When running on CPU, what I observe is close to 100% utilization of all CPU cores (4 in my case). I expected this behavior to scale up. Maybe I'm missing something – if you have multiple CPU cores, would you mind checking (e.g. with htop) if indeed only a single core is utilized?
