Use two streams, one per FT slice. #126

Sopel97 · 2021-06-08T12:15:58Z

Since the two slices of the feature transformer output are computed independently, we can try to run them on separate streams. On my GTX750 I can see a slight performance increase with very small FT sizes, and the CUDA profiler shows some overlap between kernels.

This may increase performance on beefier GPUs like V100, but that remains to be tested.

Note that we can in fact run two separate streams for backward too, even though they operate on the same output buffer, because all writes are atomic.
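To illustrate the point about atomic writes, here is a minimal sketch in plain CUDA runtime code, not the code from this PR. It launches one kernel per slice on its own stream, with both kernels accumulating into the same buffer through `atomicAdd`. The names `accumulate_slice` and `accumulate_with_two_streams`, the dummy all-ones gradients, and the `i % out_size` indexing are assumptions made up for the example. Whether the two kernels actually overlap depends on the GPU and the kernel sizes.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for a backward kernel: every thread adds one
// gradient value into a buffer that the other slice's kernel also updates.
// atomicAdd makes each read-modify-write indivisible, so no update is lost.
__global__ void accumulate_slice(const float* grad, float* shared_out,
                                 int n, int out_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&shared_out[i % out_size], grad[i]);
    }
}

// Runs one kernel per slice, each on its own stream, both accumulating into
// the same device buffer. Returns the sum of that buffer, or -1.0f if any
// CUDA call failed.
float accumulate_with_two_streams(int n, int out_size) {
    std::vector<float> host_grad(n, 1.0f);        // dummy gradients, all ones
    std::vector<float> host_out(out_size, 0.0f);
    float *grad0 = nullptr, *grad1 = nullptr, *out = nullptr;
    cudaStream_t s0 = nullptr, s1 = nullptr;
    size_t grad_bytes = n * sizeof(float);
    size_t out_bytes = out_size * sizeof(float);
    bool ok = true;

    ok &= cudaMalloc(&grad0, grad_bytes) == cudaSuccess;
    ok &= cudaMalloc(&grad1, grad_bytes) == cudaSuccess;
    ok &= cudaMalloc(&out, out_bytes) == cudaSuccess;
    ok &= cudaStreamCreate(&s0) == cudaSuccess;
    ok &= cudaStreamCreate(&s1) == cudaSuccess;

    if (ok) {
        ok &= cudaMemcpy(grad0, host_grad.data(), grad_bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
        ok &= cudaMemcpy(grad1, host_grad.data(), grad_bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
        ok &= cudaMemset(out, 0, out_bytes) == cudaSuccess;
        // Make sure the setup work has finished before the streams start.
        ok &= cudaDeviceSynchronize() == cudaSuccess;
    }

    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        // One launch per slice; the two streams are free to overlap.
        accumulate_slice<<<blocks, threads, 0, s0>>>(grad0, out, n, out_size);
        accumulate_slice<<<blocks, threads, 0, s1>>>(grad1, out, n, out_size);
        ok &= cudaGetLastError() == cudaSuccess;
        // Wait for both slices before reading the shared buffer.
        ok &= cudaStreamSynchronize(s0) == cudaSuccess;
        ok &= cudaStreamSynchronize(s1) == cudaSuccess;
        ok &= cudaMemcpy(host_out.data(), out, out_bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (s0) cudaStreamDestroy(s0);
    if (s1) cudaStreamDestroy(s1);
    cudaFree(grad0);
    cudaFree(grad1);
    cudaFree(out);

    if (!ok) return -1.0f;
    float total = 0.0f;
    for (float v : host_out) total += v;
    return total;
}
```

With a CUDA device available, `accumulate_with_two_streams(4096, 64)` should return 8192, since each of the two slices contributes 4096 ones. A result below that would indicate lost updates, which is what a non-atomic `+=` could produce here.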

vondele · 2021-06-08T13:05:51Z

That looks good to me; exposing the parallelism can only help.

Sopel97 · 2021-06-08T13:15:49Z

On V100:

1 thread, 1 worker:
before: 48.29 at 1000, 47.94 at 2000, 47.84 at 3000
after: 48.29 at 1000, 47.93 at 2000, 47.82 at 3000

8 threads, 4 workers:
before: 57.04 at 1000, 57.04 at 2000, 57.16 at 3000
after: 56.37 at 1000, 56.41 at 2000, 56.52 at 3000

So it doesn't help, at least for now, but it also doesn't do any harm.

vondele · 2021-06-08T13:20:45Z

We kind of know that on V100 and above it is limited by the CPU.

Sopel97 · 2021-06-26T23:02:11Z

I'd like to see some benchmarks from other people before pushing this.

Use two streams, one per FT slice.

a97d133

Sopel97 added the "help wanted" label · Jun 26, 2021
