Sequence Buffer Sampling Performance #180
Comments
Oooh, that's not ideal. Let's see if we can speed this up. Did you profile with `CUDA_LAUNCH_BLOCKING=1`? Is it mostly extracting the observations that is slow? I have also noticed that this function can be time consuming; it's just a lot of data to copy if you're dealing with images. In that case, I think the really advanced thing to do would be to make a parallel & pipelined data loader, like people use in supervised learning (see the sketch below). It would need to use the read-write lock as in the asynchronous mode. Might get complicated!
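For concreteness, here is a minimal sketch of the kind of pipelined loader meant here: a background thread keeps extracting batches while the optimizer consumes them. `replay_buffer.sample_batch()` and the class name are hypothetical placeholders, not rlpyt code, and in asynchronous mode the worker would additionally need to hold the read lock around the extraction.

```python
# Hypothetical sketch of a pipelined batch loader: one background thread keeps
# extracting batches from the replay buffer while the optimizer consumes them.
# `replay_buffer.sample_batch(batch_size)` is assumed to be the slow call.
import queue
import threading


class PrefetchLoader:
    def __init__(self, replay_buffer, batch_size, depth=2):
        self.replay_buffer = replay_buffer
        self.batch_size = batch_size
        self.queue = queue.Queue(maxsize=depth)  # bounded so we don't run too far ahead
        self.thread = threading.Thread(target=self._worker, daemon=True)
        self.thread.start()

    def _worker(self):
        while True:
            # The slow copy/extraction happens here, overlapped with optimization.
            batch = self.replay_buffer.sample_batch(self.batch_size)
            self.queue.put(batch)  # blocks when the queue is full

    def next_batch(self):
        return self.queue.get()  # usually returns immediately once warmed up
```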
CUDA_LAUNCH_BLOCKING doesn't make a difference. I have tested just reading one large contiguous part of the buffer and reshaping it, and that is much faster (about 20x). So I think it's not the amount of data but the scattered read positions. I have also tried building a list of all the indices that should be read and then reading them all with a single torch call, but that is just as slow as your implementation with a loop (a rough comparison of these access patterns is sketched below). A parallel data loader would probably be the most elegant solution, but I gave up on it because it got too complicated.
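To make the comparison reproducible, here is a rough, self-contained benchmark of the three access patterns described above: a Python loop over batch elements, one advanced-indexing call, and one contiguous block copied and reshaped as a baseline for pure copy cost. The buffer shape and index layout are made up for illustration and are not the rlpyt code.

```python
# Rough CPU benchmark of the three access patterns discussed above.
# Shapes are hypothetical: buffer is [T_total, B, 84, 84] uint8 "frames".
import time
import torch

T_total, B, T, batch_B = 1000, 64, 40, 32
buffer = torch.zeros(T_total, B, 84, 84, dtype=torch.uint8)
t_idxs = torch.randint(0, T_total - T, (batch_B,)).tolist()
b_idxs = torch.randint(0, B, (batch_B,)).tolist()


def time_it(fn, reps=20):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps


# 1) Python loop over batch elements (one copy per sequence), then stack.
def loop_extract():
    return torch.stack([buffer[t:t + T, b] for t, b in zip(t_idxs, b_idxs)], dim=1)


# 2) Single advanced-indexing call with precomputed [T, batch_B] index tensors.
tt = torch.tensor(t_idxs)[None, :] + torch.arange(T)[:, None]   # time indices
bb = torch.tensor(b_idxs)[None, :].expand(T, batch_B)           # env indices

def fancy_extract():
    return buffer[tt, bb]


# 3) One contiguous block of the same size, copied and reshaped (baseline for
#    pure copy cost; only valid when the batch really is contiguous in memory).
def contiguous_extract():
    return buffer[:T, :batch_B].clone()


print("loop:      ", time_it(loop_extract))
print("fancy idx: ", time_it(fancy_extract))
print("contiguous:", time_it(contiguous_extract))
```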
Dang, bummer to hear :( An intermediate solution could be to use the asynchronous runner, so that sampling runs continuously in one process while optimization runs continuously in another. If sampling is the slower part anyway, this would hide the memory copy time (see the rough configuration sketch below). Does it make it so you can't run the experiment?
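As a rough outline of what switching to asynchronous mode might look like: the class names and affinity arguments below follow my reading of the async examples in the repo and should be double-checked there, and the env, agent, and algo objects are placeholders.

```python
# Rough outline of asynchronous mode, so that sampling (and the buffer copies
# it triggers) overlaps with optimization in separate processes.
# Class names and arguments follow my reading of rlpyt's async examples and
# should be verified against the repo; env/agent/algo below are placeholders.
from rlpyt.runners.async_rl import AsyncRl
from rlpyt.samplers.async_.cpu_sampler import AsyncCpuSampler
from rlpyt.utils.launching.affinity import make_affinity

affinity = make_affinity(
    n_cpu_core=8,        # hypothetical machine layout
    n_gpu=1,
    async_sample=True,   # sampler and optimizer run in separate processes
)

sampler = AsyncCpuSampler(
    EnvCls=MyEnv,        # placeholder environment class
    env_kwargs={},
    batch_T=40,          # placeholder sampling horizon
    batch_B=32,          # placeholder number of parallel environments
)

runner = AsyncRl(
    algo=my_algo,        # must use an async-capable replay buffer (read-write lock)
    agent=my_agent,
    sampler=sampler,
    n_steps=int(1e6),
    affinity=affinity,
)
runner.train()
```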
I have some performance issues with the sequence buffers. I have traced it to the extract_sequence function in rlpyt/utils/misc.py. It is implemented with a loop over all batch elements, which seems to be quite slow, but I wasn't able to find torch functions that could replace the Python loop.
When I run my RL algo on a V100, the optimization loop spends about 50% of its time in the extract_batch() function.
Has anyone else encountered this problem before and found a solution?
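For reference, a minimal way to measure this split per optimization step, with `replay_buffer.sample_batch` and `optimize_step` as placeholders for the actual calls in the training loop:

```python
# Minimal timing sketch: how much of one optimization step is spent extracting
# the batch vs. running the gradient update. `replay_buffer.sample_batch` and
# `optimize_step` are placeholders for the real calls in the training loop.
import time
import torch


def timed(fn, *args, **kwargs):
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # don't let pending GPU work skew the timing
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start


samples, t_extract = timed(replay_buffer.sample_batch, batch_size)   # copy/extraction
_, t_optimize = timed(optimize_step, samples)                        # gradient update
print(f"extract: {t_extract:.4f}s   optimize: {t_optimize:.4f}s")
```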