Closed
Description
Currently, when we do batch decoding using get_frames_at_indexes, we allocate the batch memory on the host. We also allocate each frame independently and then copy its memory into the batch memory.
Memory is allocated here for the batch:
The copy is done here from frame to batch memory:
This is wasteful, especially when we are doing GPU decoding, because we incur multiple device-to-host transfers, one per frame.
Action items:
- We should respect the device when we are allocating the batch tensor memory.
- We should decode directly into the batch tensor memory instead of incurring an extra memcpy.
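
As a rough illustration of the second action item, here is a minimal Python sketch that contrasts the two patterns. It is not the actual decoder code: a `bytearray` stands in for the batch tensor, and `FRAME_BYTES`, `decode_frame_into`, `decode_batch_with_copy`, and `decode_batch_in_place` are hypothetical names made up for this example. It only shows the memory access pattern; respecting the device would additionally mean allocating the batch buffer on the same device the decoder writes to.

```python
FRAME_BYTES = 4 * 4 * 3  # hypothetical tiny frame: 4x4 pixels, 3 channels


def decode_frame_into(index: int, out: memoryview) -> None:
    """Hypothetical decoder stand-in: fills `out` in place for frame `index`."""
    fill = index % 256
    for i in range(len(out)):
        out[i] = fill


def decode_batch_with_copy(indexes):
    """Pattern described in the issue: per-frame allocation plus a copy."""
    batch = bytearray(len(indexes) * FRAME_BYTES)
    for slot, index in enumerate(indexes):
        frame = bytearray(FRAME_BYTES)  # separate allocation for each frame
        decode_frame_into(index, memoryview(frame))
        start = slot * FRAME_BYTES
        batch[start:start + FRAME_BYTES] = frame  # extra copy into the batch
    return batch


def decode_batch_in_place(indexes):
    """Proposed pattern: decode straight into a slice of the batch buffer."""
    batch = bytearray(len(indexes) * FRAME_BYTES)
    view = memoryview(batch)
    for slot, index in enumerate(indexes):
        start = slot * FRAME_BYTES
        # The slice is a view into `batch`, so no per-frame buffer or copy.
        decode_frame_into(index, view[start:start + FRAME_BYTES])
    return batch
```

Both functions produce the same bytes; the second one simply skips the intermediate per-frame buffer, which is the copy the action item proposes to remove.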