Improve the way we allocate and use memory for GPU batch decoding

Currently, when we do batch decoding with get_frames_at_indexes, we allocate the batch memory on the host. We also allocate each frame independently and then copy it into the batch memory.

Memory is allocated here for the batch:

https://github.com/pytorch/torchcodec/blob/f4065f1b477148cfb0ef94167fb0bf3a63803e55/src/torchcodec/decoders/_core/VideoDecoder.cpp#L1021

The copy is done here from frame to batch memory:

https://github.com/pytorch/torchcodec/blob/f4065f1b477148cfb0ef94167fb0bf3a63803e55/src/torchcodec/decoders/_core/VideoDecoder.cpp#L1031

This is wasteful, especially for GPU decoding, because we incur multiple device-to-host transfers, one per frame.

Action items:
1. We should respect the device when we are allocating the batch tensor memory.
2. We should directly use the batch tensor memory instead of incurring an extra memcpy.
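
To make the two action items concrete, here is a minimal Python sketch of the intended pattern, not the actual VideoDecoder.cpp change. `fake_decode_into` and `decode_batch` are hypothetical names, the decoder is a stand-in that only fills the output with a constant, and the (N, H, W, C) uint8 layout is an assumption made for illustration.

```python
import torch


def fake_decode_into(out: torch.Tensor, frame_index: int) -> None:
    """Hypothetical stand-in for a decoder that writes one frame into `out`.

    A real decoder would write pixel data; this one fills `out` in place
    with a constant so the result is easy to check.
    """
    out.fill_(frame_index % 256)


def decode_batch(indices, height, width, device):
    # Action item 1: allocate the batch tensor on the requested device,
    # not unconditionally on the host.
    # The (N, H, W, C) uint8 layout is an assumption for this sketch.
    batch = torch.empty(
        (len(indices), height, width, 3), dtype=torch.uint8, device=device
    )
    # Action item 2: batch[i] is a view into the batch tensor's storage, so
    # decoding into it needs no per-frame allocation and no extra copy.
    for i, frame_index in enumerate(indices):
        fake_decode_into(batch[i], frame_index)
    return batch


# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = decode_batch([0, 5, 10], height=4, width=6, device=device)
```

Because each `batch[i]` shares storage with `batch`, frames decoded on the GPU stay on the GPU and no per-frame device-to-host transfer is needed.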

Improve the way we allocate and use memory for GPU batch decoding #189

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve the way we allocate and use memory for GPU batch decoding #189

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions