Add VideoReaderDecoder GPU #3668
Signed-off-by: Albert Wolant <awolant@nvidia.com>
// TODO(awolant): Extract decoding outside of ReadSample (ReaderDecoder abstraction)
for (int i = 0; i < sequence_len_; ++i) {
  // TODO(awolant): This seek can be optimized - for consecutive frames not needed etc.
Maybe we can optimize the seek itself and keep it here even with the optimization?
I moved the comment to Seek. This will be done as DALI-2320.
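To make the optimization under discussion concrete, here is a minimal sketch of skipping the seek when the requested frame is the one the decoder would produce next anyway. It is only an illustration under stated assumptions: `SketchDecoder`, `SeekFrame`, `ReadNextFrame` and `ReadSequence` are hypothetical names, not the DALI `FramesDecoderGpu` API, and the seek cost is modeled by a counter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a frame decoder; not the DALI FramesDecoderGpu API.
class SketchDecoder {
 public:
  // Positions the decoder so that the next decoded frame is `frame_id`.
  // The expensive reposition is skipped when the decoder is already there.
  void SeekFrame(int64_t frame_id) {
    if (frame_id == next_frame_) {
      return;  // consecutive read: already in position
    }
    ++seek_count_;  // stands in for the costly container/bitstream seek
    next_frame_ = frame_id;
  }

  // "Decodes" one frame, advances the position, returns the decoded index.
  int64_t ReadNextFrame() { return next_frame_++; }

  int seek_count() const { return seek_count_; }

 private:
  int64_t next_frame_ = 0;
  int seek_count_ = 0;
};

// Reads `sequence_len` frames starting at `start` with the given stride and
// returns how many real seeks were needed.
inline int ReadSequence(SketchDecoder &decoder, int64_t start,
                        int sequence_len, int stride) {
  const int before = decoder.seek_count();
  for (int i = 0; i < sequence_len; ++i) {
    decoder.SeekFrame(start + static_cast<int64_t>(i) * stride);
    decoder.ReadNextFrame();
  }
  return decoder.seek_count() - before;
}
```

With stride 1 only the first frame of a sequence can trigger a seek; with a larger stride every frame still does, which is where optimizing the seek itself would help.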
}

void VideoLoaderDecoderGpu::PrepareMetadataImpl() {
  video_files_.reserve(filenames_.size());
OK. So we have as many FramesDecoderGpu instances (each including a decoder instance inside) as there are input files.
I'm not sure how many of them we can have in parallel.
Solving this properly is part of DALI-2321, to be done once we have a benchmark (DALI-2594). Before that, it is hard to tell anything about the performance impact of any possible solution.
I think it is not about perf but about resource constraints. Creating 1000 decoders and parsers will consume a lot of resources.
Also, we have already hit the maximum number of files opened in parallel in the old VideoReader (libaviutil).
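One possible way to bound the number of simultaneously open decoders, sketched here purely as an illustration, is a small least-recently-used cache that creates decoders on demand and closes the oldest one when a limit is reached. Everything in it is an assumption rather than part of this PR: `SketchDecoder`, `DecoderCache`, `g_open_decoders` and the capacity are hypothetical, and a real decoder would own a file handle, a parser and an NVDECODE session instead of a string.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

int g_open_decoders = 0;  // number of live decoder instances (illustration only)

// Hypothetical decoder handle; stands in for a file, parser and decoder.
struct SketchDecoder {
  explicit SketchDecoder(std::string path) : filename(std::move(path)) {
    ++g_open_decoders;
  }
  ~SketchDecoder() { --g_open_decoders; }
  std::string filename;
};

// Keeps at most `capacity` decoders open, evicting the least recently used.
class DecoderCache {
 public:
  explicit DecoderCache(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);
  }

  SketchDecoder &Get(const std::string &filename) {
    auto it = lookup_.find(filename);
    if (it != lookup_.end()) {
      // Cache hit: move the entry to the front (most recently used).
      lru_.splice(lru_.begin(), lru_, it->second);
      return *lru_.front();
    }
    if (lru_.size() == capacity_) {
      // Cache full: close the least recently used decoder.
      lookup_.erase(lru_.back()->filename);
      lru_.pop_back();
    }
    lru_.push_front(std::make_unique<SketchDecoder>(filename));
    lookup_[filename] = lru_.begin();
    return *lru_.front();
  }

  bool IsOpen(const std::string &filename) const {
    return lookup_.count(filename) != 0;
  }

 private:
  using Entry = std::unique_ptr<SketchDecoder>;
  std::size_t capacity_;
  std::list<Entry> lru_;
  std::unordered_map<std::string, std::list<Entry>::iterator> lookup_;
};
```

The trade-off is that a file evicted from the cache has to be reopened and re-indexed on its next use, so whether this pays off depends on the access pattern and on the benchmark mentioned above.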
output_shape.resize(batch_size);

for (int sample_id = 0; sample_id < batch_size; ++sample_id) {
  auto &sample = current_batch[sample_id];
- auto &sample = current_batch[sample_id];
+ auto &sample = GetSample(sample_id);
Done
void VideoReaderDecoderGpu::RunImpl(DeviceWorkspace &ws) {
  auto &video_output = ws.Output<GPUBackend>(0);
  auto &current_batch = prefetched_batch_queue_[curr_batch_consumer_];
  int batch_size = current_batch.size();
- int batch_size = current_batch.size();
+ int batch_size = GetCurrBatchSize();
Done
video_output.Resize(output_shape, current_batch[0]->data_.type());

for (int sample_id = 0; sample_id < batch_size; ++sample_id) {
  auto &sample = current_batch[sample_id];
- auto &sample = current_batch[sample_id];
+ auto &sample = GetSample(sample_id);
Done
  output_shape.set_tensor_shape(sample_id, sample->data_.shape());
}

video_output.Resize(output_shape, current_batch[0]->data_.type());
- video_output.Resize(output_shape, current_batch[0]->data_.type());
+ video_output.Resize(output_shape, GetSample(0)->data_.type());
Done
  output_shape.set_tensor_shape(sample_id, sample->data_.shape());
}

video_output.Resize(output_shape, current_batch[0]->data_.type());
Maybe we can add SetupImpl and deal with shapes there?
I moved decoding to RunImpl. This means that we do not have the shape of the output ready during SetupImpl.
We could pass the size from the FramesDecoder through VideoSampleDesc, but I don't want to do this, as it will limit us in the future, when we optimize index building and might not know the size without decoding.
When this is more or less feature complete, we can revisit this refactoring, OK?
Signed-off-by: Albert Wolant <awolant@nvidia.com>
this->SaveFrame(
  frame_cpu.data(),
  i,
  sample_id,
  sequence_id,
  "/home/wazka/Downloads/frames/reader/",
  this->Width(video_idx),
  this->Height(video_idx));

this->SaveFrame(
  this->GetVfrFrame(video_idx, gt_frame_id + i * stride),
  i,
  sample_id,
  sequence_id,
  "/home/wazka/Downloads/frames/gt/",
  this->Width(video_idx),
  this->Height(video_idx));
Leftover?
Yes, I have that stashed for debugging. Removed
Signed-off-by: Albert Wolant <awolant@nvidia.com>
@@ -66,8 +66,12 @@ class DLL_PUBLIC FramesDecoderGpu : public FramesDecoder {

  int NextFramePts() { return index_[NextFrameIdx()].pts; }

  void SetCudaStream(cudaStream_t stream) { stream_ = stream; }
Do you use it anywhere now? Or do we keep going on the default stream?
No, removed this. Stream is set during the construction now.
Signed-off-by: Albert Wolant <awolant@nvidia.com>
!build
CI MESSAGE: [3985100]: BUILD STARTED
// TODO(awolant): Check per decoder stream
cudaStream_t stream;
DeviceGuard dg(device_id_);
CUDA_CALL(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
Consider using CUDAStream, or even better, just lease one from the pool:
dali::CUDAStreamPool::instance().Get(device_id_);
We can't use CUDAStream, as it is derived from UniqueHandle and I want to share this stream between decoders for now.
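The ownership point can be illustrated with a small sketch: a move-only handle has exactly one owner, while a `std::shared_ptr` with a custom deleter lets several decoders hold the same stream and destroys it when the last one goes away. This is not how the PR or DALI implements it. `FakeStream`, `CreateFakeStream`, `DestroyFakeStream` and `SketchDecoder` are hypothetical stand-ins so that the example compiles without the CUDA toolkit; with CUDA, the create and destroy calls would presumably be `cudaStreamCreateWithFlags` and `cudaStreamDestroy`.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for cudaStream_t and its create/destroy calls.
using FakeStream = int;
int g_live_streams = 0;
FakeStream CreateFakeStream() { ++g_live_streams; return 42; }
void DestroyFakeStream(FakeStream) { --g_live_streams; }

// Shared ownership: the stream is destroyed when the last owner is gone.
using SharedStream = std::shared_ptr<FakeStream>;

SharedStream MakeSharedStream() {
  return SharedStream(new FakeStream(CreateFakeStream()), [](FakeStream *s) {
    DestroyFakeStream(*s);
    delete s;
  });
}

// Hypothetical decoder that receives its stream at construction time.
class SketchDecoder {
 public:
  explicit SketchDecoder(SharedStream stream) : stream_(std::move(stream)) {}
  FakeStream stream() const { return *stream_; }

 private:
  SharedStream stream_;
};

// Creates `count` decoders that all share a single stream.
std::vector<SketchDecoder> MakeDecoders(int count) {
  SharedStream stream = MakeSharedStream();
  std::vector<SketchDecoder> decoders;
  decoders.reserve(count);
  for (int i = 0; i < count; ++i) {
    decoders.emplace_back(stream);
  }
  return decoders;
}
```

Leasing from a pool, as suggested above, would address the same lifetime question in a different way, with the pool rather than the decoders owning the stream.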
}

auto &labels_output = ws.Output<GPUBackend>(1);
vector<int> labels_cpu(batch_size);
SmallVector<int, 256> will save you an allocation for most batch sizes.
Done. I used a smaller value, as batch sizes in video use cases tend to be smaller.
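For readers unfamiliar with the idea behind the suggestion, the sketch below shows the same effect using only the C++17 standard library instead of DALI's SmallVector: a `std::pmr::vector` backed by a stack buffer stays off the heap while the batch fits, and falls back to an upstream allocator otherwise. The names `SumLabels` and `CountingResource` and the inline capacity of 32 are assumptions made for the illustration; the PR does not state the value that was actually chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <numeric>
#include <vector>

// Upstream resource that counts how often the fallback (heap) path is used.
class CountingResource : public std::pmr::memory_resource {
 public:
  int allocations() const { return allocations_; }

 private:
  void *do_allocate(std::size_t bytes, std::size_t alignment) override {
    ++allocations_;
    return std::pmr::new_delete_resource()->allocate(bytes, alignment);
  }
  void do_deallocate(void *p, std::size_t bytes, std::size_t alignment) override {
    std::pmr::new_delete_resource()->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource &other) const noexcept override {
    return this == &other;
  }
  int allocations_ = 0;
};

// Hypothetical helper: builds a per-sample label buffer and returns its sum.
// The buffer lives on the stack as long as the batch fits inline.
inline int SumLabels(int batch_size, std::pmr::memory_resource *upstream) {
  constexpr std::size_t kInlineCapacity = 32;  // assumed value, not from the PR
  alignas(int) std::byte buffer[kInlineCapacity * sizeof(int)];
  // Requests that do not fit into `buffer` are forwarded to `upstream`.
  std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer), upstream);
  std::pmr::vector<int> labels(static_cast<std::size_t>(batch_size), 0, &pool);
  std::iota(labels.begin(), labels.end(), 0);  // labels 0, 1, 2, ...
  return std::accumulate(labels.begin(), labels.end(), 0);
}
```

A batch of 16 labels is served entirely from the stack buffer, while a batch of 1000 has to go to the upstream resource, which is the allocation the inline capacity is meant to avoid for typical batch sizes.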
Signed-off-by: Albert Wolant <awolant@nvidia.com>
CI MESSAGE: [3985100]: BUILD PASSED
Signed-off-by: Albert Wolant <awolant@nvidia.com>
!build
CI MESSAGE: [3985950]: BUILD STARTED
CI MESSAGE: [3985950]: BUILD FAILED
!build
CI MESSAGE: [3987291]: BUILD STARTED
CI MESSAGE: [3987291]: BUILD FAILED
!build
CI MESSAGE: [3988097]: BUILD STARTED
CI MESSAGE: [3988097]: BUILD PASSED
* Add VideoReaderDecoderGpu op Signed-off-by: Albert Wolant <awolant@nvidia.com>
Category:
New feature: Adds the VideoReaderDecoderGpu op. This operator reads and decodes video files using the NVDECODE API. It supports both CFR and VFR videos. It provides basic functionality for now. Additional features (more formats, codecs, output types, input variants) will be added in subsequent tasks.
Description:
Additional information:
Affected modules and functionalities:
Added a new operator and loader. Adjusted FramesDecoderGpu, as some minor changes were needed (ability to set the stream after construction).
Key points relevant for the review:
Does this operator properly interface with FramesDecoderGpu?
Does this operator properly implement the DALI Reader abstraction, given that it does not exactly fit it?
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-2593