-
Notifications
You must be signed in to change notification settings - Fork 622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize CPU time of JPEG lossless decoder #4625
Conversation
@@ -144,6 +144,19 @@ NvJpegDecoderInstance::PerThreadResources::~PerThreadResources() { | |||
} | |||
} | |||
|
|||
bool NvJpegDecoderInstance::CanDecode(DecodeContext ctx, ImageSource *in, DecodeParams opts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By implementing CanDecode, we can detect lossless JPEGs (not supported by this backend), and fail earlier
CUDA_CALL(nvjpegDecodeBatchedSupported(nvjpeg_handle_, jpeg_stream_, &is_supported)); | ||
return is_supported == 0; | ||
} catch (...) { | ||
JpegParser jpeg_parser{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we avoid calling nvjpegJpegStreamParseHeader
which is very heavy and we need to do it later again. Instead, we simply look for the SOF-3 marker and leave the parsing to ScheduleDecode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when ScheduleDecode
fails? Can we fallback to something else in such case?
Although nvjpegJpegStreamParseHeader
is expensive it seems to be the only way to confirm that nvJPEG can handle provided stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CanDecode can be optimistic - we do that all the time in other decoders. ScheduleDecode returns partial results and the high-level decoder will redirect the failing samples to a fallback (if there's any).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood.
dali/imgcodec/image_decoder.cc
Outdated
#include <vector> | ||
#include "dali/core/cuda_error.h" | ||
#include "dali/core/mm/memory.h" | ||
#include "dali/imgcodec/image_decoder.h" | ||
#include "dali/imgcodec/util/output_shape.h" | ||
#include "dali/core/nvtx.h" | ||
#include <typeinfo> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move up or the linter will complain.
dali/imgcodec/image_decoder.cc
Outdated
@@ -219,6 +222,8 @@ class ImageDecoder::DecoderWorker { | |||
std::shared_ptr<ImageDecoderInstance> decoder_; | |||
bool produces_gpu_output_ = false; | |||
|
|||
std::string nvtx_marker_str_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nitpick - I'd move it where all the bookkeeping stuff is (i.e. after the started_
flag).
Signed-off-by: Joaquin Anton <janton@nvidia.com>
31773a8
to
068abb7
Compare
if (batch_sz_ <= 0) | ||
return; | ||
int nsamples = in.size(); | ||
kernels::DynamicScratchpad s({}, ctx.stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this breaks. The DynamicScratchpad is destroyed before any postprocessing on it is invoked. Previously it was done in the same scope, now it is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, we've been here before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of has me wondering if we can detect this somehow (without rewriting half of DALI kernels).
CUDA_CALL(nvjpegDecodeBatched(nvjpeg_handle_, state_, encoded_.data(), encoded_len_.data(), | ||
decoded_.data(), ctx.stream)); | ||
} catch (...) { | ||
Parse(promise, ctx, in, opts, rois); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to create the scratchpad here and pass it to RunDecode - otherwise Postprocess will use deleted memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Joaquin Anton <janton@nvidia.com>
!build |
CI MESSAGE: [7168005]: BUILD STARTED |
Signed-off-by: Joaquin Anton <janton@nvidia.com>
!build |
CI MESSAGE: [7168652]: BUILD STARTED |
Signed-off-by: Joaquin Anton <janton@nvidia.com>
f21a920
to
a510e8a
Compare
!build |
CI MESSAGE: [7169290]: BUILD STARTED |
Signed-off-by: Joaquin Anton <janton@nvidia.com>
!build |
CI MESSAGE: [7170097]: BUILD STARTED |
CI MESSAGE: [7170097]: BUILD PASSED |
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton janton@nvidia.com
Category:
Refactoring*
Description:
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A