
Optimize CPU time of JPEG lossless decoder #4625

Merged: 5 commits into NVIDIA:main on Feb 1, 2023

Conversation

jantonguirao (Contributor):

Signed-off-by: Joaquin Anton <janton@nvidia.com>

Category:

Refactoring*

Description:

  • Rearranges nvJPEG lossless decoding to minimize CPU time
  • Parses each encoded stream only once
  • Parses the streams in parallel
  • Implements CanDecode on the regular nvJPEG decoder, so that unsupported streams fail earlier
  • Bugfix: nvJPEG lossless should be tried after nvJPEG

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@@ -144,6 +144,19 @@ NvJpegDecoderInstance::PerThreadResources::~PerThreadResources() {
}
}

bool NvJpegDecoderInstance::CanDecode(DecodeContext ctx, ImageSource *in, DecodeParams opts,

jantonguirao (Contributor, Author):

By implementing CanDecode, we can detect lossless JPEGs (not supported by this backend) and fail earlier.

CUDA_CALL(nvjpegDecodeBatchedSupported(nvjpeg_handle_, jpeg_stream_, &is_supported));
return is_supported == 0;
} catch (...) {
JpegParser jpeg_parser{};

jantonguirao (Contributor, Author):

Here, we avoid calling nvjpegJpegStreamParseHeader, which is very heavy and which we would need to repeat later anyway. Instead, we simply look for the SOF-3 marker and leave the full parsing to ScheduleDecode.
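
A minimal sketch of the idea (hypothetical helper, not the actual DALI JpegParser API): walk the JPEG marker segments until a start-of-frame marker is found and check whether it is SOF-3 (0xFFC3), which identifies a lossless stream.

#include <cstddef>
#include <cstdint>

// Returns true if the stream declares lossless (SOF-3) encoding.
// Simplified sketch: assumes a well-formed stream starting with SOI (0xFFD8)
// and ignores standalone markers that carry no length field.
bool IsLosslessJpeg(const uint8_t *data, size_t size) {
  size_t i = 2;  // skip the SOI marker
  while (i + 3 < size) {
    if (data[i] != 0xFF)
      return false;                 // lost marker alignment; give up
    uint8_t marker = data[i + 1];
    if (marker == 0xC3)
      return true;                  // SOF-3: lossless (sequential, Huffman)
    bool is_sof = marker >= 0xC0 && marker <= 0xCF &&
                  marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
    if (is_sof)
      return false;                 // a different SOF type, not lossless
    size_t segment_len = (size_t(data[i + 2]) << 8) | data[i + 3];
    i += 2 + segment_len;           // 2 marker bytes + payload (length includes itself)
  }
  return false;
}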

Reviewer (Contributor):

What happens when ScheduleDecode fails? Can we fall back to something else in such a case?
Although nvjpegJpegStreamParseHeader is expensive, it seems to be the only way to confirm that nvJPEG can handle the provided stream.

Reviewer (Contributor):

CanDecode can be optimistic - we do that all the time in other decoders. ScheduleDecode returns partial results and the high-level decoder will redirect the failing samples to a fallback (if there's any).
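
A rough illustration of this optimistic pattern (hypothetical types, not DALI's actual interfaces): the primary decoder attempts every sample, and whatever it rejects is re-routed by the orchestrating decoder to the fallback, if one is configured.

#include <functional>
#include <vector>

struct DecodeResult { bool success = false; };

std::vector<DecodeResult> DecodeWithFallback(
    int nsamples,
    const std::function<DecodeResult(int)> &primary,
    const std::function<DecodeResult(int)> &fallback) {
  std::vector<DecodeResult> results(nsamples);
  for (int i = 0; i < nsamples; i++) {
    results[i] = primary(i);          // optimistic attempt
    if (!results[i].success && fallback)
      results[i] = fallback(i);       // redirect only the failed sample
  }
  return results;
}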

Reviewer (Contributor):
Understood.

#include <vector>
#include "dali/core/cuda_error.h"
#include "dali/core/mm/memory.h"
#include "dali/imgcodec/image_decoder.h"
#include "dali/imgcodec/util/output_shape.h"
#include "dali/core/nvtx.h"
#include <typeinfo>

Reviewer (Contributor):

Move up or the linter will complain.

@@ -219,6 +222,8 @@ class ImageDecoder::DecoderWorker {
std::shared_ptr<ImageDecoderInstance> decoder_;
bool produces_gpu_output_ = false;

std::string nvtx_marker_str_;

Reviewer (Contributor):

Tiny nitpick - I'd move it to where all the bookkeeping stuff is (i.e., after the started_ flag).

Signed-off-by: Joaquin Anton <janton@nvidia.com>
if (batch_sz_ <= 0)
return;
int nsamples = in.size();
kernels::DynamicScratchpad s({}, ctx.stream);

JanuszL (Contributor), Jan 31, 2023:

I think this breaks. The DynamicScratchpad is destroyed before any postprocessing that uses it is invoked. Previously that happened in the same scope; now it does not.
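
A toy illustration of the lifetime problem (simplified; RunDecode, Postprocess, and the AllocateGPU call are assumed for illustration, only the scratchpad type and its constructor match the snippet above): memory served by the scratchpad is owned by it, so using that memory after the scratchpad's scope ends is a use-after-free.

// Hypothetical stand-ins for the decode and postprocessing steps.
void RunDecode(float *out, cudaStream_t stream);
void Postprocess(float *out, cudaStream_t stream);

void Broken(cudaStream_t stream, int n) {
  float *out = nullptr;
  {
    kernels::DynamicScratchpad s({}, stream);  // created in an inner scope...
    out = s.AllocateGPU<float>(n);             // ...which owns this allocation
    RunDecode(out, stream);
  }  // scratchpad destroyed here; `out` now dangles
  Postprocess(out, stream);  // would read freed memory
}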

Reviewer (Contributor):

True, we've been here before.

Reviewer (Contributor):

This kind of has me wondering if we can detect this somehow (without rewriting half of DALI kernels).

CUDA_CALL(nvjpegDecodeBatched(nvjpeg_handle_, state_, encoded_.data(), encoded_len_.data(),
decoded_.data(), ctx.stream));
} catch (...) {
Parse(promise, ctx, in, opts, rois);

Reviewer (Contributor):


You have to create the scratchpad here and pass it to RunDecode - otherwise Postprocess will use deleted memory.
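
A sketch of the suggested fix (RunDecode/Postprocess signatures are assumed; DecodeContext and the scratchpad constructor match the snippets in this PR): the scratchpad is created in the scope that also performs the postprocessing and is passed down to the decode step, so its allocations outlive both.

void RunDecode(kernels::DynamicScratchpad &s, DecodeContext ctx);
void Postprocess(kernels::DynamicScratchpad &s, DecodeContext ctx);

void DecodeAndPostprocess(DecodeContext ctx) {
  kernels::DynamicScratchpad scratchpad({}, ctx.stream);  // owned by the enclosing scope
  RunDecode(scratchpad, ctx);     // the decode step borrows the scratchpad
  Postprocess(scratchpad, ctx);   // its allocations are still alive here
}  // memory is released only after postprocessing has consumed it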

jantonguirao (Contributor, Author):

Done

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7168005]: BUILD STARTED

JanuszL self-assigned this on Feb 1, 2023
Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7168652]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7169290]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7170097]: BUILD STARTED

dali-automaton (Collaborator):

CI MESSAGE: [7170097]: BUILD PASSED

jantonguirao merged commit 9f0f7e0 into NVIDIA:main on Feb 1, 2023
aderylo pushed a commit to zpp-dali-2022/DALI that referenced this pull request Mar 17, 2023
Signed-off-by: Joaquin Anton <janton@nvidia.com>
JanuszL mentioned this pull request on Sep 6, 2023