
Optimize CPU time of JPEG lossless decoder #4625

Merged: 5 commits into NVIDIA:main on Feb 1, 2023

Conversation

jantonguirao (Contributor):

Signed-off-by: Joaquin Anton <janton@nvidia.com>

Category:

Refactoring*

Description:

  • Rearranges nvJPEG lossless decoding to minimize CPU time
  • Parses each encoded stream only once
  • Parses the streams in parallel
  • Implements CanDecode on the regular nvJPEG decoder, so that unsupported streams fail earlier
  • Bugfix: nvJPEG lossless should be tried after nvJPEG

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@@ -144,6 +144,19 @@ NvJpegDecoderInstance::PerThreadResources::~PerThreadResources() {
}
}

bool NvJpegDecoderInstance::CanDecode(DecodeContext ctx, ImageSource *in, DecodeParams opts,

jantonguirao (Contributor, Author):

By implementing CanDecode, we can detect lossless JPEGs (not supported by this backend) and fail earlier.

CUDA_CALL(nvjpegDecodeBatchedSupported(nvjpeg_handle_, jpeg_stream_, &is_supported));
return is_supported == 0;
} catch (...) {
JpegParser jpeg_parser{};

jantonguirao (Contributor, Author):

Here, we avoid calling nvjpegJpegStreamParseHeader, which is very heavy and which we would need to repeat later anyway. Instead, we simply look for the SOF-3 marker and leave the full parsing to ScheduleDecode.
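
A minimal sketch of the idea (hypothetical helper, not the actual DALI JpegParser API): walk the JPEG marker segments until a start-of-frame marker is found and check whether it is SOF-3 (0xFFC3), which identifies a lossless stream.

#include <cstddef>
#include <cstdint>

// Returns true if the stream declares lossless (SOF-3) encoding.
// Simplified sketch: assumes a well-formed stream starting with SOI (0xFFD8)
// and ignores standalone markers that carry no length field.
bool IsLosslessJpeg(const uint8_t *data, size_t size) {
  size_t i = 2;  // skip the SOI marker
  while (i + 3 < size) {
    if (data[i] != 0xFF)
      return false;                 // lost marker alignment; give up
    uint8_t marker = data[i + 1];
    if (marker == 0xC3)
      return true;                  // SOF-3: lossless (sequential, Huffman)
    bool is_sof = marker >= 0xC0 && marker <= 0xCF &&
                  marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
    if (is_sof)
      return false;                 // a different SOF type, not lossless
    size_t segment_len = (size_t(data[i + 2]) << 8) | data[i + 3];
    i += 2 + segment_len;           // 2 marker bytes + payload (length includes itself)
  }
  return false;
}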

Reviewer (Contributor):

What happens when ScheduleDecode fails? Can we fall back to something else in such a case?
Although nvjpegJpegStreamParseHeader is expensive, it seems to be the only way to confirm that nvJPEG can handle the provided stream.

Reviewer (Contributor):

CanDecode can be optimistic - we do that all the time in other decoders. ScheduleDecode returns partial results and the high-level decoder will redirect the failing samples to a fallback (if there's any).
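
A rough illustration of this optimistic pattern (hypothetical types, not DALI's actual interfaces): the primary decoder attempts every sample, and whatever it rejects is re-routed by the orchestrating decoder to the fallback, if one is configured.

#include <functional>
#include <vector>

struct DecodeResult { bool success = false; };

std::vector<DecodeResult> DecodeWithFallback(
    int nsamples,
    const std::function<DecodeResult(int)> &primary,
    const std::function<DecodeResult(int)> &fallback) {
  std::vector<DecodeResult> results(nsamples);
  for (int i = 0; i < nsamples; i++) {
    results[i] = primary(i);          // optimistic attempt
    if (!results[i].success && fallback)
      results[i] = fallback(i);       // redirect only the failed sample
  }
  return results;
}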

Reviewer (Contributor):
Understood.

#include <vector>
#include "dali/core/cuda_error.h"
#include "dali/core/mm/memory.h"
#include "dali/imgcodec/image_decoder.h"
#include "dali/imgcodec/util/output_shape.h"
#include "dali/core/nvtx.h"
#include <typeinfo>

Reviewer (Contributor):

Move up or the linter will complain.

@@ -219,6 +222,8 @@ class ImageDecoder::DecoderWorker {
std::shared_ptr<ImageDecoderInstance> decoder_;
bool produces_gpu_output_ = false;

std::string nvtx_marker_str_;

Reviewer (Contributor):

Tiny nitpick - I'd move it to where all the bookkeeping stuff is (i.e., after the started_ flag).

Signed-off-by: Joaquin Anton <janton@nvidia.com>
if (batch_sz_ <= 0)
return;
int nsamples = in.size();
kernels::DynamicScratchpad s({}, ctx.stream);

JanuszL (Contributor), Jan 31, 2023:

I think this breaks. The DynamicScratchpad is destroyed before any postprocessing that uses it is invoked. Previously that happened in the same scope; now it does not.
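
A toy illustration of the lifetime problem (simplified; RunDecode, Postprocess, and the AllocateGPU call are assumed for illustration, only the scratchpad type and its constructor match the snippet above): memory served by the scratchpad is owned by it, so using that memory after the scratchpad's scope ends is a use-after-free.

// Hypothetical stand-ins for the decode and postprocessing steps.
void RunDecode(float *out, cudaStream_t stream);
void Postprocess(float *out, cudaStream_t stream);

void Broken(cudaStream_t stream, int n) {
  float *out = nullptr;
  {
    kernels::DynamicScratchpad s({}, stream);  // created in an inner scope...
    out = s.AllocateGPU<float>(n);             // ...which owns this allocation
    RunDecode(out, stream);
  }  // scratchpad destroyed here; `out` now dangles
  Postprocess(out, stream);  // would read freed memory
}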

Reviewer (Contributor):

True, we've been here before.

Reviewer (Contributor):

This kind of has me wondering if we can detect this somehow (without rewriting half of DALI kernels).

CUDA_CALL(nvjpegDecodeBatched(nvjpeg_handle_, state_, encoded_.data(), encoded_len_.data(),
decoded_.data(), ctx.stream));
} catch (...) {
Parse(promise, ctx, in, opts, rois);

Reviewer (Contributor):


You have to create the scratchpad here and pass it to RunDecode - otherwise Postprocess will use deleted memory.
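
A sketch of the suggested fix (RunDecode/Postprocess signatures are assumed; DecodeContext and the scratchpad constructor match the snippets in this PR): the scratchpad is created in the scope that also performs the postprocessing and is passed down to the decode step, so its allocations outlive both.

void RunDecode(kernels::DynamicScratchpad &s, DecodeContext ctx);
void Postprocess(kernels::DynamicScratchpad &s, DecodeContext ctx);

void DecodeAndPostprocess(DecodeContext ctx) {
  kernels::DynamicScratchpad scratchpad({}, ctx.stream);  // owned by the enclosing scope
  RunDecode(scratchpad, ctx);     // the decode step borrows the scratchpad
  Postprocess(scratchpad, ctx);   // its allocations are still alive here
}  // memory is released only after postprocessing has consumed it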

jantonguirao (Contributor, Author):

Done

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7168005]: BUILD STARTED

JanuszL self-assigned this on Feb 1, 2023
Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7168652]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7169290]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao (Contributor, Author):

!build

dali-automaton (Collaborator):

CI MESSAGE: [7170097]: BUILD STARTED

dali-automaton (Collaborator):

CI MESSAGE: [7170097]: BUILD PASSED

jantonguirao merged commit 9f0f7e0 into NVIDIA:main on Feb 1, 2023
aderylo pushed a commit to zpp-dali-2022/DALI that referenced this pull request Mar 17, 2023
Signed-off-by: Joaquin Anton <janton@nvidia.com>
JanuszL mentioned this pull request on Sep 6, 2023