Improve performance of experimental.resize #5662

banasraf · 2024-10-07T12:14:06Z

Category:

Refactoring

Description:

This PR reduces the CPU overhead of the experimental.resize operator. It contains the following improvements (from most to least significant):

using two workspace objects for the CV-CUDA op, which removes the synchronisation between minibatches
using a custom allocator (taking memory from the scratchpad) for the input/output TensorBatches
creating the input/output TensorBatch objects more efficiently by moving all common parts out of the loop
removing the auxiliary TensorLists in favour of using the original inputs/outputs directly with the PushFramesToBatch method
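
To illustrate why a second workspace can remove the per-minibatch wait, here is a toy model rather than DALI or CV-CUDA code. It assumes that while the host prepares minibatch `mb`, the GPU may still be running minibatch `mb - 1` but has finished everything older. `CountHostSyncs` is a hypothetical name introduced for this sketch only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model, not DALI or CV-CUDA code. Assumption: while the host prepares
// minibatch `mb`, the GPU may still be running minibatch `mb - 1`, but has
// finished everything older.
template <std::size_t NumWorkspaces>
int CountHostSyncs(int num_minibatches) {
  static_assert(NumWorkspaces > 0, "at least one workspace is required");
  std::array<int, NumWorkspaces> last_user;
  last_user.fill(-1);  // -1 means the workspace has never been used
  int syncs = 0;
  for (int mb = 0; mb < num_minibatches; ++mb) {
    int &owner = last_user[mb % NumWorkspaces];
    // Reusing a workspace whose previous minibatch may still be running
    // forces the host to wait before it can overwrite the workspace.
    if (owner >= 0 && owner >= mb - 1) ++syncs;
    owner = mb;
  }
  return syncs;
}
```

Under this model, `CountHostSyncs<1>(4)` counts 3 waits, while `CountHostSyncs<2>(4)` counts none, because each workspace is reused only after the minibatch that last touched it is two steps old.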

Additional information:

Affected modules and functionalities:

experimental.resize, nvcvop.h/cc

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Rafal Banas <rbanas@nvidia.com>

banasraf · 2024-10-07T12:18:22Z

!build

dali-automaton · 2024-10-07T12:21:13Z

CI MESSAGE: [19099938]: BUILD STARTED

mzient · 2024-10-07T13:13:23Z

dali/operators/image/resize/experimental/resize_op_impl_cvcuda.h

+  TensorList<GPUBackend> in_frames_;
+  TensorList<GPUBackend> out_frames_;


Does this actually help? How much? It's against the trend of more aggressive dynamic allocation.

banasraf · 2024-10-07 (reply)

It helped A LOT. For some reason, the destructors of these objects were taking a lot of time.
Anyway, I'm working on removing those auxiliary TensorLists completely, because .ShareData and .Resize also take a significant amount of time.

dali/operators/nvcvop/nvcvop.cc

dali-automaton · 2024-10-07T14:05:39Z

CI MESSAGE: [19099938]: BUILD FAILED

Signed-off-by: Rafal Banas <rbanas@nvidia.com>

banasraf · 2024-10-29T15:37:34Z

!build

dali-automaton · 2024-10-29T15:40:39Z

CI MESSAGE: [19870509]: BUILD FAILED

banasraf · 2024-10-29T16:07:28Z

!build

dali-automaton · 2024-10-29T16:10:13Z

CI MESSAGE: [19871581]: BUILD STARTED

dali-automaton · 2024-10-29T18:21:00Z

CI MESSAGE: [19871581]: BUILD PASSED

mzient · 2024-10-30T09:09:00Z

dali/operators/image/resize/experimental/resize_op_impl_cvcuda.h

@@ -23,6 +23,7 @@
 #include "dali/kernels/imgproc/resample/params.h"
 #include "dali/operators/image/resize/resize_op_impl.h"
 #include "dali/operators/nvcvop/nvcvop.h"
+#include "dali/core/nvtx.h"


mzient · 2024-10-30T18:37:25Z

dali/operators/nvcvop/nvcvop.cc

+  for (int64_t i = 0; i < num_frames; ++i) {
+    if (frame_offset == sample_nframes) {
+      frame_offset = 0;
+      do {
+        ++sample_id;
+        auto sample_shape = input_shape[sample_id];
+        DALI_ENFORCE(sample_id < t_list.num_samples());
+        std::copy(&sample_shape[first_spatial_dim], &sample_shape[input_shape.sample_dim()],
+                  frame_shape.begin());
+        frame_stride = volume(frame_shape) * type_size;
+        sample_nframes = calc_num_frames(sample_shape, first_spatial_dim);
+      } while (sample_nframes * frame_stride == 0);  // we skip empty samples
+      data =
+          static_cast<const uint8_t *>(t_list.raw_tensor(sample_id)) + frame_stride * frame_offset;
+    }
+    tensors.push_back(AsTensor(data, make_span(frame_shape), dtype, nvcv_layout));
+    data += frame_stride;
+    frame_offset++;


I think that combining the two loops and changing the outer loop condition to "while there are frames left to insert" makes it more readable:

Suggested change (replaces the loop quoted above):

+  int frames_left = num_frames;
+  while (frames_left) {
+    if (frame >= sample_nframes) {
+      ++sample_id;
+      assert(sample_id < t_list.num_samples());
+      auto sample_shape = input_shape[sample_id];
+      std::copy(&sample_shape[first_spatial_dim], &sample_shape[input_shape.sample_dim()],
+                frame_shape.begin());
+      frame_stride = volume(frame_shape) * type_size;
+      if (frame_stride == 0) {  // this sample is (effectively) empty - skip
+        sample_nframes = 0;
+        continue;
+      }
+      sample_nframes = calc_num_frames(sample_shape, first_spatial_dim);
+      data = static_cast<const uint8_t *>(t_list.raw_tensor(sample_id)) + frame_stride * frame_offset;
+    }
+    tensors.push_back(AsTensor(data, make_span(frame_shape), dtype, nvcv_layout));
+    data += frame_stride;
+    frame_offset++;
+    frames_left--;
+  }
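
The single-loop shape ("while there are frames left to insert") can be tried out in isolation with stand-in types. The sketch below is not the code from nvcvop.cc: `Sample` and `CollectFrames` are hypothetical names, and a sample is reduced to a pointer, a frame count and a frame size in bytes. It resets the frame offset whenever it moves to the next sample and skips samples that have either zero frames or zero-sized frames.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one sample of a TensorList: a contiguous buffer
// holding `num_frames` frames of `frame_bytes` bytes each.
struct Sample {
  const std::uint8_t *data;
  std::int64_t num_frames;
  std::size_t frame_bytes;
};

// Collects pointers to `num_frames` consecutive non-empty frames, starting at
// frame `frame_offset` of sample `sample_id`. The caller must guarantee that
// enough frames exist; running past the last sample is an internal error.
inline std::vector<const std::uint8_t *> CollectFrames(const std::vector<Sample> &samples,
                                                       std::size_t sample_id,
                                                       std::int64_t frame_offset,
                                                       std::int64_t num_frames) {
  std::vector<const std::uint8_t *> frames;
  frames.reserve(static_cast<std::size_t>(num_frames));
  std::int64_t frames_left = num_frames;
  while (frames_left > 0) {
    assert(sample_id < samples.size());  // internal invariant, not user error
    const Sample &s = samples[sample_id];
    if (s.frame_bytes == 0 || frame_offset >= s.num_frames) {
      // Sample is empty or exhausted: move on and restart at its first frame.
      ++sample_id;
      frame_offset = 0;
      continue;
    }
    frames.push_back(s.data + s.frame_bytes * frame_offset);
    ++frame_offset;
    --frames_left;
  }
  return frames;
}
```

Recomputing the frame pointer from the offset on every iteration costs one multiplication per frame, but it removes the separate `data` cursor that has to be kept in sync with the offset.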

mzient · 2024-10-30T18:38:36Z

dali/operators/nvcvop/nvcvop.cc

+      do {
+        ++sample_id;
+        auto sample_shape = input_shape[sample_id];
+        DALI_ENFORCE(sample_id < t_list.num_samples());


I don't think that either the user or faulty data could trigger this. It would be an internal error, so I'd recommend using an assert or, at worst, throwing a logic_error.
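
As a rough illustration of the distinction drawn here, an internal invariant can be reported with `std::logic_error` instead of a user-facing enforcement macro. `CheckInternal` and `NextSample` are hypothetical helpers written for this sketch; they are not DALI functions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of DALI: reports a broken internal invariant.
// std::logic_error signals a programming error rather than bad user input.
inline void CheckInternal(bool condition, const std::string &what) {
  if (!condition)
    throw std::logic_error("Internal error: " + what);
}

// Example call site: advancing to the next sample must stay within the batch.
inline int NextSample(int sample_id, int num_samples) {
  ++sample_id;
  CheckInternal(sample_id < num_samples, "frame iteration ran past the last sample");
  return sample_id;
}
```

A plain `assert` would express the same invariant with no cost in release builds, at the price of not being checked there at all.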

mzient · 2024-10-30T18:41:32Z

dali/operators/image/resize/experimental/resize_op_impl_cvcuda.h

+      nvcvop::PushFramesToBatch(mb_input, input, first_spatial_dim_, mb.sample_offset,
+                                mb.frame_offset, mb.count, sample_layout_);
+      nvcvop::PushFramesToBatch(mb_output, output, first_spatial_dim_, mb.sample_offset,
+                                mb.frame_offset, mb.count, sample_layout_);


This is a bug. Both inputs and outputs should be inserted in one go, and skipping the empty samples should be based solely on the output size. The user may request resizing a non-empty tensor to (0, 0), which is not an error AFAIR. Resizing an empty input to a non-empty shape is an error and should throw at some point.
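
One way to picture "inserted in one go" is a single pass that emits an input frame and its output frame together and decides whether to skip a sample by looking at the output only. The types below (`SampleFrames`, `FrameRef`, `FrameBatch`) and the function `PushFramePairs` are hypothetical stand-ins, not the nvcvop API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-sample description; the frame count is shared by the
// input and the output of a resize.
struct SampleFrames {
  std::int64_t num_frames;
  std::int64_t in_frame_volume;   // elements per input frame
  std::int64_t out_frame_volume;  // elements per output frame
};

// A frame is identified here by (sample index, frame index).
using FrameRef = std::pair<std::size_t, std::int64_t>;

struct FrameBatch {
  std::vector<FrameRef> inputs;
  std::vector<FrameRef> outputs;
};

// Pushes input and output frames in the same pass so that both lists always
// stay aligned. Whether a sample is skipped depends on the output size only.
inline FrameBatch PushFramePairs(const std::vector<SampleFrames> &samples) {
  FrameBatch batch;
  for (std::size_t s = 0; s < samples.size(); ++s) {
    if (samples[s].out_frame_volume == 0)
      continue;  // resizing to an empty shape: nothing to compute
    for (std::int64_t f = 0; f < samples[s].num_frames; ++f) {
      batch.inputs.emplace_back(s, f);
      batch.outputs.emplace_back(s, f);
    }
  }
  return batch;
}
```

Because both lists are filled by the same loop iteration, frame `i` of `inputs` always corresponds to frame `i` of `outputs`, which two independent passes with different skip rules cannot guarantee.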

mzient · 2024-10-30T18:45:16Z

dali/operators/image/resize/resize_op_impl.h

+    if (volume(in_sample_shape) > 0)
+      total_frames += volume(&in_sample_shape[0], &in_sample_shape[first_spatial_dim]);


This is a bug: the emptiness of a frame depends on the output shape, not the input shape. At least in the old DALI resize, you can resize a non-empty frame to size 0. I understand that such samples should be skipped (both at input and output).
Resizing an empty frame to a non-zero shape is impossible and should throw.
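
The rule described in this comment could be sketched as follows: the frame count skips samples whose output is empty and rejects an empty input paired with a non-empty output. `Shape`, `Volume` and `CountFramesToProcess` are hypothetical names, shapes are plain vectors here, and the exception type is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Product of the extents in [first, last); an empty range has volume 1.
inline std::int64_t Volume(Shape::const_iterator first, Shape::const_iterator last) {
  return std::accumulate(first, last, std::int64_t{1}, std::multiplies<std::int64_t>());
}

// Counts the frames that actually need resizing. The leading dimensions
// (before `first_spatial_dim`) are treated as frame dimensions.
inline std::int64_t CountFramesToProcess(const std::vector<Shape> &in_shapes,
                                         const std::vector<Shape> &out_shapes,
                                         int first_spatial_dim) {
  assert(in_shapes.size() == out_shapes.size());
  std::int64_t total_frames = 0;
  for (std::size_t i = 0; i < in_shapes.size(); ++i) {
    const Shape &in = in_shapes[i];
    const Shape &out = out_shapes[i];
    if (Volume(out.begin(), out.end()) == 0)
      continue;  // empty output: skip the sample at both input and output
    if (Volume(in.begin(), in.end()) == 0)
      throw std::invalid_argument("Cannot resize an empty input to a non-empty shape");
    total_frames += Volume(in.begin(), in.begin() + first_spatial_dim);
  }
  return total_frames;
}
```

With inputs `{2, 4, 4, 3}` and `{3, 5, 5, 3}` resized to `{2, 8, 8, 3}` and `{3, 0, 0, 3}` and `first_spatial_dim = 1`, only the two frames of the first sample are counted.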

Improve performance of experimental.resize

0ff79a2

Signed-off-by: Rafal Banas <rbanas@nvidia.com>

mzient reviewed Oct 7, 2024

dali/operators/nvcvop/nvcvop.cc (outdated, resolved)

dali-automaton assigned mzient and stiepan Oct 8, 2024

banasraf force-pushed the improve-experimental-resize-performance branch 3 times, most recently from 168b119 to 971ba9a (October 29, 2024 15:29)

Remove intermediate TensorLists. Improve performance

44b9ca2

Signed-off-by: Rafal Banas <rbanas@nvidia.com>

banasraf force-pushed the improve-experimental-resize-performance branch from 971ba9a to 44b9ca2 (October 29, 2024 15:35)

Merge branch 'main' into improve-experimental-resize-performance

5e38b78

mzient reviewed Oct 30, 2024
