[BUG]: Serious performance regression in branch 3.2.0 in cudf on Turing #7097

@ttnghia

Type of Bug

Performance

Component

CUB

Describe the bug

In RAPIDS, after switching the CCCL branch to 3.2.0, we observe a performance regression that makes some of our code paths run about 2X slower. For example:

## [0] Quadro RTX 6000

|  num_rows  |  depth  |  null_frequency  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|------------|---------|------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|    1024    |    4    |        0         |  17.769 ms |       5.34% |  39.703 ms |       5.35% | 21.934 ms | 123.44% |   SLOW   |
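As a quick sanity check on the row above, the `Diff` and `%Diff` columns can be recomputed from the two timings. This is only a sketch of the arithmetic; the variable names are illustrative and not part of any benchmark tooling.

```python
# Values copied from the Quadro RTX 6000 row above (milliseconds).
ref_ms = 17.769  # "Ref Time"
cmp_ms = 39.703  # "Cmp Time"

diff_ms = cmp_ms - ref_ms            # absolute slowdown
pct_diff = 100.0 * diff_ms / ref_ms  # slowdown relative to the reference
slowdown = cmp_ms / ref_ms           # same information as a ratio

print(f"Diff     = {diff_ms:.3f} ms")
print(f"%Diff    = {pct_diff:.2f}%")
print(f"Slowdown = {slowdown:.2f}x")
```

A `%Diff` of 123.44% corresponds to the comparison run taking roughly 2.23 times as long as the reference.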

We had to bisect the CCCL commits on branch 3.2.X to investigate. The bisection was not easy: we had to go back in time across several repositories at once and fix various build issues caused by build-system changes across those projects.
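For readers unfamiliar with the technique, the idea behind bisection can be sketched as a binary search over an ordered commit list. Everything below is hypothetical: the commit names, the timings, and the 30 ms threshold are placeholders made up for illustration, not data from this investigation, and the real process also involved rebuilding several repositories at each step.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Placeholder history and timings (NOT measured values).
timings_ms = {"c0": 23.0, "c1": 23.0, "c2": 23.0, "c3": 40.0, "c4": 40.0, "c5": 40.0}
history = list(timings_ms)  # dicts preserve insertion order: oldest to newest
THRESHOLD_MS = 30.0         # hypothetical cut-off between "good" and "bad"

culprit = first_bad_commit(history, lambda c: timings_ms[c] > THRESHOLD_MS)
print(culprit)
```

In practice `is_bad` would check out the commit, rebuild, and run the benchmark, which is where most of the effort described above goes.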

Finally, the root cause was found:

With that commit checked out:

| num_rows | depth | null_frequency | Samples | CPU Time  | Noise | GPU Time  | Noise |
|----------|-------|----------------|---------|-----------|-------|-----------|-------|
|     1024 |     4 |              0 |    374x | 40.160 ms | 5.06% | 40.154 ms | 5.06% |

Moving back one commit restores normal performance:

| num_rows | depth | null_frequency | Samples | CPU Time  | Noise | GPU Time  | Noise |
|----------|-------|----------------|---------|-----------|-------|-----------|-------|
|     1024 |     4 |              0 |    647x | 23.169 ms | 5.68% | 23.163 ms | 5.66% |
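To confirm that the gap between the two commits is well outside measurement noise, the GPU times and noise figures from the two tables can be compared directly. This is a rough sketch: treating the reported noise as a symmetric relative band around each mean is a simplifying assumption, not how the benchmark tool defines significance.

```python
# GPU times and noise copied from the two tables above.
bad_ms, bad_noise_pct = 40.154, 5.06    # at the offending commit
good_ms, good_noise_pct = 23.163, 5.66  # one commit earlier

# Assumption: treat noise as a symmetric relative band around each mean.
bad_low = bad_ms * (1 - bad_noise_pct / 100)
good_high = good_ms * (1 + good_noise_pct / 100)

slowdown = bad_ms / good_ms
bands_overlap = good_high >= bad_low

print(f"slowdown: {slowdown:.2f}x")
print(f"good upper bound: {good_high:.3f} ms, bad lower bound: {bad_low:.3f} ms")
print(f"noise bands overlap: {bands_overlap}")
```

Between these two adjacent commits the slowdown is about 1.73x, and the two noise bands are far apart, so the difference cannot be attributed to run-to-run variation.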

Since a 2X slowdown is a serious regression, we would like to have it fixed for our 26.02 release.

How to Reproduce

In cudf, build the benchmark SET_OPS_NVBENCH and run this particular one:

SET_OPS_NVBENCH --benchmark have_overlap --axis null_frequency=0 --axis depth=4 --axis num_rows=1024
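When re-running this at many commits, it may help to build the command line from a script. The sketch below only reconstructs the invocation shown above; the `./SET_OPS_NVBENCH` path is an assumption about where the built binary lives, and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(binary: str = "./SET_OPS_NVBENCH") -> list[str]:
    """Build the argv for the benchmark invocation from the issue.

    The binary path is an assumption; adjust it to your build directory.
    """
    axes = {"null_frequency": 0, "depth": 4, "num_rows": 1024}
    argv = [binary, "--benchmark", "have_overlap"]
    for name, value in axes.items():
        argv += ["--axis", f"{name}={value}"]
    return argv


argv = build_command()
print(shlex.join(argv))
# To actually run it: import subprocess; subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting issues if the binary path contains spaces.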

Expected behavior

The runtime should be around 20 ms.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

Labels

bug: Something isn't working right.
