
[BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed intermittently #4060

Closed
pxLi opened this issue Nov 9, 2021 · 8 comments · Fixed by rapidsai/cudf#9931 and #4400

Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf), P0 (Must have for release)

pxLi (Collaborator) commented Nov 9, 2021

Describe the bug
Related to #3770.

rapids_databricks_nightly-dev-github build ID 212

[2021-11-09T03:30:21.638Z] ../../src/main/python/hash_aggregate_test.py:1163: 
[2021-11-09T03:30:21.638Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-11-09T03:30:21.638Z] 
[2021-11-09T03:30:21.638Z] df_fun = <function test_hash_groupby_approx_percentile_long_repeated_keys.<locals>.<lambda> at 0x7f11f7f65320>
[2021-11-09T03:30:21.638Z] percentiles = [0.05, 0.25, 0.5, 0.75, 0.95]
[2021-11-09T03:30:21.638Z] conf = {'spark.rapids.sql.expression.ApproximatePercentile': 'true', 'spark.sql.adaptive.enabled': 'false'}
[2021-11-09T03:30:21.638Z] 
[2021-11-09T03:30:21.638Z]     def compare_percentile_approx(df_fun, percentiles, conf):
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         # create SQL statements for exact and approx percentiles
[2021-11-09T03:30:21.638Z]         p_exact_sql = create_percentile_sql("percentile", percentiles)
[2021-11-09T03:30:21.638Z]         p_approx_sql = create_percentile_sql("approx_percentile", percentiles)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         def run_exact(spark):
[2021-11-09T03:30:21.638Z]             df = df_fun(spark)
[2021-11-09T03:30:21.638Z]             df.createOrReplaceTempView("t")
[2021-11-09T03:30:21.638Z]             return spark.sql(p_exact_sql)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         def run_approx(spark):
[2021-11-09T03:30:21.638Z]             df = df_fun(spark)
[2021-11-09T03:30:21.638Z]             df.createOrReplaceTempView("t")
[2021-11-09T03:30:21.638Z]             return spark.sql(p_approx_sql)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         # run exact percentile on CPU
[2021-11-09T03:30:21.638Z]         exact = run_with_cpu(run_exact, 'COLLECT', _approx_percentile_conf)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         # run approx_percentile on CPU and GPU
[2021-11-09T03:30:21.638Z]         approx_cpu, approx_gpu = run_with_cpu_and_gpu(run_approx, 'COLLECT', _approx_percentile_conf)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         assert len(exact) == len(approx_cpu)
[2021-11-09T03:30:21.638Z]         assert len(exact) == len(approx_gpu)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         for i in range(len(exact)):
[2021-11-09T03:30:21.638Z]             cpu_exact_result = exact[i]
[2021-11-09T03:30:21.638Z]             cpu_approx_result = approx_cpu[i]
[2021-11-09T03:30:21.638Z]             gpu_approx_result = approx_gpu[i]
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]             # assert that keys match
[2021-11-09T03:30:21.638Z]             assert cpu_exact_result['k'] == cpu_approx_result['k']
[2021-11-09T03:30:21.639Z]             assert cpu_exact_result['k'] == gpu_approx_result['k']
[2021-11-09T03:30:21.639Z]     
[2021-11-09T03:30:21.639Z]             # extract the percentile result column
[2021-11-09T03:30:21.639Z]             exact_percentile = cpu_exact_result['the_percentile']
[2021-11-09T03:30:21.639Z]             cpu_approx_percentile = cpu_approx_result['the_percentile']
[2021-11-09T03:30:21.639Z]             gpu_approx_percentile = gpu_approx_result['the_percentile']
[2021-11-09T03:30:21.639Z]     
[2021-11-09T03:30:21.639Z]             if exact_percentile is None:
[2021-11-09T03:30:21.639Z]                 assert cpu_approx_percentile is None
[2021-11-09T03:30:21.639Z]                 assert gpu_approx_percentile is None
[2021-11-09T03:30:21.639Z]             else:
[2021-11-09T03:30:21.639Z]                 assert cpu_approx_percentile is not None
[2021-11-09T03:30:21.639Z] >               assert gpu_approx_percentile is not None
[2021-11-09T03:30:21.639Z] E               assert None is not None
[2021-11-09T03:30:21.639Z] 
[2021-11-09T03:30:21.639Z] ../../src/main/python/hash_aggregate_test.py:1263: AssertionError
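For context, the helper under test builds matching SQL for the exact and approximate aggregations. Below is a minimal sketch of what `create_percentile_sql` plausibly generates; only `k`, the temp view `t`, and the alias `the_percentile` are confirmed by the trace above, while the value column `v` and the exact SQL shape are assumptions:

```python
# Hypothetical reconstruction of the SQL builder called in the trace above.
# The real implementation is in hash_aggregate_test.py; only 'k', 't', and
# 'the_percentile' are confirmed by the trace. 'v' and the SQL shape are assumed.
def create_percentile_sql(func_name, percentiles):
    pct_array = ", ".join(str(p) for p in percentiles)
    return (f"SELECT k, {func_name}(v, array({pct_array})) AS the_percentile "
            f"FROM t GROUP BY k ORDER BY k")

# e.g. the approx variant that the test compares against the exact percentile:
print(create_percentile_sql("approx_percentile", [0.05, 0.25, 0.5, 0.75, 0.95]))
```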
pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Nov 9, 2021
pxLi (Collaborator, Author) commented Nov 9, 2021

@andygrove can you help take a look? thanks~

andygrove (Contributor) commented:

@pxLi How can I confirm that the cudf jar in this run included rapidsai/cudf#9537, which should be the fix for this issue?

andygrove (Contributor) commented:

I just tested with Databricks 7.3 and I cannot reproduce the issue.

[gw0] [ 50%] PASSED ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_long_repeated_keys[false] 
[gw1] [100%] PASSED ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_long_repeated_keys[true] 

Salonijain27 added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Nov 9, 2021
pxLi (Collaborator, Author) commented Nov 10, 2021

> @pxLi How can I confirm that the cudf jar in this run included rapidsai/cudf#9537, which should be the fix for this issue?

The commit info of the cudf jar can be fetched with:

unzip -p "cudf.jar" "cudf-java-version-info.properties"

For nightly runs we always use the latest snapshot build from the cudf nightly pipeline (cudf_nightly-dev-github).
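The same check can be scripted; here is a minimal sketch using Python's standard zipfile module. The jar filename below and the key=value layout of the properties file are assumptions, while the properties file name comes from the command above:

```python
# Minimal sketch: read the embedded version-info properties from a cudf jar.
# Assumes the properties file contains plain key=value lines; the jar filename
# below is illustrative, not the actual nightly artifact name.
import zipfile

def cudf_version_info(jar_path):
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read("cudf-java-version-info.properties").decode("utf-8")
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)

print(cudf_version_info("cudf-21.12.0-SNAPSHOT-cuda11.jar"))
```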

We also saw failures today in rapids_it-Dataproc build ID 350 (cudf 3280be2),
but the test passed today in rapids_databricks_nightly-dev-github,
and in rapids_it-Dataproc build ID 351 (a retrigger with the same commit 3280be2).

The test result looks non-deterministic, so we may not always be able to reproduce the error here.

I checked the cudf jar in the failed builds; it is based on 3280be2, which should include the fix from rapidsai/cudf#9537.

pxLi (Collaborator, Author) commented Nov 18, 2021

We are also seeing ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_long_repeated_keys[false]
intermittently hang and time out in some builds, e.g.

rapids_it-3.0.x-SNAPSHOT-dev-github build ID 256
rapids_it-3.2.x-SNAPSHOT-pre_release-github build ID 15


pxLi changed the title from "[BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed in databricks 7.3 runtime" to "[BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed intermittently" on Nov 19, 2021
andygrove (Contributor) commented:

The issue appears to be that the last bucket in the t-digest data is intermittently corrupted. Here are the values for the last 10 buckets when the test passes:

mean

1950 9882.88
1951 9884.85
1952 9888.76
1953 9890.01
1954 9922.68
1955 9967.43
1956 9975.05
1957 9987.18
1958 9994.150000000001
1959 0.0

weight

1950 1.0
1951 1.0
1952 1.0
1953 1.0
1954 1.0
1955 1.0
1956 1.0
1957 1.0
1958 2.0
1959 0.0

Here are the last 10 values when the test is failing. The last entry for both mean and weight is incorrect.

mean

1950 9882.88
1951 9884.85
1952 9888.76
1953 9890.01
1954 9922.68
1955 9967.43
1956 9975.05
1957 9987.18
1958 9994.150000000001
1959 2.2310560923097894E208

weight

1950 1.0
1951 1.0
1952 1.0
1953 1.0
1954 1.0
1955 1.0
1956 1.0
1957 1.0
1958 2.0
1959 5.932533616687933E276

mythrocks (Collaborator) commented:

If I'm not mistaken, the last few entries in the mean list are worrying even in the "passing" case:

1957 9987.18
1958 9994.150000000001
1959 0.0

Aren't these sorted in increasing order of mean? A trailing 0.0 would be wrong then, no?
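Both observations suggest a simple sanity check over the centroid lists would catch this corruption early. Below is a hypothetical validation, assuming the t-digest is available as parallel lists of means and weights; all names and the input domain are illustrative, not the cudf API:

```python
# Hypothetical sanity check over t-digest centroids, assuming they are extracted
# as parallel lists of means and weights (names illustrative, not the cudf API).
def check_tdigest(means, weights, lo, hi):
    assert len(means) == len(weights)
    prev_mean = lo
    for i, (m, w) in enumerate(zip(means, weights)):
        # every centroid must carry positive weight; the failing run above had
        # weight ~5.9e276 and the "passing" run a suspicious trailing 0.0
        assert w > 0.0, f"bucket {i}: non-positive weight {w}"
        # means must stay within the input domain and be non-decreasing; the
        # failing run had mean ~2.2e208 and the "passing" run a trailing 0.0
        assert lo <= m <= hi, f"bucket {i}: mean {m} outside [{lo}, {hi}]"
        assert m >= prev_mean, f"bucket {i}: means not sorted ({m} < {prev_mean})"
        prev_mean = m

# e.g., if the test data is drawn from roughly [0, 10000]:
# check_tdigest(means, weights, 0.0, 10000.0)
```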

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label on Nov 22, 2021
sameerz (Collaborator) commented Nov 22, 2021

approx_percentile is off by default. We are going to document this issue (PR #4173) and fix it in 22.02.
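For reference, the test opts in explicitly; the `_approx_percentile_conf` used in the trace above presumably holds the conf dict also printed there:

```python
# Config the failing test uses to opt in, matching the conf printed in the trace.
_approx_percentile_conf = {
    'spark.rapids.sql.expression.ApproximatePercentile': 'true',
    'spark.sql.adaptive.enabled': 'false',
}
```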

gerashegalov pushed a commit that referenced this issue Nov 24, 2021
Relates to #4060.

Skip some of the tests that intermittently fail in 21.12 to make sure they don't affect CI and the release.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 2, 2021
`detail::segmented_gather()` inadvertently uses `cuda_default_stream` in some parts of its implementation, while using the user-specified stream in others.

This applies to the calls to `copy_range_in_place()`, `allocate_like()`, and `make_lists_column()`. ~This might produce race conditions, which might explain NVIDIA/spark-rapids#4060. It's a rare failure that's quite hard to reproduce.~ This might lead to over-synchronization, though bad output is unlikely.

The commit here should sort this out by switching to the `detail` APIs corresponding to the calls above.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai/cudf#9679
andygrove added this to the Dec 13 - Jan 7 milestone on Jan 7, 2022
andygrove added a commit to andygrove/spark-rapids that referenced this issue Apr 25, 2022
Signed-off-by: Andy Grove <andygrove@nvidia.com>
jlowe pushed a commit that referenced this issue Apr 25, 2022
Signed-off-by: Andy Grove <andygrove@nvidia.com>