
[BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed intermittently #4060

Closed
pxLi opened this issue Nov 9, 2021 · 8 comments · Fixed by rapidsai/cudf#9931 and #4400

Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf), P0 (Must have for release)

pxLi (Collaborator) commented Nov 9, 2021

Describe the bug
Related to #3770.

rapids_databricks_nightly-dev-github build ID 212

[2021-11-09T03:30:21.638Z] ../../src/main/python/hash_aggregate_test.py:1163: 
[2021-11-09T03:30:21.638Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-11-09T03:30:21.638Z] 
[2021-11-09T03:30:21.638Z] df_fun = <function test_hash_groupby_approx_percentile_long_repeated_keys.<locals>.<lambda> at 0x7f11f7f65320>
[2021-11-09T03:30:21.638Z] percentiles = [0.05, 0.25, 0.5, 0.75, 0.95]
[2021-11-09T03:30:21.638Z] conf = {'spark.rapids.sql.expression.ApproximatePercentile': 'true', 'spark.sql.adaptive.enabled': 'false'}
[2021-11-09T03:30:21.638Z] 
[2021-11-09T03:30:21.638Z]     def compare_percentile_approx(df_fun, percentiles, conf):
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         # create SQL statements for exact and approx percentiles
[2021-11-09T03:30:21.638Z]         p_exact_sql = create_percentile_sql("percentile", percentiles)
[2021-11-09T03:30:21.638Z]         p_approx_sql = create_percentile_sql("approx_percentile", percentiles)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         def run_exact(spark):
[2021-11-09T03:30:21.638Z]             df = df_fun(spark)
[2021-11-09T03:30:21.638Z]             df.createOrReplaceTempView("t")
[2021-11-09T03:30:21.638Z]             return spark.sql(p_exact_sql)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         def run_approx(spark):
[2021-11-09T03:30:21.638Z]             df = df_fun(spark)
[2021-11-09T03:30:21.638Z]             df.createOrReplaceTempView("t")
[2021-11-09T03:30:21.638Z]             return spark.sql(p_approx_sql)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         # run exact percentile on CPU
[2021-11-09T03:30:21.638Z]         exact = run_with_cpu(run_exact, 'COLLECT', _approx_percentile_conf)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         # run approx_percentile on CPU and GPU
[2021-11-09T03:30:21.638Z]         approx_cpu, approx_gpu = run_with_cpu_and_gpu(run_approx, 'COLLECT', _approx_percentile_conf)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         assert len(exact) == len(approx_cpu)
[2021-11-09T03:30:21.638Z]         assert len(exact) == len(approx_gpu)
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]         for i in range(len(exact)):
[2021-11-09T03:30:21.638Z]             cpu_exact_result = exact[i]
[2021-11-09T03:30:21.638Z]             cpu_approx_result = approx_cpu[i]
[2021-11-09T03:30:21.638Z]             gpu_approx_result = approx_gpu[i]
[2021-11-09T03:30:21.638Z]     
[2021-11-09T03:30:21.638Z]             # assert that keys match
[2021-11-09T03:30:21.638Z]             assert cpu_exact_result['k'] == cpu_approx_result['k']
[2021-11-09T03:30:21.639Z]             assert cpu_exact_result['k'] == gpu_approx_result['k']
[2021-11-09T03:30:21.639Z]     
[2021-11-09T03:30:21.639Z]             # extract the percentile result column
[2021-11-09T03:30:21.639Z]             exact_percentile = cpu_exact_result['the_percentile']
[2021-11-09T03:30:21.639Z]             cpu_approx_percentile = cpu_approx_result['the_percentile']
[2021-11-09T03:30:21.639Z]             gpu_approx_percentile = gpu_approx_result['the_percentile']
[2021-11-09T03:30:21.639Z]     
[2021-11-09T03:30:21.639Z]             if exact_percentile is None:
[2021-11-09T03:30:21.639Z]                 assert cpu_approx_percentile is None
[2021-11-09T03:30:21.639Z]                 assert gpu_approx_percentile is None
[2021-11-09T03:30:21.639Z]             else:
[2021-11-09T03:30:21.639Z]                 assert cpu_approx_percentile is not None
[2021-11-09T03:30:21.639Z] >               assert gpu_approx_percentile is not None
[2021-11-09T03:30:21.639Z] E               assert None is not None
[2021-11-09T03:30:21.639Z] 
[2021-11-09T03:30:21.639Z] ../../src/main/python/hash_aggregate_test.py:1263: AssertionError
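For context, the helper under test builds matching SQL for the exact and approximate aggregations. Below is a minimal sketch of what `create_percentile_sql` plausibly generates; only `k`, the temp view `t`, and the alias `the_percentile` are confirmed by the trace above, while the value column `v` and the exact SQL shape are assumptions:

```python
# Hypothetical reconstruction of the SQL builder called in the trace above.
# The real implementation is in hash_aggregate_test.py; only 'k', 't', and
# 'the_percentile' are confirmed by the trace. 'v' and the SQL shape are assumed.
def create_percentile_sql(func_name, percentiles):
    pct_array = ", ".join(str(p) for p in percentiles)
    return (f"SELECT k, {func_name}(v, array({pct_array})) AS the_percentile "
            f"FROM t GROUP BY k ORDER BY k")

# e.g. the approx variant that the test compares against the exact percentile:
print(create_percentile_sql("approx_percentile", [0.05, 0.25, 0.5, 0.75, 0.95]))
```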
pxLi added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Nov 9, 2021
pxLi (Collaborator, Author) commented Nov 9, 2021

@andygrove can you help take a look? thanks~

andygrove (Contributor) commented:

@pxLi How can I confirm that the cudf jar in this run included rapidsai/cudf#9537, which should be the fix for this issue?

andygrove (Contributor) commented:

I just tested with Databricks 7.3 and I cannot reproduce the issue.

[gw0] [ 50%] PASSED ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_long_repeated_keys[false] 
[gw1] [100%] PASSED ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_long_repeated_keys[true] 

Salonijain27 added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Nov 9, 2021
pxLi (Collaborator, Author) commented Nov 10, 2021

> @pxLi How can I confirm that the cudf jar in this run included rapidsai/cudf#9537, which should be the fix for this issue?

The commit info of the cudf jar can be fetched with:

unzip -p "cudf.jar" "cudf-java-version-info.properties"

For nightly runs we always use the latest snapshot build from the cudf nightly pipeline (cudf_nightly-dev-github).
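The same check can be scripted; here is a minimal sketch using Python's standard zipfile module. The jar filename below and the key=value layout of the properties file are assumptions, while the properties file name comes from the command above:

```python
# Minimal sketch: read the embedded version-info properties from a cudf jar.
# Assumes the properties file contains plain key=value lines; the jar filename
# below is illustrative, not the actual nightly artifact name.
import zipfile

def cudf_version_info(jar_path):
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read("cudf-java-version-info.properties").decode("utf-8")
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)

print(cudf_version_info("cudf-21.12.0-SNAPSHOT-cuda11.jar"))
```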

We also saw failures today in rapids_it-Dataproc build ID 350 (cudf 3280be2),
but the test passed today in rapids_databricks_nightly-dev-github,
and in rapids_it-Dataproc build ID 351 (a retrigger with the same commit 3280be2).

The test result looks non-deterministic, so we may not always be able to reproduce the error here.

I checked the cudf jar in the failed builds; it is based on 3280be2, which should include the fix from rapidsai/cudf#9537.

pxLi (Collaborator, Author) commented Nov 18, 2021

We are also seeing ../../src/main/python/hash_aggregate_test.py::test_hash_groupby_approx_percentile_long_repeated_keys[false]
intermittently hang and time out in some builds, e.g.

rapids_it-3.0.x-SNAPSHOT-dev-github build ID 256
rapids_it-3.2.x-SNAPSHOT-pre_release-github build ID 15


pxLi changed the title from "[BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed in databricks 7.3 runtime" to "[BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed intermittently" on Nov 19, 2021
andygrove (Contributor) commented:

The issue appears to be that the last bucket in the t-digest data is intermittently corrupted. Here are the values for the last 10 buckets when the test passes:

mean

1950 9882.88
1951 9884.85
1952 9888.76
1953 9890.01
1954 9922.68
1955 9967.43
1956 9975.05
1957 9987.18
1958 9994.150000000001
1959 0.0

weight

1950 1.0
1951 1.0
1952 1.0
1953 1.0
1954 1.0
1955 1.0
1956 1.0
1957 1.0
1958 2.0
1959 0.0

Here are the last 10 values when the test is failing. The last entry for both mean and weight is incorrect.

mean

1950 9882.88
1951 9884.85
1952 9888.76
1953 9890.01
1954 9922.68
1955 9967.43
1956 9975.05
1957 9987.18
1958 9994.150000000001
1959 2.2310560923097894E208

weight

1950 1.0
1951 1.0
1952 1.0
1953 1.0
1954 1.0
1955 1.0
1956 1.0
1957 1.0
1958 2.0
1959 5.932533616687933E276

mythrocks (Collaborator) commented:

If I'm not mistaken, the last few entries in the mean list are worrying even in the "passing" case:

1957 9987.18
1958 9994.150000000001
1959 0.0

Aren't these sorted in increasing order of mean? A trailing 0.0 would be wrong then, no?
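Both observations suggest a simple sanity check over the centroid lists would catch this corruption early. Below is a hypothetical validation, assuming the t-digest is available as parallel lists of means and weights; all names and the input domain are illustrative, not the cudf API:

```python
# Hypothetical sanity check over t-digest centroids, assuming they are extracted
# as parallel lists of means and weights (names illustrative, not the cudf API).
def check_tdigest(means, weights, lo, hi):
    assert len(means) == len(weights)
    prev_mean = lo
    for i, (m, w) in enumerate(zip(means, weights)):
        # every centroid must carry positive weight; the failing run above had
        # weight ~5.9e276 and the "passing" run a suspicious trailing 0.0
        assert w > 0.0, f"bucket {i}: non-positive weight {w}"
        # means must stay within the input domain and be non-decreasing; the
        # failing run had mean ~2.2e208 and the "passing" run a trailing 0.0
        assert lo <= m <= hi, f"bucket {i}: mean {m} outside [{lo}, {hi}]"
        assert m >= prev_mean, f"bucket {i}: means not sorted ({m} < {prev_mean})"
        prev_mean = m

# e.g., if the test data is drawn from roughly [0, 10000]:
# check_tdigest(means, weights, 0.0, 10000.0)
```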

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label on Nov 22, 2021
sameerz (Collaborator) commented Nov 22, 2021

approx_percentile is off by default. We are going to document this issue (PR #4173) and fix it in 22.02.
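For reference, the test opts in explicitly; the `_approx_percentile_conf` used in the trace above presumably holds the conf dict also printed there:

```python
# Config the failing test uses to opt in, matching the conf printed in the trace.
_approx_percentile_conf = {
    'spark.rapids.sql.expression.ApproximatePercentile': 'true',
    'spark.sql.adaptive.enabled': 'false',
}
```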

gerashegalov pushed a commit that referenced this issue Nov 24, 2021
Relates to #4060.

Skip some of the tests that intermittently fail in 21.12 to make sure they don't affect CI and the release.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 2, 2021
`detail::segmented_gather()` inadvertently uses `cuda_default_stream` in some parts of its implementation, while using the user-specified stream in others.

This applies to the calls to `copy_range_in_place()`, `allocate_like()`, and `make_lists_column()`. ~This might produce race conditions, which might explain NVIDIA/spark-rapids#4060. It's a rare failure that's quite hard to reproduce.~ This might lead to over-synchronization, though bad output is unlikely.

The commit here should sort this out by switching to the `detail` APIs corresponding to the calls above.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai/cudf#9679
andygrove added this to the Dec 13 - Jan 7 milestone on Jan 7, 2022
andygrove added a commit to andygrove/spark-rapids that referenced this issue Apr 25, 2022
Signed-off-by: Andy Grove <andygrove@nvidia.com>
jlowe pushed a commit that referenced this issue Apr 25, 2022
Signed-off-by: Andy Grove <andygrove@nvidia.com>