-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conform "bench_isin" to match generator column names #11549
Conform "bench_isin" to match generator column names #11549
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix looks good! One comment on parametrize with GPU data. We can fix that here, or in the future. Either way.
cudf.Series(range(1000)), | ||
range(50), | ||
{f"{string.ascii_lowercase[i]}": range(50) for i in range(10)}, | ||
cudf.DataFrame({f"{string.ascii_lowercase[i]}": range(50) for i in range(10)}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem isn’t new in this PR, but we should try to avoid constructing objects with GPU-resident data in the parametrize
call. Doing so causes GPU allocations while collecting tests, which happens before runtime and occurs for every test regardless of whether it is selected (e.g. with pytest -k
). An alternative solution is to pass a class and arguments, then construct the object at runtime. Even passing a parameter that is a lambda like lambda: cudf.DataFrame(…)
and calling that in the function body will suffice to make the call lazier and execute at runtime instead of eagerly at collection time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @bdice, is this change that you are proposing?
@benchmark_with_object(cls="dataframe", dtype="int")
@pytest.mark.parametrize(
"values",
[
lambda: range(50),
lambda: {f"{string.ascii_lowercase[i]}": range(50) for i in range(10)},
lambda: cudf.DataFrame({f"{string.ascii_lowercase[i]}": range(50) for i in range(10)}),
lambda: cudf.Series(range(50)),
],
)
def bench_isin(benchmark, dataframe, values):
benchmark(dataframe.isin, values())
isin
is one of the few benchmarks with DataFrame or Series as parameters, so I think it makes sense to address the lazy evaluation idea in this PR. The suggestion defers the evaluation of the parameters, but it makes values
a callable which makes it different from the other benchmarks.
@vyasr What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a sufficient solution for deferring execution from test collection-time to test runtime, yes! (Thank you for reading between the lines here, I wrote the original comment on mobile and couldn't write the actual code.)
In my view, values
being a callable here is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr I'm fine with using this suggestion for now. I don't think we should try to address the lazy evaluation question more generally in this PR because there are open ongoing discussions around it that also impact all our tests.
Avoiding collection-time creation of parametrized inputs is one of the many reasons that I advocated the use of pytest-cases
, which provides more elegant solutions for this, but we tabled that discussion in #11122 to prioritize getting the rest of that PR merged before my vacation. I plan to pick that discussion up again soon, and depending on where we come down we will either switch to using cases or enshrine some preferred solution for this issue into our developer documentation. I think that your current solution is fine in this instance, but there are more complex cases where the results are unreadable and I would like to figure out a consistent solution to apply across all our tests/benchmarks.
c0ca0a6
to
b80105b
Compare
Codecov ReportBase: 87.40% // Head: 87.48% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #11549 +/- ##
================================================
+ Coverage 87.40% 87.48% +0.07%
================================================
Files 133 133
Lines 21833 21864 +31
================================================
+ Hits 19084 19128 +44
+ Misses 2749 2736 -13
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@gpucibot merge |
Description
The version of
bench_isin
merged in #11125 used key and column names of the formatf"key{i}"
rather than the formatf"{string.ascii_lowercase[i]}"
as is used in the dataframe generator. As a result theisin
benchmark using a dictionary argument short-circuits with no matching keys, and theisin
benchmark using a dataframe argument finds no matches.This PR also adjusts the
isin
arguments fromrange(1000)
torange(50)
to better match the input dataframe cardinality of 100. Withrange(1000)
, every element matches but withrange(50)
only 50% of the elements match.Checklist