POC: Use khash sets instead of maps for isin #53059
Conversation
Just found out about the `perf` tool for measuring branch prediction misses, and ran it on a benchmark.

On main I get the following:

```
 Performance counter stats for 'python -c import pandas as pd; import numpy as np; N = 1_000_000; ser = pd.Series(np.arange(N)); vals = np.arange(N); ser.isin(vals)':

          1,965.47 msec task-clock                       #    2.357 CPUs utilized
               147      context-switches                 #   74.791 /sec
                 0      cpu-migrations                   #    0.000 /sec
            31,457      page-faults                      #   16.005 K/sec
     3,936,570,319      cpu_core/cycles/                 #    2.003 G/sec
     3,802,321,801      cpu_atom/cycles/                 #    1.935 G/sec    (52.15%)
     7,214,481,236      cpu_core/instructions/           #    3.671 G/sec
     8,924,412,911      cpu_atom/instructions/           #    4.541 G/sec    (52.15%)
     1,376,744,013      cpu_core/branches/               #  700.464 M/sec
     1,457,321,116      cpu_atom/branches/               #  741.460 M/sec    (52.15%)
        23,755,435      cpu_core/branch-misses/          #   12.086 M/sec
            76,473      cpu_atom/branch-misses/          #   38.908 K/sec    (52.15%)
    19,950,214,260      cpu_core/slots/                  #   10.150 G/sec
     6,988,756,509      cpu_core/topdown-retiring/       #     34.9% Retiring
     3,077,912,645      cpu_core/topdown-bad-spec/       #     15.4% Bad Speculation
     5,841,369,682      cpu_core/topdown-fe-bound/       #     29.2% Frontend Bound
     4,090,935,951      cpu_core/topdown-be-bound/       #     20.5% Backend Bound
       607,579,968      cpu_core/topdown-heavy-ops/      #      3.0% Heavy Operations   #  31.9% Light Operations
     2,935,493,211      cpu_core/topdown-br-mispredict/  #     14.7% Branch Mispredict  #   0.7% Machine Clears
     2,838,977,426      cpu_core/topdown-fetch-lat/      #     14.2% Fetch Latency      #  15.0% Fetch Bandwidth
     2,956,225,784      cpu_core/topdown-mem-bound/      #     14.8% Memory Bound       #   5.7% Core Bound

       0.833991099 seconds time elapsed

       1.010541000 seconds user
       0.958761000 seconds sys
```

Versus this PR:

```
 Performance counter stats for 'python -c import pandas as pd; import numpy as np; N = 1_000_000; ser = pd.Series(np.arange(N)); vals = np.arange(N); ser.isin(vals)':

          1,954.42 msec task-clock                       #    2.370 CPUs utilized
               215      context-switches                 #  110.007 /sec
                 2      cpu-migrations                   #    1.023 /sec
            26,302      page-faults                      #   13.458 K/sec
     3,927,968,672      cpu_core/cycles/                 #    2.010 G/sec
     3,786,117,128      cpu_atom/cycles/                 #    1.937 G/sec    (52.56%)
     7,217,333,673      cpu_core/instructions/           #    3.693 G/sec
     8,849,613,601      cpu_atom/instructions/           #    4.528 G/sec    (52.56%)
     1,378,602,285      cpu_core/branches/               #  705.376 M/sec
     1,443,771,308      cpu_atom/branches/               #  738.720 M/sec    (52.56%)
        23,818,420      cpu_core/branch-misses/          #   12.187 M/sec
           299,996      cpu_atom/branch-misses/          #  153.496 K/sec    (52.56%)
    19,880,779,164      cpu_core/slots/                  #   10.172 G/sec
     6,961,062,372      cpu_core/topdown-retiring/       #     35.0% Retiring
     3,064,425,117      cpu_core/topdown-bad-spec/       #     15.4% Bad Speculation
     5,844,723,817      cpu_core/topdown-fe-bound/       #     29.4% Frontend Bound
     4,026,716,906      cpu_core/topdown-be-bound/       #     20.2% Backend Bound
       600,833,950      cpu_core/topdown-heavy-ops/      #      3.0% Heavy Operations   #  32.0% Light Operations
     2,926,332,561      cpu_core/topdown-br-mispredict/  #     14.7% Branch Mispredict  #   0.7% Machine Clears
     2,818,938,163      cpu_core/topdown-fetch-lat/      #     14.2% Fetch Latency      #  15.2% Fetch Bandwidth
     2,943,103,050      cpu_core/topdown-mem-bound/      #     14.8% Memory Bound       #   5.4% Core Bound

       0.824527840 seconds time elapsed

       1.034003000 seconds user
       0.924299000 seconds sys
```

So I'm not sure this really makes a big difference for branch prediction. Will have to research more.
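For reference, the workload inside the `perf stat` invocation above can be reproduced directly in Python. This is a sketch: the explicit `time.perf_counter` wrapping is my addition for convenience and is not part of the original command.

```python
import time

import numpy as np
import pandas as pd

# Same workload as the benchmarked one-liner.
N = 1_000_000
ser = pd.Series(np.arange(N))
vals = np.arange(N)

start = time.perf_counter()
result = ser.isin(vals)
elapsed = time.perf_counter() - start

# Every element of ser is present in vals, so the mask is all True.
print(f"isin over {N:,} rows took {elapsed:.3f}s; all matched: {result.all()}")
```

Wall-clock numbers from a wrapper like this are noisier than `perf stat`'s counters, but it is a quick way to confirm the workload behaves the same on both branches.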
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Closing for now but would be nice to pick up again in the future |
Expanded this to other primitives aside from int64 / uint64. Across the board this seems to help, although there are some regressions that need more investigation.
Haven't done the PyObject case yet, as the naming conventions there don't follow the same conventions as described here. Might need to tackle that in a follow-up.
OK, got everything set up. Some of the prior regressions were due to improper macro use / declarations. Surprised those didn't throw compiler errors... but there are many layers of indirection between tempita and khash. Something to investigate another day. Here are the results for a full run of the isin benchmarks - looks like this does help with scalability, as larger datasets show a 20-50% improvement. @mroeschke @jbrockmendel @realead for review
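A rough way to spot-check the scalability claim outside of the asv suite is to time `isin` across increasing sizes. This is my sketch, not the actual benchmark harness; the sizes and the single warm-up call are arbitrary choices.

```python
import time

import numpy as np
import pandas as pd

for n in (10_000, 100_000, 1_000_000):
    ser = pd.Series(np.arange(n))
    vals = np.arange(n)
    ser.isin(vals)  # warm-up call so one-time setup cost is excluded
    start = time.perf_counter()
    mask = ser.isin(vals)
    print(f"N={n:>9,}: {time.perf_counter() - start:.4f}s, all matched: {mask.all()}")
```

Running this on main and on the PR branch should make the claimed 20-50% gap on the larger sizes visible, although single-shot timings like these are noisy compared to asv's repeated runs.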
I don't really have much background to comment here, but it's nice that tests are passing.
This is a POC towards what @realead described in #39799
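khash itself lives in C, but the map-versus-set overhead the POC targets shows up in pure Python too: a dict must reserve a value slot per entry that a set does not. This is an illustrative analogy only; CPython container sizes are implementation details, and the numbers below say nothing about khash directly.

```python
import sys

n = 100_000
keys = range(n)

as_map = {k: 1 for k in keys}  # key -> dummy value, analogous to a khash map
as_set = set(keys)             # keys only, analogous to a khash set

# The dict carries an extra value slot per entry, so it is strictly larger.
print(sys.getsizeof(as_map) > sys.getsizeof(as_set))
```

For `isin`, the values stored in the map are never needed - only membership is queried - so dropping them is pure savings.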
The IsIn benchmarks overall seemed a bit unreliable, but I could consistently get results for `algos.isin.IsinWithArangeSorted`. The performance improvement on the largest dataset might be in line with @realead's expectation that:

> For big datasets, the running time of the above algorithms is dominated by cache-misses. Thus having twice as many cache-misses, because also values are touched, could mean a factor 2 slowdown.