Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] benchmark gather then sort vs sort then gather in merge with sort=True #13630

Open
Tracked by #14479
wence- opened this issue Jun 28, 2023 · 0 comments
Open
Tracked by #14479
Labels
feature request New feature or request Performance Performance related issue

Comments

@wence-
Copy link
Contributor

wence- commented Jun 28, 2023

Is your feature request related to a problem? Please describe.

When we request sort=True in a cudf.merge, the current implementation does:

  1. deduce left and right join columns
  2. join, producing left and right gather maps
  3. gather left and right columns, and merge results
  4. deduce key columns to sort by
  5. argsort the key columns
  6. gather the result using the argsort return value

Trivially, steps 5 and 6 can be merged into a sort_by_key (that's #13557). However, this order probably does more data movement than it needs to. This makes two calls to gather, and one sort-by-key, at the cost of moving the full dataframe through memory twice (once in step 3, once in step 6).

Instead, we could (if sorting) first gather only the key columns we will sort by, argsort those and then use that ordering to sort the left and right gather maps.

  1. deduce left and right join columns
  2. join, producing left and right gather maps
  3. deduce left and right key columns to order by
  4. gather left key columns with left map, right key columns with right map
  5. sort-by-key the left and right gather maps with the columns from step 4
  6. gather left and right columns with new gather maps and merge

This makes four calls to gather and one sort-by-key, but only moves the full dataframe through memory once (in step 6). For dataframes with many non-key columns this might well be an advantage. The latency will be a bit higher, but the total data movement will be less. For example, consider (for simplicity) a left join with one key column and 10 total columns in both left and right dataframes.

The current approach (once the left and right gather maps have been determined) gathers 20 columns in step 3, argsorts one column, then gathers 20 columns again (sort-by-key merges the sort + gather into argsort + gather at the libcudf level).

The proposed alternative would gather 1 column in step 4, sorts-by-key two columns (the two gather maps), then gathers 20 columns. So we move effectively 23 columns through memory rather than 41.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Performance Performance related issue
Projects
Status: In Progress
Development

No branches or pull requests

1 participant