[ENH] benchmark gather then sort vs sort then gather in merge with `sort=True` #13630

wence- · 2023-06-28T11:56:18Z

Is your feature request related to a problem? Please describe.

When we request `sort=True` in a `cudf.merge`, the current implementation does the following:

1. deduce left and right join columns
2. join, producing left and right gather maps
3. gather left and right columns, and merge results
4. deduce key columns to sort by
5. argsort the key columns
6. gather the result using the argsort return value

Trivially, steps 5 and 6 can be merged into a single sort_by_key (that's #13557). However, this order probably moves more data than it needs to. It makes two calls to gather and one sort-by-key, at the cost of moving the full dataframe through memory twice (once in step 3, once in step 6).

Instead, we could (if sorting) first gather only the key columns we will sort by, argsort those and then use that ordering to sort the left and right gather maps.

1. deduce left and right join columns
2. join, producing left and right gather maps
3. deduce left and right key columns to order by
4. gather left key columns with left map, right key columns with right map
5. sort-by-key the left and right gather maps with the columns from step 4
6. gather left and right columns with new gather maps and merge
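To make the two orderings concrete, here is a minimal pure-Python sketch on plain lists. It is not cudf or libcudf code: `gather`, `argsort`, `inner_join_maps`, `gather_then_sort` and `sort_then_gather` are hypothetical helpers written for illustration, the join is a naive inner join on a single key column named `k`, and the sort is assumed to be stable so that both orderings give identical output.

```python
def gather(column, gather_map):
    """Return column[i] for each index i in gather_map."""
    return [column[i] for i in gather_map]


def argsort(keys):
    """Stable argsort: indices that would sort keys."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def inner_join_maps(left_keys, right_keys):
    """Naive inner join producing left and right gather maps (illustration only)."""
    pairs = [
        (i, j)
        for i, lk in enumerate(left_keys)
        for j, rk in enumerate(right_keys)
        if lk == rk
    ]
    return [i for i, _ in pairs], [j for _, j in pairs]


def gather_all(left, right, left_map, right_map):
    """Gather every column of both tables and merge into one result table."""
    out = {f"{name}_l": gather(col, left_map) for name, col in left.items()}
    out.update({f"{name}_r": gather(col, right_map) for name, col in right.items()})
    return out


def gather_then_sort(left, right, key):
    """Current order: full gather, argsort the key, full gather again."""
    left_map, right_map = inner_join_maps(left[key], right[key])
    joined = gather_all(left, right, left_map, right_map)  # first full pass
    order = argsort(joined[f"{key}_l"])
    return {name: gather(col, order) for name, col in joined.items()}  # second full pass


def sort_then_gather(left, right, key):
    """Proposed order: gather keys only, reorder the gather maps, one full gather."""
    left_map, right_map = inner_join_maps(left[key], right[key])
    keys = gather(left[key], left_map)    # only the key column moves here
    order = argsort(keys)
    left_map = gather(left_map, order)    # reorder the two gather maps
    right_map = gather(right_map, order)
    return gather_all(left, right, left_map, right_map)  # single full pass


left = {"k": [3, 1, 2], "a": [30, 10, 20]}
right = {"k": [2, 3, 1, 3], "b": ["x", "y", "z", "w"]}

current = gather_then_sort(left, right, "k")
proposed = sort_then_gather(left, right, "k")
```

In this toy example both functions return the same sorted table; the difference is only in how many times the non-key columns (`a` and `b`) are gathered.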

This makes four calls to gather and one sort-by-key, but only moves the full dataframe through memory once (in step 6). For dataframes with many non-key columns this might well be an advantage. The latency will be a bit higher, but the total data movement will be less. For example, consider (for simplicity) a left join with one key column and 10 total columns in both left and right dataframes.

The current approach (once the left and right gather maps have been determined) gathers 20 columns in step 3, argsorts one column, then gathers 20 columns again (at the libcudf level, sort-by-key is itself an argsort followed by a gather, so merging steps 5 and 6 does not change this count).

The proposed alternative would gather 1 column in step 4, sort-by-key two columns (the two gather maps), then gather 20 columns. So we effectively move 23 columns through memory rather than 41.
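The column counts can be reproduced with a small back-of-the-envelope model. This is only the simplified counting used in this issue (every column counts as one unit, regardless of dtype or row count), and the function names are hypothetical.

```python
def columns_moved_current(n_left, n_right, n_sort_keys):
    """Gather everything, argsort the keys, gather everything again."""
    full_table = n_left + n_right
    return full_table + n_sort_keys + full_table


def columns_moved_proposed(n_left, n_right, n_sort_keys, n_gather_maps=2):
    """Gather the keys, sort-by-key the gather maps, gather everything once."""
    full_table = n_left + n_right
    return n_sort_keys + n_gather_maps + full_table


# Left join, one key column, 10 total columns in each dataframe.
current_cost = columns_moved_current(10, 10, 1)
proposed_cost = columns_moved_proposed(10, 10, 1)
print(current_cost, proposed_cost)
```

Under this model the saving grows with the number of non-key columns, since the proposed order drops one full pass over the table in exchange for sorting the two gather maps.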


wence- added the "feature request" (New feature or request) and "Performance" (Performance related issue) labels Jun 28, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Jun 28, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Jun 28, 2023

wence- mentioned this issue Nov 23, 2023

[ENH] Audit cudf APIs for use of inappropriate algorithms #14479

Open
