Is your feature request related to a problem? Please describe.
When we request `sort=True` in a `cudf.merge`, the current implementation does:

1. deduce left and right join columns
2. join, producing left and right gather maps
3. gather left and right columns, and merge results
4. deduce key columns to sort by
5. argsort the key columns
6. gather the result using the argsort return value

Trivially, steps 5 and 6 can be merged into a `sort_by_key` (that's #13557). However, this order probably does more data movement than it needs to. This makes two calls to gather, and one sort-by-key, at the cost of moving the full dataframe through memory twice (once in step 3, once in step 6).
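
To make the data movement concrete, here is a minimal NumPy sketch of the current ordering. It is not the cudf implementation: `inner_join_maps` and `merge_sorted_current` are hypothetical names, dataframes are modelled as plain dicts of arrays, an inner join on a single integer key is assumed, and non-key column names are assumed not to clash.

```python
import numpy as np


def inner_join_maps(left_keys, right_keys):
    """Hypothetical helper: (left_map, right_map) row indices of an inner join."""
    # All-pairs comparison, O(n*m); good enough for an illustration.
    left_map, right_map = np.nonzero(left_keys[:, None] == right_keys[None, :])
    return left_map, right_map


def merge_sorted_current(left, right, key):
    """Current ordering. `left`/`right` map column name -> 1-D array."""
    # Steps 1-2: join on the key column to obtain the gather maps.
    left_map, right_map = inner_join_maps(left[key], right[key])
    # Step 3: gather every column (first full pass over the data).
    result = {name: col[left_map] for name, col in left.items()}
    result.update(
        {name: col[right_map] for name, col in right.items() if name != key}
    )
    # Steps 4-5: argsort the key column of the merged result.
    order = np.argsort(result[key], kind="stable")
    # Step 6: gather every column again (second full pass over the data).
    return {name: col[order] for name, col in result.items()}
```

Every column of the result is touched in both step 3 and step 6, which is the double pass described above.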
Instead, we could (if sorting) first gather only the key columns we will sort by, argsort those and then use that ordering to sort the left and right gather maps.

1. deduce left and right join columns
2. join, producing left and right gather maps
3. deduce left and right key columns to order by
4. gather left key columns with left map, right key columns with right map
5. sort-by-key the left and right gather maps with the columns from step 4
6. gather left and right columns with new gather maps and merge

This makes four calls to gather and one sort-by-key, but only moves the full dataframe through memory once (in step 6). For dataframes with many non-key columns this might well be an advantage. The latency will be a bit higher, but the total data movement will be less. For example, consider (for simplicity) a left join with one key column and 10 total columns in both left and right dataframes.
The current approach (once the left and right gather maps have been determined) gathers 20 columns in step 3, argsorts one column, then gathers 20 columns again (sort-by-key merges the sort + gather into argsort + gather at the libcudf level).
The proposed alternative would gather 1 column in step 4, sort-by-key two columns (the two gather maps), then gather 20 columns. So we effectively move 23 columns through memory rather than 41.
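
The proposed ordering and the column counts above can be sketched as follows. This is an illustration under stated assumptions, not cudf code: `inner_join_maps`, `merge_sorted_proposed` and `columns_moved` are hypothetical names, dataframes are dicts of NumPy arrays, and an inner join on one integer key is used instead of the left join in the example (so gathering the left key alone is enough to define the order).

```python
import numpy as np


def inner_join_maps(left_keys, right_keys):
    """Hypothetical helper: (left_map, right_map) row indices of an inner join."""
    left_map, right_map = np.nonzero(left_keys[:, None] == right_keys[None, :])
    return left_map, right_map


def merge_sorted_proposed(left, right, key):
    """Proposed ordering. `left`/`right` map column name -> 1-D array."""
    # Steps 1-2: join on the key column to obtain the gather maps.
    left_map, right_map = inner_join_maps(left[key], right[key])
    # Steps 3-4: gather only the key column we will sort by.
    sort_keys = left[key][left_map]
    # Step 5: sort both gather maps by the gathered keys.
    order = np.argsort(sort_keys, kind="stable")
    left_map, right_map = left_map[order], right_map[order]
    # Step 6: the only full pass over the data, using the sorted maps.
    result = {name: col[left_map] for name, col in left.items()}
    result.update(
        {name: col[right_map] for name, col in right.items() if name != key}
    )
    return result


def columns_moved(n_total, n_keys=1):
    """Column counts from the example: (current, proposed)."""
    current = n_total + n_keys + n_total   # gather all, argsort keys, gather all
    proposed = n_keys + 2 + n_total        # gather keys, sort 2 maps, gather all
    return current, proposed
```

With 20 columns in total and one key column, `columns_moved(20)` returns `(41, 23)`, matching the figures in the example. The saving grows with the number of non-key columns, since only the final gather scales with them.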