
Optimize merge algorithm for data sizes equal to or greater than 4M items #1933

Draft
wants to merge 10 commits into main from dev/skopienko/optimize_merge_to_main

Conversation

@SergeyKopienko (Contributor) commented Nov 6, 2024

In this PR we optimize the merge algorithm for data sizes equal to or greater than 4M items.
The main idea is that we do two submits:

  1. In the first submit we find split points only on a subset of "base" diagonals.
  2. In the second submit we find split points on all other diagonals and run a serial merge for each diagonal (as before).
     But when we look for the split point on the current diagonal, we set up index limits for __rng1 and __rng2:
     for these limits we load the split-point data of the previous and next "base" diagonals calculated in step (1).

Applying this approach we get a good performance gain for the biggest data sizes with float and int data types; a host-side sketch of the two-submit idea is shown below.

As an additional benefit, we also get a significant performance boost for small and middle data sizes in the merge_sort algorithm.
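
Below is a minimal host-side sketch of the two-submit idea, written as ordinary serial C++ rather than the actual SYCL kernels in this PR; the names (find_split_point, two_phase_split_points, chunk, base_step) are illustrative assumptions, not identifiers from the patch. Phase 1 computes split points only on every base_step-th ("base") diagonal; phase 2 then handles every diagonal, but each binary search is bounded by the two surrounding base split points.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using split_point_t = std::pair<std::size_t, std::size_t>;

// Merge-path search: on diagonal `diag` (0 <= diag <= rng1.size() + rng2.size())
// find (r1, r2) with r1 + r2 == diag such that merging the first r1 items of rng1
// with the first r2 items of rng2 yields the first `diag` items of the output.
// The search over r1 is restricted to [lo1, hi1]; phase 2 passes in the bounds
// obtained from the neighbouring "base" diagonals.
template <typename T, typename Comp>
split_point_t
find_split_point(const std::vector<T>& rng1, const std::vector<T>& rng2, std::size_t diag,
                 std::size_t lo1, std::size_t hi1, Comp comp)
{
    std::size_t lo = std::max(lo1, diag > rng2.size() ? diag - rng2.size() : std::size_t{0});
    std::size_t hi = std::min({hi1, diag, rng1.size()});
    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;  // candidate r1
        if (comp(rng2[diag - mid - 1], rng1[mid]))   // rng1[mid] belongs after rng2[diag - mid - 1]
            hi = mid;
        else
            lo = mid + 1;
    }
    return {lo, diag - lo};
}

// Two-phase computation of split points for all diagonals spaced `chunk` output
// items apart (chunk and base_step assumed > 0).  Phase 1 ("first submit") handles
// only every base_step-th diagonal; phase 2 ("second submit") handles every
// diagonal, but each search is bounded by the two surrounding base split points.
template <typename T, typename Comp>
std::vector<split_point_t>
two_phase_split_points(const std::vector<T>& rng1, const std::vector<T>& rng2,
                       std::size_t chunk, std::size_t base_step, Comp comp)
{
    const std::size_t n = rng1.size() + rng2.size();
    const std::size_t diag_count = (n + chunk - 1) / chunk;
    const std::size_t base_count = (diag_count + base_step - 1) / base_step + 1;

    // Phase 1: unbounded searches, but only on the "base" diagonals.
    std::vector<split_point_t> base(base_count);
    for (std::size_t i = 0; i < base_count; ++i)
        base[i] = find_split_point(rng1, rng2, std::min(i * base_step * chunk, n),
                                   std::size_t{0}, rng1.size(), comp);

    // Phase 2: every diagonal, each search narrowed to its base-diagonal window.
    std::vector<split_point_t> sp(diag_count);
    for (std::size_t d = 0; d < diag_count; ++d)
    {
        const std::size_t prev = d / base_step;
        sp[d] = find_split_point(rng1, rng2, std::min(d * chunk, n),
                                 base[prev].first, base[prev + 1].first, comp);
    }
    return sp;
}
```

In the PR itself these two phases are two device submits rather than host loops, but the essential point is the same: the second, per-diagonal search never has to look outside the window defined by two precomputed base split points.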

@SergeyKopienko added this to the 2022.8.0 milestone Nov 6, 2024
@SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch 5 times, most recently from a6164fd to d4721ca on November 7, 2024 12:24
auto __scratch_idx = __global_idx / __base_diag_part;

_split_point_t __start;
if (__global_idx % __base_diag_part != 0)
Contributor
We discussed offline the approach of partitioning based on SLM size and then working within the partitioned blocks in the second kernel.

One advantage of this method (beyond working within SLM for all diagonals in this kernel) is that there would be no work-item divergence from a branch and mod operation like this one. The first partitioning kernel would be lightweight, essentially only establishing bounds for the second kernel. The second kernel would then work within SLM-loaded data, find the split points for all diagonals within that block, and do the serial merges, so all work-items could execute the same path (with a possible exception for the zeroth work-item); see the sketch after this comment.
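
As a rough illustration of this alternative (not the PR's actual code), the second kernel could be structured roughly as below; every name here (__second_kernel_sketch, __base_sp, __slm_items, and so on) is a hypothetical placeholder, the first kernel is assumed to have written per-block bounds into __base_sp, the inputs are assumed to be USM pointers, and the kernel body is reduced to the uniform steps each work-item would perform.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>
#include <utility>

// Skeleton of the "second kernel": one work-group per block, __slm_items assumed
// large enough to hold a block delimited by the first kernel's bounds.
template <typename _T, typename _Comp>
sycl::event
__second_kernel_sketch(sycl::queue __q, const _T* __in1, const _T* __in2, _T* __out,
                       const std::pair<std::size_t, std::size_t>* __base_sp,
                       std::size_t __slm_items, std::size_t __wg_count, std::size_t __wg_size,
                       _Comp __comp)
{
    return __q.submit([&](sycl::handler& __cgh) {
        sycl::local_accessor<_T, 1> __slm(sycl::range<1>(__slm_items), __cgh);
        __cgh.parallel_for(
            sycl::nd_range<1>(sycl::range<1>(__wg_count * __wg_size), sycl::range<1>(__wg_size)),
            [=](sycl::nd_item<1> __it) {
                const std::size_t __grp = __it.get_group(0);
                // Block bounds in __in1 established by the first partitioning kernel.
                const auto __lo = __base_sp[__grp];
                const auto __hi = __base_sp[__grp + 1];
                // 1. Cooperative copy of the block into SLM: a uniform, strided loop
                //    executed identically by every work-item (no branch/mod divergence).
                //    The matching sub-range of __in2 would be staged the same way.
                for (std::size_t __i = __it.get_local_id(0); __i < __hi.first - __lo.first;
                     __i += __wg_size)
                    __slm[__i] = __in1[__lo.first + __i];
                sycl::group_barrier(__it.get_group());
                // 2. Every work-item would find the split point of its own diagonal
                //    inside the SLM-resident block and serially merge its chunk into
                //    __out (omitted here; only the uniform structure matters).
            });
    });
}
```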

std::forward<_ExecutionPolicy>(__exec), std::forward<_Range1>(__rng1), std::forward<_Range2>(__rng2),
std::forward<_Range3>(__rng3), __comp);
}
else
Contributor

Can we remove this else branch and its kernel? std::uint32_t has a well-defined maximum, and we know __n < 4 * 1'048'576 in this branch, so it can always be indexed with this type.
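
For illustration only, the kind of index-type dispatch being discussed could be sketched as below; the helper and threshold names are hypothetical, not the identifiers used in this PR.

```cpp
// Hypothetical sketch: pick the index type for the merge kernel from the input size.
#include <cstddef>
#include <cstdint>

inline constexpr std::size_t __large_submitter_limit = 4 * 1'048'576; // 4M items

template <typename _IndexT>
struct __index_tag
{
    using __type = _IndexT;
};

template <typename _Submit>
void
__dispatch_by_index_type(std::size_t __n, _Submit __submit)
{
    if (__n >= __large_submitter_limit)
        __submit(__index_tag<std::uint64_t>{}); // "large" path (two submits)
    else
        // __n < 4 * 1'048'576 always fits into std::uint32_t, so a single
        // 32-bit-indexed kernel covers everything below the threshold.
        __submit(__index_tag<std::uint32_t>{});
}
```

A caller would pass a generic lambda and recover the index type from the tag's nested __type alias; the point is simply that no extra kernel is needed for index types narrower than 32 bits unless it measurably pays off.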

Contributor Author

Good point.
But another solution: we could also use the std::uint16_t type for smaller data sizes.

Contributor

Do you have an estimate of how much performance benefit the 16-bit indexing kernel brings? I think it would be best to weigh the impact of this kernel against the increase in JIT time. If the performance benefit is significant, then I am in favor of keeping it.

Contributor Author

No perf profit, removed.

@SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch from 0eb4649 to 129898c on November 18, 2024 15:10
…introduce new function __find_start_point_in

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…using __parallel_merge_submitter_large for merge data equal or greater then 4M items

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix compile error

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix Kernel names

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…rename template parameter names in __parallel_merge_submitter

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix review comment

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
@SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch from 8f756f0 to 1b6cd34 on November 19, 2024 08:24
…fix review comment

Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
3 participants