BUG: handle segfault with hierarchical kernel when team size is larger than available threads #185

Open

tylerjereddy opened this issue Mar 19, 2023 · 6 comments
Labels: bug

tylerjereddy (Contributor) commented Mar 19, 2023

As noted here: #146 (comment)

If you write a PyKokkos kernel that uses team barrier synchronization (and probably other related hierarchical parallelism features), it seems you can get a hard segfault when OMP_NUM_THREADS=1 is set in your environment.

While Kokkos core probably has a case for not behaving better here, since the code is already compiled, if we have ahead-of-compile-time knowledge of the number of threads that will be available, I wonder if we should do something more useful than segfaulting by default (a rough sketch of one option follows).
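For example (just a sketch; the helper name and the way the available thread count is derived are made up here, not existing PyKokkos API), a guard like this could raise a regular Python error before dispatch:

import os


def check_requested_team_size(requested_team_size: int) -> None:
    # Hypothetical pre-dispatch guard: compare the requested team size
    # against the number of threads OpenMP is likely to provide.
    env_threads = os.environ.get("OMP_NUM_THREADS")
    available = int(env_threads) if env_threads else (os.cpu_count() or 1)
    if requested_team_size > available:
        raise RuntimeError(
            f"requested team size {requested_team_size} exceeds the "
            f"{available} available OpenMP thread(s); refusing to launch "
            "rather than segfaulting"
        )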

I checked that deleting the barrier syncs isn't sufficient to make the segfault go away, so something broader about the hierarchical kernel is likely to blame.

Copy of the crashing workunit below the fold, in case it gets mutated a lot in the matching PR:

@pk.workunit
def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                               k_a: int,
                               alpha: float,
                               view_a: pk.View2D[pk.double],
                               view_b: pk.View2D[pk.double],
                               out: pk.View2D[pk.double]):
    printf("tiled workunit checkpoint 1")
    # early attempt at tiled matrix multiplication in PyKokkos

    # for now, let's assume a 2x2 tiling arrangement and
    # that `view_a`, `view_b`, and `out` views are all 4 x 4 matrices
    tile_size: int = 4 # this is really just the team size...
    width: int = 4

    # start off by getting a global thread id
    global_tid: int = team_member.league_rank() * team_member.team_size() + team_member.team_rank()
    printf("tiled workunit checkpoint 2 for thread id: %d\n", global_tid)

    # TODO: I have no idea how to get 2D scratch memory views?
    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    scratch_mem_b: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    printf("tiled workunit checkpoint 3 for thread id: %d\n", global_tid)
    # in a 4 x 4 matrix with 2 x 2 tiling the leagues
    # and teams have matching row/col assignment approaches
    bx: int = team_member.league_rank() // 2
    by: int = 0
    if team_member.league_rank() % 2 != 0:
        by = 1
    tx: int = team_member.team_rank() // 2
    ty: int = 0
    if team_member.team_rank() % 2 != 0:
        ty = 1
    tmp: float = 0
    col: int = by * 2 + ty
    row: int = bx * 2 + tx
    printf("tiled workunit checkpoint 4 for thread id: %d\n", global_tid)

    # these variables are a bit silly--can we not get
    # 2D scratch memory indexing?
    a_index: int = 0
    b_index: int = 0

    for i in range(out.extent(1) // 2):
        scratch_mem_a[team_member.team_rank()] = view_a[row][i * 2 + ty]
        scratch_mem_b[team_member.team_rank()] = view_b[i * 2 + tx][col]
        printf("tiled workunit checkpoint 5 for thread id: %d\n", global_tid)
        team_member.team_barrier()
        printf("tiled workunit checkpoint 6 for thread id: %d\n", global_tid)

        for k in range(2):
            a_index = k + ((team_member.team_rank() // 2) * 2)
            b_index = ty + (k * 2)
            tmp += scratch_mem_a[a_index] * scratch_mem_b[b_index]
            team_member.team_barrier()
            printf("tiled workunit checkpoint 7 for thread id: %d\n", global_tid)

    printf("tiled workunit checkpoint 8 for thread id: %d\n", global_tid)
    out[row][col] = tmp
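For reference, the launch would look roughly like the following: a 4 x 4 problem with 2 x 2 tiles gives 4 teams (the league) of 4 threads each, and that team size of 4 is what exceeds a single available OpenMP thread. Treat this as a sketch; the TeamPolicy constructor arguments and the keyword-argument dispatch are written from memory and should be checked against the matching PR, and the scratch-memory sizing for the ScratchView1D allocations is omitted here:

import pykokkos as pk

# 4 x 4 matrices with 2 x 2 tiles -> 4 tiles, each handled by a team of 4
view_a = pk.View([4, 4], pk.double)
view_b = pk.View([4, 4], pk.double)
out = pk.View([4, 4], pk.double)

league_size: int = 4
team_size: int = 4  # larger than OMP_NUM_THREADS=1, which is what crashes

policy = pk.TeamPolicy(league_size, team_size)  # assumed constructor signature
pk.parallel_for(policy, dgemm_impl_tiled_no_view_c, k_a=4, alpha=1.0,
                view_a=view_a, view_b=view_b, out=out)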
tylerjereddy added the "bug" label on Mar 19, 2023
tylerjereddy changed the title from "BUG: handle segfault with barrier sync on single thread" to "BUG: handle segfault with hierarchical kernel on single thread" on Mar 19, 2023
tylerjereddy (Contributor, Author) commented:

More specifically, if the number of threads set is smaller than the size of a single team, it seems that a segfault is likely.

tylerjereddy (Contributor, Author) commented:

Which also reminds me: we may need pytest skip marks/directives for some tests to stay portable and reasonably performant if they require larger team sizes/thread counts. We'd have to think about whether something would segfault with, say, only 2 physical cores available and a team size of 8, or whether we could simply require that many threads and accept slower runs, keeping the test cases small enough that they don't slow things down too much. A sketch of such a skip guard is below.
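For example (the thread-count helper is ad hoc; the skipif marker itself is standard pytest):

import os

import pytest

REQUIRED_TEAM_SIZE = 8  # example requirement for a hierarchical test


def available_threads() -> int:
    # Prefer an explicit OMP_NUM_THREADS setting, otherwise fall back to
    # the CPU count the machine reports.
    env = os.environ.get("OMP_NUM_THREADS")
    return int(env) if env else (os.cpu_count() or 1)


@pytest.mark.skipif(available_threads() < REQUIRED_TEAM_SIZE,
                    reason="team size exceeds available OpenMP threads")
def test_tiled_dgemm():
    ...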

tylerjereddy changed the title from "BUG: handle segfault with hierarchical kernel on single thread" to "BUG: handle segfault with hierarchical kernel when team size is larger than available threads" on Mar 19, 2023
tylerjereddy (Contributor, Author) commented:

I think Christian mentioned that even in the core C++ this should produce a proper runtime error, not a segfault.

JBludau commented Mar 21, 2023

Hmm, so far I haven't gotten the debugger to help me much; it mostly drowns me in messages.

JBludau commented Mar 22, 2023

So, the debugger shows that Kokkos core errors out with Kokkos::abort("Kokkos::abort: Requested Team Size is too large!"). Unfortunately the message does not make it up through Python. I'm not sure whether this turns into a segfault because the message (which was passed in as a char*) is no longer reachable by the end.

JBludau commented Mar 22, 2023

It is a fairly complex control flow when the library we load calls abort and that propagates through the different layers of Python. To me this looks neither transparent, nor do I know of anything we can do once abort has been called...
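One idea, sketched here only as a possible workaround and not something PyKokkos does today (the script name is hypothetical), would be to launch the risky kernel in a child process so the abort only takes down the child, and the parent can turn the exit status into a readable Python error:

import subprocess
import sys

# If the loaded Kokkos library calls abort(), only the child process dies
# (with SIGABRT, which subprocess reports as a negative return code), and
# the parent can report a clean error instead of crashing outright.
result = subprocess.run([sys.executable, "run_tiled_dgemm.py"])  # hypothetical script
if result.returncode < 0:
    raise RuntimeError(
        f"kernel launch died with signal {-result.returncode}; the requested "
        "team size likely exceeds the available threads"
    )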
