BUG: handle segfault with hierarchical kernel when team size is larger than available threads #185

Open

tylerjereddy opened this issue Mar 19, 2023 · 6 comments
Labels: bug

tylerjereddy (Contributor) commented Mar 19, 2023

As noted here: #146 (comment)

If you write a PyKokkos kernel that uses team barrier synchronization (and probably other related hierarchical parallelism features), it seems you can get a hard segfault when OMP_NUM_THREADS=1 is set in your environment.

While Kokkos core probably has a case for not behaving better here, since the code is already compiled, if we have ahead-of-compile-time knowledge of the number of threads that will be available, I wonder if we should do something more useful than segfaulting by default (a rough sketch of one option follows).
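For example (just a sketch; the helper name and the way the available thread count is derived are made up here, not existing PyKokkos API), a guard like this could raise a regular Python error before dispatch:

import os


def check_requested_team_size(requested_team_size: int) -> None:
    # Hypothetical pre-dispatch guard: compare the requested team size
    # against the number of threads OpenMP is likely to provide.
    env_threads = os.environ.get("OMP_NUM_THREADS")
    available = int(env_threads) if env_threads else (os.cpu_count() or 1)
    if requested_team_size > available:
        raise RuntimeError(
            f"requested team size {requested_team_size} exceeds the "
            f"{available} available OpenMP thread(s); refusing to launch "
            "rather than segfaulting"
        )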

I checked that deleting the barrier syncs isn't sufficient to make the segfault go away, so something broader about the hierarchical kernel is likely to blame.

Copy of the crashing workunit below the fold, in case it gets mutated a lot in the matching PR:

@pk.workunit
def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                               k_a: int,
                               alpha: float,
                               view_a: pk.View2D[pk.double],
                               view_b: pk.View2D[pk.double],
                               out: pk.View2D[pk.double]):
    printf("tiled workunit checkpoint 1")
    # early attempt at tiled matrix multiplication in PyKokkos

    # for now, let's assume a 2x2 tiling arrangement and
    # that `view_a`, `view_b`, and `out` views are all 4 x 4 matrices
    tile_size: int = 4 # this is really just the team size...
    width: int = 4

    # start off by getting a global thread id
    global_tid: int = team_member.league_rank() * team_member.team_size() + team_member.team_rank()
    printf("tiled workunit checkpoint 2 for thread id: %d\n", global_tid)

    # TODO: I have no idea how to get 2D scratch memory views?
    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    scratch_mem_b: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    printf("tiled workunit checkpoint 3 for thread id: %d\n", global_tid)
    # in a 4 x 4 matrix with 2 x 2 tiling the leagues
    # and teams have matching row/col assignment approaches
    bx: int = team_member.league_rank() // 2
    by: int = 0
    if team_member.league_rank() % 2 != 0:
        by = 1
    tx: int = team_member.team_rank() // 2
    ty: int = 0
    if team_member.team_rank() % 2 != 0:
        ty = 1
    tmp: float = 0
    col: int = by * 2 + ty
    row: int = bx * 2 + tx
    printf("tiled workunit checkpoint 4 for thread id: %d\n", global_tid)

    # these variables are a bit silly--can we not get
    # 2D scratch memory indexing?
    a_index: int = 0
    b_index: int = 0

    for i in range(out.extent(1) // 2):
        scratch_mem_a[team_member.team_rank()] = view_a[row][i * 2 + ty]
        scratch_mem_b[team_member.team_rank()] = view_b[i * 2 + tx][col]
        printf("tiled workunit checkpoint 5 for thread id: %d\n", global_tid)
        team_member.team_barrier()
        printf("tiled workunit checkpoint 6 for thread id: %d\n", global_tid)

        for k in range(2):
            a_index = k + ((team_member.team_rank() // 2) * 2)
            b_index = ty + (k * 2)
            tmp += scratch_mem_a[a_index] * scratch_mem_b[b_index]
            team_member.team_barrier()
            printf("tiled workunit checkpoint 7 for thread id: %d\n", global_tid)

    printf("tiled workunit checkpoint 8 for thread id: %d\n", global_tid)
    out[row][col] = tmp
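For reference, the launch would look roughly like the following: a 4 x 4 problem with 2 x 2 tiles gives 4 teams (the league) of 4 threads each, and that team size of 4 is what exceeds a single available OpenMP thread. Treat this as a sketch; the TeamPolicy constructor arguments and the keyword-argument dispatch are written from memory and should be checked against the matching PR, and the scratch-memory sizing for the ScratchView1D allocations is omitted here:

import pykokkos as pk

# 4 x 4 matrices with 2 x 2 tiles -> 4 tiles, each handled by a team of 4
view_a = pk.View([4, 4], pk.double)
view_b = pk.View([4, 4], pk.double)
out = pk.View([4, 4], pk.double)

league_size: int = 4
team_size: int = 4  # larger than OMP_NUM_THREADS=1, which is what crashes

policy = pk.TeamPolicy(league_size, team_size)  # assumed constructor signature
pk.parallel_for(policy, dgemm_impl_tiled_no_view_c, k_a=4, alpha=1.0,
                view_a=view_a, view_b=view_b, out=out)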
tylerjereddy added the "bug" label on Mar 19, 2023
tylerjereddy changed the title from "BUG: handle segfault with barrier sync on single thread" to "BUG: handle segfault with hierarchical kernel on single thread" on Mar 19, 2023
tylerjereddy (Contributor, Author) commented:

More specifically, if the number of threads set is smaller than the size of a single team, it seems that a segfault is likely.

tylerjereddy (Contributor, Author) commented:

Which also reminds me: we may need pytest skip marks/directives for some tests to stay portable and reasonably performant if they require larger team sizes/thread counts. We'd have to think about whether something would segfault with, say, only 2 physical cores available and a team size of 8, or whether we could simply require that many threads and accept slower runs, keeping the test cases small enough that they don't slow things down too much. A sketch of such a skip guard is below.
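For example (the thread-count helper is ad hoc; the skipif marker itself is standard pytest):

import os

import pytest

REQUIRED_TEAM_SIZE = 8  # example requirement for a hierarchical test


def available_threads() -> int:
    # Prefer an explicit OMP_NUM_THREADS setting, otherwise fall back to
    # the CPU count the machine reports.
    env = os.environ.get("OMP_NUM_THREADS")
    return int(env) if env else (os.cpu_count() or 1)


@pytest.mark.skipif(available_threads() < REQUIRED_TEAM_SIZE,
                    reason="team size exceeds available OpenMP threads")
def test_tiled_dgemm():
    ...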

tylerjereddy changed the title from "BUG: handle segfault with hierarchical kernel on single thread" to "BUG: handle segfault with hierarchical kernel when team size is larger than available threads" on Mar 19, 2023
tylerjereddy (Contributor, Author) commented:

I think Christian mentioned that even in the core C++ this should produce a proper runtime error, not a segfault.

JBludau commented Mar 21, 2023

Hmm, so far I haven't gotten the debugger to help me much; it mostly drowns me in messages.

JBludau commented Mar 22, 2023

So, the debugger shows that Kokkos core errors out with Kokkos::abort("Kokkos::abort: Requested Team Size is too large!"). Unfortunately the message does not make it up through Python. I'm not sure whether this turns into a segfault because the message (which was passed in as a char*) is no longer reachable by the end.

JBludau commented Mar 22, 2023

It is a fairly complex control flow when the library we load calls abort and that propagates through the different layers of Python. To me this looks neither transparent, nor do I know of anything we can do once abort has been called...
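One idea, sketched here only as a possible workaround and not something PyKokkos does today (the script name is hypothetical), would be to launch the risky kernel in a child process so the abort only takes down the child, and the parent can turn the exit status into a readable Python error:

import subprocess
import sys

# If the loaded Kokkos library calls abort(), only the child process dies
# (with SIGABRT, which subprocess reports as a negative return code), and
# the parent can report a clean error instead of crashing outright.
result = subprocess.run([sys.executable, "run_tiled_dgemm.py"])  # hypothetical script
if result.returncode < 0:
    raise RuntimeError(
        f"kernel launch died with signal {-result.returncode}; the requested "
        "team size likely exceeds the available threads"
    )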
