Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ev/gather fibers are fully canceled on error #1181

Merged
merged 1 commit into from
Jun 4, 2023

Conversation

chris-chambers
Copy link
Contributor

When an error occurs in an ev/gather form, it is possible that not all the sibling fibers are canceled. This can lead to process hangs, if an uncanceled fiber blocks.

The problem is that each is not well-defined for tables that are being mutated. So, sometimes the (put fibers f nil) in cancel-all causes iteration to skip entries. It's not deterministic since the hash of a fiber is based on its address. For integers, it is deterministic, and you can see the effect here:

(def t @{1 1
         2 2
         3 3})
(each x t
  (put t x nil))
(pp t) # => @{2 2 3 3}

A script like this can be used to reproduce the process hangs:

#!/bin/bash

# To use a different janet:
#     JANET=./build/janet ./gather-test
: ${JANET:=janet}
attempts=100
hangs=0

for i in $(seq $attempts); do
    timeout 0.5 "$JANET" -e '
(defn hang [] (ev/take (ev/chan)))
(protect
  (ev/gather
   (error :cancel)
   (hang)
   (hang)
   (hang)
   # Three hanging fibers is plenty, but adding more increases the frequency.
   # (hang)
   # (hang)
   # (hang)
   # (hang)
   ))
' || (( hangs++ ))
done

echo "attempts: $attempts"
echo "hangs:    $hangs"

This PR fixes the cancellation problem by first iterating the collection of fibers to call ev/cancel, then clearing the table all at once after. It also changes the semantics of ev/gather to guarantee that all canceled fibers have completed before returning. I think this is desirable but could be convinced otherwise.

@bakpakin
Copy link
Member

bakpakin commented Jun 4, 2023

Thanks, good catch - this is indeed a bug.

It also changes the semantics of ev/gather to guarantee that all canceled fibers have completed before returning. I think this is desirable but could be convinced otherwise.

By default it is probably the right thing to do. I think the old behavior could be useful in some cases, say there is some cleanup behavior that can happen in the background but is certainly more prone to causing bugs. The idea is ev/gather should implement structured concurrency and the new behavior is better in this regard.

@bakpakin bakpakin merged commit 88ba99b into janet-lang:master Jun 4, 2023
@bakpakin bakpakin added the bug This is not expected behavior, and needs fixing label Jun 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This is not expected behavior, and needs fixing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants