cancel parent context if timeout on sending to results channel #68
Comments
Another very subtle thing about using WithTimeout is that the timer starts as soon as that function returns. This means that only the worker can start the timer, at the point where it is about to post the output to the output channel. My initial thought of building the timeout into the parent context is therefore the incorrect choice: only the worker knows when the output is being sent, the parent has no idea (nor should it), and there is a disconnect between when the parent would start the timer and when the worker sends the output. Many things can happen between these two points, and it would be silly to try to account for that delay in the send timeout.

So the crux of this is that the parent passes in its context, the associated cancel function and an output write timeout duration. The worker receives all of these, creates its own child context from the parent using WithTimeout and, in the select statement, if the timeout is triggered, it invokes the parent cancel function, which triggers a full cancellation of the whole worker pool. A minimal sketch of this pattern appears below; the actual worker code is shown later in this thread.

The worker seeing the timeout will immediately exit its run loop and send its finished notification to the pool. The pool will not yet be in its drain phase, so it will not read this notification immediately; first it reacts to the cancellation triggered by the worker, at which point it closes the forwarding jobs channel. If one worker experiences a timeout on the output send, then all the others will experience the same timeout; however, the other workers will see the cancellation and exit their run loops. If a worker is already executing its job at the time the cancellation comes in, it will not see the cancellation immediately, so there can be a reaction delay if the job is (relatively) long running. The producer runs under the same parent context, so it will also stop producing jobs.

The fact that you can't store a context in a struct means that implementing this fix will have rippling effects, but this can't be avoided.
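A minimal sketch of that pattern, under the assumptions described above (the helper name sendWithTimeout and the channel name outputsCh are illustrative, not the actual pool API; the real worker code appears further down in this thread):

import (
    "context"
    "errors"
    "time"
)

// sendWithTimeout illustrates the pattern: the worker derives a child context
// from the parent only when the result is ready to be sent, so the timeout
// clock starts at the right moment.
func sendWithTimeout[O any](
    parentContext context.Context,
    parentCancel context.CancelFunc,
    outputsCh chan<- O,
    outputChTimeout time.Duration,
    result O,
) error {
    sendContext, cancel := context.WithTimeout(parentContext, outputChTimeout)
    defer cancel()

    select {
    case outputsCh <- result:
        return nil
    case <-sendContext.Done():
        // Nothing consumed the result in time (or the parent was already
        // cancelled); cancel the parent so the producer and the other
        // workers also stop.
        parentCancel()
        return errors.New("timeout on send")
    }
}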
The following test:

FEntry(nil, &poolTE{ // Intermittent
    // timeout on error
    //
    given:           "finish by cancel",
    should:          "receive and process all",
    now:             2,
    interval:        8000,
    outputsChSize:   OutputsChSize,
    after:           time.Second / 5,
    outputChTimeout: time.Second * 10,
    finish:          finishWithCancel,
    summarise:       summariseWithConsumer,
    assert:          assertCancelled,
}),

fails intermittently:
... and this is what we see when the same test succeeds:
Actually, to be clear about the above, we DO want an error to be returned, but the test case is coded to expect no error, which is clearly fixable. The problem here is the intermittency. To make the unit test more robust, we must make sure that the consumer is not consuming. We can see from the console output that it is consuming, which makes it harder to force the condition we're trying to test. It also looks like the worker is seeing a Done:
... and the worker code:
The 'done received 💥💥💥' is short-circuiting the select and does not allow the timeout to occur. But what we can also see is that there is still a send timeout occurring, just on a different worker. So the timeout is working; the problem is the way the worker pool is reporting its final error, because of the race condition between the workers. The pool has to handle the worker cancellations and correctly decide what the proper error result should be.
But sometimes we see this:
... there is no timeout on send, but also no done received. I think we need to fix two problems here: first, how the pool handles worker termination notifications, and second, the consumer shouldn't be consuming.
In the unit tests, we need to make it clear what the interval relates to. At the moment, having a single variable called interval is non-specific. We need a producer interval and a consumer interval to make it much clearer:

producer: pipe.produce(parentContext, interval, provider)
consumer: this needs to be fixed; the consumer is not currently configured with an interval.

But we can see the extent to which the consumer is consuming by looking at the count on the consumer at the end, eg:
We need to turn the intervals into durations, as they are currently just ints.
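For example, the test entry could carry explicit durations; a sketch of the refactored shape, where the specific field names are assumptions based on the entry shown earlier:

import "time"

// Hypothetical shape of the test entry after the refactor: explicit,
// separately named durations instead of a single untyped interval int.
type poolTE struct {
    given            string
    should           string
    now              int
    producerInterval time.Duration // pace at which the producer emits jobs
    consumerInterval time.Duration // pace at which the consumer drains outputs
    outputsChSize    int
    after            time.Duration
    outputChTimeout  time.Duration
    // finish, summarise and assert members omitted for brevity
}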
After refactoring the test durations, we have this failing test (the same one as indicated before), but now:
which results in failure (no error is returned, even though we expect a timeout):
What this shows us is:
With these attributes, one would expect that the worker should bail after a second of attempting to send the result to the output channel, but it doesn't. The thing that controls the test run length is the consumer interval, so our worker is not timing out. The relevant worker code:

func (w *worker[I, O]) invoke(parentContext context.Context,
parentCancel context.CancelFunc,
outputChTimeout time.Duration,
job Job[I],
) error {
var err error
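	// The send-timeout clock starts here, when WithTimeout is called; note
	// that this is before the job is executed below, so the timeout window
	// includes the job's execution time.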
outputContext, cancel := context.WithTimeout(parentContext, outputChTimeout)
defer cancel()
result, _ := w.exec(job)
if w.outputsChOut != nil {
fmt.Printf(" ---> 🚀 worker.invoke ⏰ output timeout: '%v'\n", outputChTimeout)
select {
case w.outputsChOut <- result:
case <-parentContext.Done():
fmt.Printf(" ---> 🚀 worker.invoke(%v)(cancel) - done received 💥💥💥\n", w.id)
case <-outputContext.Done():
// THIS SHOULD TIMEOUT, BUT DOES NOT, THIS IS HERE TO ENSURE
// THAT THE RESULT SEND ON w.outputsChOut HAPPENS IN A TIMELY
// MANNER.
fmt.Printf(" ---> 🚀 worker.invoke(%v)(cancel) - timeout on send 👿👿👿\n", w.id)
err = errors.New("timeout on send")
parentCancel()
}
}
return err
}
This looks to have been fixed by issue #91, i.e. using the atomic.AddInt32 function to increment and decrement the wait group counter, instead of using the unsafe ++/-- operators. With fix:
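For reference, a minimal sketch of the kind of counter change issue #91 refers to (the names here are illustrative, not the actual pool fields):

import "sync/atomic"

// counter++ / counter-- are not safe when touched from multiple goroutines;
// atomic.AddInt32 makes the increment and decrement race-free.
var counter int32

func increment() { atomic.AddInt32(&counter, 1) }
func decrement() { atomic.AddInt32(&counter, -1) }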
The remaining work to do on this is to check the deactivated tests to ensure they all execute properly.
We may be able to improve the design of this feature. We should not have to pass the parent cancel function in to the child; all we need to do is set a timeout on the parent context using context.WithTimeout. A sketch of what this would look like follows.
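A hedged sketch of that alternative, assuming hypothetical names (send, outputsCh): with the timeout baked into the parent context, the worker never needs the cancel function and only watches Done().

import "context"

// Hypothetical worker send when the deadline lives on the parent context
// itself: no parentCancel is passed down.
func send[O any](parentContext context.Context, outputsCh chan<- O, result O) error {
    select {
    case outputsCh <- result:
        return nil
    case <-parentContext.Done():
        // Either the deadline expired or something else cancelled the pool.
        return parentContext.Err()
    }
}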
This should be addressed in a fresh issue.
This is related to extendio issue 303. If the client needs to consume the results of the worker pool, then they will need to ensure that the results channel is consumed. However, if for some reason the client is miscoded, i.e. provides a results channel but doesn't consume it, then the worker pool will simply deadlock.
The deadlock should be avoided, and can be, by introducing a timeout on the result channel send performed by the worker. If this timeout occurs, then a parent cancel function should be invoked. To achieve this, the client will need to specify a time duration value representing the required timeout (with a sensible default) and pass in the parent context along with its associated cancel function, obtained from context.WithCancel.
Or perhaps the client only needs to send a parent context that has the timeout built into it (context.WithTimeout). Try this option too, to see which is best; both options are sketched below.
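A sketch of the two client-side setups, under the assumption of a 2 second timeout purely for illustration (the function names optionOne/optionTwo are hypothetical):

import (
    "context"
    "time"
)

// Option 1: the client creates a cancellable parent context and supplies the
// send timeout as a separate duration, both of which are handed to the pool.
func optionOne() (context.Context, context.CancelFunc, time.Duration) {
    parentContext, parentCancel := context.WithCancel(context.Background())
    outputChTimeout := 2 * time.Second
    return parentContext, parentCancel, outputChTimeout
}

// Option 2: the client bakes the timeout into the parent context itself and
// the pool receives only the context.
func optionTwo() (context.Context, context.CancelFunc) {
    return context.WithTimeout(context.Background(), 2*time.Second)
}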
In order to complete this, we need a way to detect the execution result of the pool itself. We can do this by introducing a new channel that the pool writes to after it completes. The client then reads the result from this channel after progressing past the wait group. Each worker can send an additional reason back to the pool to indicate the nature of its finish notification. Ordinarily, a worker will finish because there are no more jobs, in which case it returns a nil error. If there is a timeout on sending to the output channel, then it will cancel its parent and then end its loop, sending back a timeout error. With this in place, we can then write meaningful test cases that can discriminate the reason for the end of the worker pool processing and assert accordingly.
The pool result channel must be buffered, so that the pool won't be blocked attempting to write to it. It must also write this result before invoking Done on the wait group. A sketch of this plumbing follows.
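A minimal sketch of that plumbing, assuming hypothetical names (PoolResult, resultCh, finish), not the actual pool API:

import "sync"

// PoolResult is the value the pool publishes when it completes.
type PoolResult struct {
    Err error // nil when all workers finished normally, non-nil on e.g. a send timeout
}

type pool struct {
    resultCh chan PoolResult
    wg       *sync.WaitGroup
}

func newPool(wg *sync.WaitGroup) *pool {
    return &pool{
        // Buffered with capacity 1 so the pool is never blocked writing the result.
        resultCh: make(chan PoolResult, 1),
        wg:       wg,
    }
}

func (p *pool) finish(err error) {
    // Write the result before signalling the wait group, so the client can
    // safely read it immediately after wg.Wait() returns.
    p.resultCh <- PoolResult{Err: err}
    p.wg.Done()
}

On the client side, the flow would then be: wg.Wait(), read the result from the channel, and assert on its Err field to discriminate a normal finish from a send timeout.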