Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Callback function not executed on first jobs #40

Open
mjrafferty opened this issue Apr 15, 2020 · 12 comments · May be fixed by #45
Open

Callback function not executed on first jobs #40

mjrafferty opened this issue Apr 15, 2020 · 12 comments · May be fixed by #45

Comments

@mjrafferty
Copy link

First I want to thank you for sharing this project, it's a great time saver for creating tools that run asynchronous jobs. I'm using this library in a prompt framework I created:

https://github.com/mjrafferty/apollo-zsh-theme

I'm seeing an issue (though only on my raspberry pi) after the changes in #37 where the callback function isn't being executed on the first prompt load after starting the async worker. Rolling back to ASYNC_VERSION 1.7.2 resolves the issue. I added some debug logging to the _async_job function, and the jobs are still being run and returning good output, but the callback function isn't getting run. The next time the async jobs are run the callback function returns output just fine.

Since I'm only able to reproduce this on a raspberry pi, my guess is something is just lagging a little behind in registering the callback, but adding various sleeps to the code didn't seem to resolve the issue. Are you able to shed some light on how the changes made in #37 might impact that?

@mafredri
Copy link
Owner

Hi @mjrafferty, thanks for reporting.

Could you explain how async is being used in the apollo theme? I took a quick look at the code but things are split up and I have a hard time building a mental model of what order things are happening in.

Generally I'd recommend registering the callback as soon as the async worker is started, that way you shouldn't miss out on any messages. Also, with notify mode (-n) and a registered callback, there's no need to manually invoke async_process_results, I saw it being used here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_precmd#L29-L30

@mafredri
Copy link
Owner

Btw, what version of Zsh are you using on the rpi?

@mjrafferty
Copy link
Author

I'm using zsh 5.8 on the pi, though It works just fine on my mac which is also using 5.8. The async_process_results call you see there is likely mostly pointless. I start the async jobs prior to that during the __apollo_run_prompt function a few lines above that. At that point in the code I don't have a callback function set, as that gets unregistered in preexec and then registered again right there in precmd. The intended purpose of the async_process_results call there is to process anything that might be finished before drawing the prompt and without interrupting precmd execution before it's ready for the output.

The async library is loaded and the worker started here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_load_utilities#L112

And the jobs are initiated during precmd before registereing the callback function here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_get_module_text#L42

@mafredri
Copy link
Owner

mafredri commented Apr 16, 2020

Hmm, that's interesting.

I have two suggestions:

  1. Could you test if it's reproducible with only 32548d3 which excludes the two latter commits that are part of 1.8.0?
  2. On 1.8.0 would the following help:

Adding:

async_process_results $worker $ASYNC_CALLBACKS[$worker] watcher

After this if-statement:

zsh-async/async.zsh

Lines 463 to 473 in a9ca051

if [[ ! -o interactive ]] || [[ ! -o zle ]]; then
trap '_async_notify_trap' WINCH
elif [[ -o interactive ]] && [[ -o zle ]]; then
local fd w
for fd w in ${(@kv)ASYNC_PTYS}; do
if [[ $w == $worker ]]; then
zle -F $fd _async_zle_watcher # Register the ZLE handler.
break
fi
done
fi

I'll try to look into this some more but in the mean time these could provide some clues.

@mjrafferty
Copy link
Author

It is still present on 32548d3, and unfortunately that async_process_results call didn't have an impact.

I'm testing specifically with a call that asynchronously makes a curl request to icanhazip.com, and I did however find out that the async_process_results in my code is usually what catches that specific output even though I mentioned before that I thought it was probably useless. Though when testing this with 1.7.2 I can just as easily remove that line and the callback function will instead update my prompt as expected. With 1.8.0 it seems neither of those work on the first run.

I added debug code to dump this to a log file:

print -r -n - "$out"

I then added a second line to send output to that same log file when the async_process_results function is run. Doing this I was able to confirm that the async job was finished prior to the call to get the results, yet no data was available. I added logging here to confirm that:

zsh-async/async.zsh

Lines 294 to 299 in a9ca051

while zpty -r -t $worker data 2>/dev/null; do
ASYNC_PROCESS_BUFFER[$worker]+=$data
len=${#ASYNC_PROCESS_BUFFER[$worker]}
pos=${ASYNC_PROCESS_BUFFER[$worker][(i)$null]} # Get index of NULL-character (delimiter).
# Keep going until we find a NULL-character.

The while loop executes once, but the $data variable is empty despite the async_job being finished. So it looks like this isn't an issue with executing the callback function as I initially thought, but rather the first command isn't sending back any output from the worker for some reason even though it ran and produced output.

@mjrafferty
Copy link
Author

I did a bit more testing on this by disabling the async_process_results and async_register_callback calls in my code. I then checked the output of the job manually using zpty -r apollo_async_helper and it looks like it is getting partial data. Here is the output:

\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Dload\ \ Upload\ \ \ Total\ \ \ Spent\ \ \ \ Left\ \ Speed$'\n'$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'100\ \ \ \ 15\ \ 100\ \ \ \ 15\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 192\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ 220

That looks to be a part of the stderr output from the curl request, but all of the data that should have been before it is missing. Here is the comparison from the second time the job runs that has all of the data:

__apollo_public_ip_async_runner 0 public_ip\\069.244.159.151\\0default 0.2871527672 \ \ %\ Total\ \ \ \ %\ Received\ %\ Xferd\ \ Average\ Speed\ \ \ Time\ \ \ \ Time\ \ \ \ \ Time\ \ Current$'\n'\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Dload\ \ Upload\ \ \ Total\ \ \ Spent\ \ \ \ Left\ \ Speed$'\n'$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'100\ \ \ \ 15\ \ 100\ \ \ \ 15\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 153\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ 272%

So it seems like something is either truncating the data during just after startup or it just isn't ready for writing yet. I was able to add a 1 second sleep in the async job to delay the output and it works correctly.

Would any of the recent code changes have slowed the startup time for the worker, or is there anything run that flushes the output after startup?

@mafredri
Copy link
Owner

mafredri commented Apr 21, 2020

Thanks for digging @mjrafferty, I can't seem to reproduce this so it helps.

There's only one change I can think of that would affect this behavior, I've reverted it in this branch if you'd like to test it: https://github.com/mafredri/zsh-async/tree/issue-40

Normally I would also have suspected async_job $worker _unset_trap which was removed in 361dc17 but you said it happens even without that change.

We could also try to add a simple print statement here (prints a newline), or print -n $'\0' if that causes problems:

My thinking is that it might make a difference if we prime the zpty fd by outputting something from the worker. We shouldn't have to, but zpty's are a bit fidgety 😅.

@mjrafferty
Copy link
Author

Unfortunately neither of those fixes the issue. I think it's kind of a perfect storm of timing. I think the zpty processes the job before it's even finished setting itself up. The fact that it's a slow raspberry pi combined with the worker inheriting enough extra junk from my prompt slows it down just enough to cause an issue, since even simplifying my prompt to just this one module fixes it. I found that instead of adding a print statement in the location you mentioned, adding a sleep 0.1 is enough to resolve the issue. I guess it slows down worker execution just enough for it to finish setting up.

In terms of the changes made here since 1.7.2, I reverted this line:

_async_send_job $0 $worker "$cmd"

to the following:

zpty -w $worker "$cmd"$'\0'

This also resolves the issue without the sleep being used. So this narrowed it down to the _async_send_job function itself. Removing the zpty -t block from the function resolves this as well. So it seems that being added is what caused the regression.

I don't particularly understand why that's causing the truncated output, but at least we know what change prompted the issue now.

1 similar comment
@mjrafferty
Copy link
Author

Unfortunately neither of those fixes the issue. I think it's kind of a perfect storm of timing. I think the zpty processes the job before it's even finished setting itself up. The fact that it's a slow raspberry pi combined with the worker inheriting enough extra junk from my prompt slows it down just enough to cause an issue, since even simplifying my prompt to just this one module fixes it. I found that instead of adding a print statement in the location you mentioned, adding a sleep 0.1 is enough to resolve the issue. I guess it slows down worker execution just enough for it to finish setting up.

In terms of the changes made here since 1.7.2, I reverted this line:

_async_send_job $0 $worker "$cmd"

to the following:

zpty -w $worker "$cmd"$'\0'

This also resolves the issue without the sleep being used. So this narrowed it down to the _async_send_job function itself. Removing the zpty -t block from the function resolves this as well. So it seems that being added is what caused the regression.

I don't particularly understand why that's causing the truncated output, but at least we know what change prompted the issue now.

@mafredri
Copy link
Owner

Interesting find! This is definitely not the first time we've had to deal with racy behavior of zptys and it's hard to figure out why they happen. The zpty implementation is kind of a hack, unfortunately.

I reverted the corruption check and pushed two more commits to the issue-40 branch. I wonder if they make any difference? I'm guessing they wont but one can hope.

There's also a test in the async test suite that can trigger racy behavior:

zsh-async/async_test.zsh

Lines 129 to 152 in a9ca051

test_async_process_results_stress() {
# NOTE: This stress test does not always pass properly on older versions of
# zsh, sometimes writing to zpty can hang and other times reading can hang,
# etc.
local -a r
cb() { r+=("$@") }
async_start_worker test
t_defer async_stop_worker test
integer iter=40 timeout=5
for i in {1..$iter}; do
async_job test "print -n $i"
done
float start=$EPOCHSECONDS
while (( $#r / 6 < iter )); do
async_process_results test cb
(( EPOCHSECONDS - start > timeout )) && {
t_log "timed out after ${timeout}s"
t_fatal "wanted $iter results, got $(( $#r / 6 ))"
}
done

When iter is bumped to 100 or more, commands start failing and sometimes the worker sees them entirely as missing or overlapping. Think print 80, print 81, prprint 83, etc. It's possible these two issues are related but I've spend a lot of time debugging that as well, to no avail :/.

@mafredri
Copy link
Owner

@mjrafferty I may have a fix for this in #45, if you still have a way to test this, it'd be very much appreciated.

What I'm thinking, specifically, could affect this is dfca4ba. It was a fix to an observation I made where the zpty was being written to from both ends at the same time.

In essence, the first job completes (and prints) while the second job is being sent to the async worker. This resulted in the first job result being erased.

@mafredri mafredri linked a pull request Oct 22, 2020 that will close this issue
@mjrafferty
Copy link
Author

Unfortunately the only device I had this issue on is no longer in use. Based on prior observations though that does sound promising. I can at least confirm that using #45 didn't seem to cause any regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants