Callback function not executed on first jobs #40

mjrafferty · 2020-04-15T17:11:24Z

First I want to thank you for sharing this project, it's a great time saver for creating tools that run asynchronous jobs. I'm using this library in a prompt framework I created:

https://github.com/mjrafferty/apollo-zsh-theme

I'm seeing an issue (though only on my raspberry pi) after the changes in #37 where the callback function isn't being executed on the first prompt load after starting the async worker. Rolling back to ASYNC_VERSION 1.7.2 resolves the issue. I added some debug logging to the _async_job function, and the jobs are still being run and returning good output, but the callback function isn't getting run. The next time the async jobs are run the callback function returns output just fine.

Since I'm only able to reproduce this on a raspberry pi, my guess is something is just lagging a little behind in registering the callback, but adding various sleeps to the code didn't seem to resolve the issue. Are you able to shed some light on how the changes made in #37 might impact that?

mafredri · 2020-04-15T18:12:18Z

Hi @mjrafferty, thanks for reporting.

Could you explain how async is being used in the apollo theme? I took a quick look at the code but things are split up and I have a hard time building a mental model of what order things are happening in.

Generally I'd recommend registering the callback as soon as the async worker is started, that way you shouldn't miss out on any messages. Also, with notify mode (-n) and a registered callback, there's no need to manually invoke async_process_results, I saw it being used here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_precmd#L29-L30

mafredri · 2020-04-15T18:13:28Z

Btw, what version of Zsh are you using on the rpi?

mjrafferty · 2020-04-15T18:51:24Z

I'm using zsh 5.8 on the pi, though It works just fine on my mac which is also using 5.8. The async_process_results call you see there is likely mostly pointless. I start the async jobs prior to that during the __apollo_run_prompt function a few lines above that. At that point in the code I don't have a callback function set, as that gets unregistered in preexec and then registered again right there in precmd. The intended purpose of the async_process_results call there is to process anything that might be finished before drawing the prompt and without interrupting precmd execution before it's ready for the output.

The async library is loaded and the worker started here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_load_utilities#L112

And the jobs are initiated during precmd before registereing the callback function here:

https://github.com/mjrafferty/apollo-zsh-theme/blob/bf8e12251df149bb10f3962b09d40bdcc471f2b9/functions/__apollo_get_module_text#L42

mafredri · 2020-04-16T14:38:06Z

Hmm, that's interesting.

I have two suggestions:

Could you test if it's reproducible with only 32548d3 which excludes the two latter commits that are part of 1.8.0?
On 1.8.0 would the following help:

Adding:

async_process_results $worker $ASYNC_CALLBACKS[$worker] watcher

After this if-statement:

zsh-async/async.zsh

Lines 463 to 473 in a9ca051

    
           if [[ ! -o interactive ]] || [[ ! -o zle ]]; then 
        
           	trap '_async_notify_trap' WINCH 
        
           elif [[ -o interactive ]] && [[ -o zle ]]; then 
        
           	local fd w 
        
           	for fd w in ${(@kv)ASYNC_PTYS}; do 
        
           		if [[ $w == $worker ]]; then 
        
           			zle -F $fd _async_zle_watcher  # Register the ZLE handler. 
        
           			break 
        
           		fi 
        
           	done 
        
           fi

I'll try to look into this some more but in the mean time these could provide some clues.

mjrafferty · 2020-04-16T19:18:50Z

It is still present on 32548d3, and unfortunately that async_process_results call didn't have an impact.

I'm testing specifically with a call that asynchronously makes a curl request to icanhazip.com, and I did however find out that the async_process_results in my code is usually what catches that specific output even though I mentioned before that I thought it was probably useless. Though when testing this with 1.7.2 I can just as easily remove that line and the callback function will instead update my prompt as expected. With 1.8.0 it seems neither of those work on the first run.

I added debug code to dump this to a log file:

zsh-async/async.zsh

Line 60 in a9ca051

print -r -n - "$out"

I then added a second line to send output to that same log file when the async_process_results function is run. Doing this I was able to confirm that the async job was finished prior to the call to get the results, yet no data was available. I added logging here to confirm that:

zsh-async/async.zsh

Lines 294 to 299 in a9ca051

    
           while zpty -r -t $worker data 2>/dev/null; do 
        
           	ASYNC_PROCESS_BUFFER[$worker]+=$data 
        
           	len=${#ASYNC_PROCESS_BUFFER[$worker]} 
        
           	pos=${ASYNC_PROCESS_BUFFER[$worker][(i)$null]}  # Get index of NULL-character (delimiter). 
        
           	# Keep going until we find a NULL-character.

The while loop executes once, but the $data variable is empty despite the async_job being finished. So it looks like this isn't an issue with executing the callback function as I initially thought, but rather the first command isn't sending back any output from the worker for some reason even though it ran and produced output.

mjrafferty · 2020-04-18T19:52:22Z

I did a bit more testing on this by disabling the async_process_results and async_register_callback calls in my code. I then checked the output of the job manually using zpty -r apollo_async_helper and it looks like it is getting partial data. Here is the output:

\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Dload\ \ Upload\ \ \ Total\ \ \ Spent\ \ \ \ Left\ \ Speed$'\n'$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'100\ \ \ \ 15\ \ 100\ \ \ \ 15\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 192\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ 220

That looks to be a part of the stderr output from the curl request, but all of the data that should have been before it is missing. Here is the comparison from the second time the job runs that has all of the data:

__apollo_public_ip_async_runner 0 public_ip\\069.244.159.151\\0default 0.2871527672 \ \ %\ Total\ \ \ \ %\ Received\ %\ Xferd\ \ Average\ Speed\ \ \ Time\ \ \ \ Time\ \ \ \ \ Time\ \ Current$'\n'\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Dload\ \ Upload\ \ \ Total\ \ \ Spent\ \ \ \ Left\ \ Speed$'\n'$'\r'\ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ \ \ 0\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ \ \ 0$'\r'100\ \ \ \ 15\ \ 100\ \ \ \ 15\ \ \ \ 0\ \ \ \ \ 0\ \ \ \ 153\ \ \ \ \ \ 0\ --:--:--\ --:--:--\ --:--:--\ \ \ 272%

So it seems like something is either truncating the data during just after startup or it just isn't ready for writing yet. I was able to add a 1 second sleep in the async job to delay the output and it works correctly.

Would any of the recent code changes have slowed the startup time for the worker, or is there anything run that flushes the output after startup?

mafredri · 2020-04-21T08:52:52Z

Thanks for digging @mjrafferty, I can't seem to reproduce this so it helps.

There's only one change I can think of that would affect this behavior, I've reverted it in this branch if you'd like to test it: https://github.com/mafredri/zsh-async/tree/issue-40

Normally I would also have suspected async_job $worker _unset_trap which was removed in 361dc17 but you said it happens even without that change.

We could also try to add a simple print statement here (prints a newline), or print -n $'\0' if that causes problems:

zsh-async/async.zsh

Line 178 in a9ca051

My thinking is that it might make a difference if we prime the zpty fd by outputting something from the worker. We shouldn't have to, but zpty's are a bit fidgety 😅.

mjrafferty · 2020-04-21T15:37:53Z

Unfortunately neither of those fixes the issue. I think it's kind of a perfect storm of timing. I think the zpty processes the job before it's even finished setting itself up. The fact that it's a slow raspberry pi combined with the worker inheriting enough extra junk from my prompt slows it down just enough to cause an issue, since even simplifying my prompt to just this one module fixes it. I found that instead of adding a print statement in the location you mentioned, adding a sleep 0.1 is enough to resolve the issue. I guess it slows down worker execution just enough for it to finish setting up.

In terms of the changes made here since 1.7.2, I reverted this line:

zsh-async/async.zsh

Line 409 in a9ca051

_async_send_job $0 $worker "$cmd"

to the following:

zpty -w $worker "$cmd"$'\0'

This also resolves the issue without the sleep being used. So this narrowed it down to the _async_send_job function itself. Removing the zpty -t block from the function resolves this as well. So it seems that being added is what caused the regression.

I don't particularly understand why that's causing the truncated output, but at least we know what change prompted the issue now.

mjrafferty · 2020-04-21T15:45:23Z

Unfortunately neither of those fixes the issue. I think it's kind of a perfect storm of timing. I think the zpty processes the job before it's even finished setting itself up. The fact that it's a slow raspberry pi combined with the worker inheriting enough extra junk from my prompt slows it down just enough to cause an issue, since even simplifying my prompt to just this one module fixes it. I found that instead of adding a print statement in the location you mentioned, adding a sleep 0.1 is enough to resolve the issue. I guess it slows down worker execution just enough for it to finish setting up.

In terms of the changes made here since 1.7.2, I reverted this line:

zsh-async/async.zsh

Line 409 in a9ca051

_async_send_job $0 $worker "$cmd"

to the following:

zpty -w $worker "$cmd"$'\0'

This also resolves the issue without the sleep being used. So this narrowed it down to the _async_send_job function itself. Removing the zpty -t block from the function resolves this as well. So it seems that being added is what caused the regression.

I don't particularly understand why that's causing the truncated output, but at least we know what change prompted the issue now.

mafredri · 2020-04-21T19:13:36Z

Interesting find! This is definitely not the first time we've had to deal with racy behavior of zptys and it's hard to figure out why they happen. The zpty implementation is kind of a hack, unfortunately.

I reverted the corruption check and pushed two more commits to the issue-40 branch. I wonder if they make any difference? I'm guessing they wont but one can hope.

There's also a test in the async test suite that can trigger racy behavior:

zsh-async/async_test.zsh

Lines 129 to 152 in a9ca051

    
           test_async_process_results_stress() { 
        
           	# NOTE: This stress test does not always pass properly on older versions of 
        
           	# zsh, sometimes writing to zpty can hang and other times reading can hang, 
        
           	# etc. 
        
           	local -a r 
        
           	cb() { r+=("$@") } 
        
           	async_start_worker test 
        
           	t_defer async_stop_worker test 
        
           	integer iter=40 timeout=5 
        
           	for i in {1..$iter}; do 
        
           		async_job test "print -n $i" 
        
           	done 
        
           	float start=$EPOCHSECONDS 
        
           	while (( $#r / 6 < iter )); do 
        
           		async_process_results test cb 
        
           		(( EPOCHSECONDS - start > timeout )) && { 
        
           			t_log "timed out after ${timeout}s" 
        
           			t_fatal "wanted $iter results, got $(( $#r / 6 ))" 
        
           		} 
        
           	done

When iter is bumped to 100 or more, commands start failing and sometimes the worker sees them entirely as missing or overlapping. Think print 80, print 81, prprint 83, etc. It's possible these two issues are related but I've spend a lot of time debugging that as well, to no avail :/.

mafredri · 2020-10-22T16:55:09Z

@mjrafferty I may have a fix for this in #45, if you still have a way to test this, it'd be very much appreciated.

What I'm thinking, specifically, could affect this is dfca4ba. It was a fix to an observation I made where the zpty was being written to from both ends at the same time.

In essence, the first job completes (and prints) while the second job is being sent to the async worker. This resulted in the first job result being erased.

mjrafferty · 2020-10-22T18:04:21Z

Unfortunately the only device I had this issue on is no longer in use. Based on prior observations though that does sound promising. I can at least confirm that using #45 didn't seem to cause any regressions.

mafredri linked a pull request Oct 22, 2020 that will close this issue

Fixes for zsh/zpty fd message exchange #45

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Callback function not executed on first jobs #40

Callback function not executed on first jobs #40

mjrafferty commented Apr 15, 2020

mafredri commented Apr 15, 2020

mafredri commented Apr 15, 2020

mjrafferty commented Apr 15, 2020

mafredri commented Apr 16, 2020 •

edited

Loading

mjrafferty commented Apr 16, 2020

mjrafferty commented Apr 18, 2020

mafredri commented Apr 21, 2020 •

edited

Loading

mjrafferty commented Apr 21, 2020

mjrafferty commented Apr 21, 2020

mafredri commented Apr 21, 2020

mafredri commented Oct 22, 2020

mjrafferty commented Oct 22, 2020

Callback function not executed on first jobs #40

Callback function not executed on first jobs #40

Comments

mjrafferty commented Apr 15, 2020

mafredri commented Apr 15, 2020

mafredri commented Apr 15, 2020

mjrafferty commented Apr 15, 2020

mafredri commented Apr 16, 2020 • edited Loading

mjrafferty commented Apr 16, 2020

mjrafferty commented Apr 18, 2020

mafredri commented Apr 21, 2020 • edited Loading

mjrafferty commented Apr 21, 2020

mjrafferty commented Apr 21, 2020

mafredri commented Apr 21, 2020

mafredri commented Oct 22, 2020

mjrafferty commented Oct 22, 2020

mafredri commented Apr 16, 2020 •

edited

Loading

mafredri commented Apr 21, 2020 •

edited

Loading