Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent infinite recursions in echoandcheck and SyWriteandcheck #3102

Merged
merged 1 commit into from
Jan 8, 2019

Conversation

embray
Copy link
Contributor

@embray embray commented Dec 11, 2018

This can occur in cases where the stdio streams (specifically stderr, but it can also happen with just stdout) cannot be written to for some reason, resulting in EIO errors on write() calls to their file descriptors.

The recursions occur because if an error occurs on writing to the stream, ErrorQuit is called which may in turn attempt to write to the same stream. This is similar to, but a somewhat different manifestation of #3028.

I'm not really happy with this fix since it's more of a band-aid; what is really needed is broader refactoring to error handling, as discussed partly in #2487. But that's a longer-term effort. In the meantime, this is needed for libgap, preferably in 4.10.1.

To demonstrate the problem this fixes, one easy way is to just run:

$ ./gap > /dev/full
$ ./gap > /dev/full
Error, Cannot write to file descriptor 1, see 'LastSystemError();'
 in
  Print( " ", btop, "   ", gap, " ", GAPInfo.BuildVersion, " of ", sysdate, "\n", " ", vert, "  GAP  ", vert,
 "   https://www.gap-system.org\n", " ", bbot, "   Architecture: ", GAPInfo.Architecture, "\n"
 ); at /home/embray/src/gap-system/gap/lib/init.g:510 called from
ShowKernelInformation(  ); at /home/embray/src/gap-system/gap/lib/init.g:565 called from
func(  ); at /home/embray/src/gap-system/gap/lib/system.g:215 called from
<function "CallAndInstallPostRestore">( <arguments> )
 called from read-eval loop at /home/embray/src/gap-system/gap/lib/init.g:567
Error, Cannot write to file descriptor 1, see 'LastSystemError();'

< ... some more errors ... >

<function "SESSION">( <arguments> )
 called from read-eval loop at *stdin*:1
Syntax error: ; expected

Segmentation fault (core dumped)

With the patch, the same example will result in a few error messages as well, but then wait at the prompt (which just isn't displayed).

This is just one example. A more realistic use case is one where GAP is started as a subprocess with stdout and stderr redirected to pipes, and then one of those pipes is closed.

@ChrisJefferson
Copy link
Contributor

While I agree with the direction of this, I don't think this is the way to fix it. There are two main problems:

  1. this isn't thread safe for hpcgap.

  2. I worry we could longjmp out of this function without decreasing depth.

@ChrisJefferson
Copy link
Contributor

I also wonder what the best thing to do in this case is.. if you can't print error messages, terrible things are happening. While I understand a segfault is bad, I'm not sure what's better. I wouldn't want the error message just being thrown away for example.

@ChrisJefferson
Copy link
Contributor

For example, I'd we committed this, we could end up silently dropping writes to files, without any feedback.

@embray
Copy link
Contributor Author

embray commented Dec 12, 2018

I had the same concerns, but this patch has been applied, in some form or other, in Sage's GAP package for quite a long time without any concerns. That was before multi-threaded GAP though: Perhaps it should instead it should use a thread-local variable for this in one of the state slots: Something specifically for detecting recursion into I/O functions while trying to handle error I/O...

if you can't print error messages, terrible things are happening.

That's an exaggeration. Perfectly normal things may be happening and I gave such an example: Where stderr is replaced with a pipe and one end of the pipe is closed. This should not cause a segfault.

While I understand a segfault is bad, I'm not sure what's better. I wouldn't want the error message just being thrown away for example.

In most cases where this might happen it would be during process shutdown anyways. This might even happen if GAP would have otherwise exited with a normal exit code. Instead it results in the GAP process hanging for a long time in a loop until it segfaults. I think it would be better in this case to just exit silently.

@embray
Copy link
Contributor Author

embray commented Dec 12, 2018

Out of curiosity I looked into how Python handles this sort of case in the roughly equivalent parts of its code, and it mostly just ignores such write errors and doesn't make a big deal out of them. It will still work as best it can. Even if you run the Python REPL like python > /dev/full it will mostly work.

Typing input still echos characters if readline is enabled, since it will still enable terminal echo if at least stdin is a TTY. Same on GAP with this patch--at least enough to type quit;.

@embray
Copy link
Contributor Author

embray commented Dec 12, 2018

Also,

I worry we could longjmp out of this function without decreasing depth.

That's not really a problem, since that only occurs if there is an error writing to the error stream; you would basically never recover in this case and it would stop trying to go into ErrorQuit where it can't display errors.

In other words, it's try once, and once an error has occurred trying to output to some stream it won't try to raise an error about it again.

Other approaches would be to either not raise an error in this case at all, at least if fid == 3 or maybe if fid == 1, or to somehow handle this from within the error-handling code, but that seems fragile.

@ChrisJefferson
Copy link
Contributor

ChrisJefferson commented Dec 12, 2018

I wouldn't worry if we exited cleanly, instead of segfaulting, but I worry about hiding errors. What would Sage like? We could exit cleanly, or refuse to do any more work, but I'm worried about continuing to compute and produce answers, with error messages never shown to the user.

@ChrisJefferson
Copy link
Contributor

I pressed a bad button. Ignore.

@codecov
Copy link

codecov bot commented Dec 12, 2018

Codecov Report

Merging #3102 into master will decrease coverage by <.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #3102      +/-   ##
==========================================
- Coverage   83.66%   83.66%   -0.01%     
==========================================
  Files         687      687              
  Lines      336685   336689       +4     
==========================================
  Hits       281696   281696              
- Misses      54989    54993       +4
Impacted Files Coverage Δ
src/sysfiles.c 41.03% <0%> (-0.12%) ⬇️
hpcgap/lib/hpc/stdtasks.g 71.77% <0%> (-0.41%) ⬇️
src/stats.c 95.3% <0%> (+0.2%) ⬆️
src/objset.c 85.35% <0%> (+0.22%) ⬆️

@embray
Copy link
Contributor Author

embray commented Dec 12, 2018

I'm worried about continuing to compute and produce answers, with error messages never shown to the user

I wouldn't worry too much about that. Again, we're talking about situations that are not likely to occur for normal users of GAP in an interactive context. There's no reason any process needs to be able to output anything to standard I/O, such as a process that is only writing to some other files. If, for some reason, you can't write to stderr then you just can't write to stderr and that's that.

@ChrisJefferson
Copy link
Contributor

I feel there is a difference in philosophy here -- my personal preference is, in a system where people want to trust the results, if we reach a state where we can't tell the user information (like errors), then the best option is to quit loudly and quickly (we could try sending a useful message to C's stdout/stderr and then call exit/abort rather than a stack overflow of course, or try and recover and redirect errors to stderr if the user has redirected them elsewhere).

What exactly is the need for this in libgap? I realise there is the bigger picture (someone closes stderr), but that doesn't (to me) seem like a super-urgent problem, so I assume this is coming up more often in Sage for some reason?

@ChrisJefferson
Copy link
Contributor

Also, the best fix for GAP (when someone has redirected output) would probably be to force error output back to stderr, and if that disappears then just abort(), but that's probably not what Sage/libgap would want.

@jdemeyer
Copy link

As @embray said, this is mostly meant to support a clean shutdown of a process which is already going to shut down anyway.

In detail: in Sage, we run GAP inside a pseudo-tty (pty). When we shut down that pty, the stdin/stdout/stderr streams are all closed. Whenever that happens, GAP will exit anyway because stdin is closed. Now if GAP tries to write anything to stdout/stderr before actually shutting down, we end up in this infinite error loop.

@jdemeyer
Copy link

What exactly is the need for this in libgap?

This PR is unrelated to libGAP.

@jdemeyer
Copy link

Some history: this bug was first noticed in 2012 (long before anything like libGAP existed!) when upgrading Sage to GAP-4.5.6

There is a long discussion thread in the Sage tracker starting at https://trac.sagemath.org/ticket/13211#comment:58 for why this patch was needed.

Supposedly, this bug was already reported as http://tracker.gap-system.org/issues/125 (but that link no longer seems to work)

@embray
Copy link
Contributor Author

embray commented Dec 19, 2018

I haven't gotten around to it yet but I planned to today: I have a simpler version of this patch that will simply not go into the default error-handling code if an error occurs in writing to stdout or stderr since that will obviously result in an unwanted recursion.

Instead I believe it should just fail silently, though there is no "right" answer there. But it would help to get out of the mindset of GAP being a command-line REPL first, but rather have low-level functionality not assume interactivity, but build interactivity on top of lower-level functionality. The problem here is that low-level I/O code can jump directly into an interactive error-handling break loop.

For the sake of backwards-compatibility we should keep that behavior for now until we've had time to do deeper refactoring. But an explicit case to prevent infinite recursion is still needed here. Cases where stdout/stderr can't be written to were likely never interactive in the first place.

…where the

default stdio output streams cannot be written to
@embray embray force-pushed the io/echoandcheck-recursion branch from 9042d29 to 56157cb Compare December 28, 2018 16:46
@embray
Copy link
Contributor Author

embray commented Dec 28, 2018

Here's a simplified alternate fix to this, which does not explicitly rely on any recursion detection, and just focuses narrowly on the affected use cases where standard I/O cannot be successfully written to.

With this fix the example from the issue description works like:

$ ./gap > /dev/full
Panic in src/sysfiles.c:1558: Could not write to stdout/stderr.
$ echo $?
1

The fact that it causes a "panic" is fine for the time being, since the point is to cause the program to exit cleanly. Panic() as currently implemented is a problem from the perspective of the libgap API since it forces the program to exit, but right now the use-case this issue is concerned about is just spawning gap as a subprocess.

If stderr is also redirected to a broken file, then the Panic message won't be displayed either, but that's sort of a given:

$ ./gap > /dev/full 2>&1
$ echo $?
1

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 82.827% when pulling 56157cb on embray:io/echoandcheck-recursion into 68f012b on gap-system:master.

@ChrisJefferson
Copy link
Contributor

I think this is fine for merging, the x is just to coveralls confusion.

@embray
Copy link
Contributor Author

embray commented Jan 8, 2019

@ChrisJefferson Can you please add backport-4.10 to this as well?

@ChrisJefferson ChrisJefferson merged commit 48b871b into gap-system:master Jan 8, 2019
@embray embray deleted the io/echoandcheck-recursion branch January 10, 2019 13:47
@fingolfin
Copy link
Member

Backported to stable-4.10 in aef2214

@jdemeyer
Copy link

I'm not convinced by this patch because there are still plenty of ways how error handling could run into an infinite loop (given that ErrorInner is actually a GAP function which is customizable and can generate an error in plenty of ways). I prefer the original version with the explicit check for recursion.

@gap-system gap-system deleted a comment from codecov bot Feb 14, 2019
@fingolfin
Copy link
Member

@jdemeyer I have no idea what the original version of this PR looked like, and I don't think there is a way to let GitHub show it to me. So it'd be best if you opened a fresh PR (with a references to this one) for further discussion. Or, if that's non-trivial work, at least open a new issue (this PR is closed and most people won't see a discussion here), with a link to some version of the patch you are talking about.

@ChrisJefferson
Copy link
Contributor

@jdemeyer : I don't mind looking at different ones to this, but I am fundementally opposed to any patch which would discard errors, if that is the PR you want to bring back (of course, I don't have solo veto power, but I really don't like it).

@embray
Copy link
Contributor Author

embray commented Feb 21, 2019

@jdemeyer It is well known that there are lots of ways you can get infinite loops in GAP's error handling. It's also been discussed on other issues that what is really needed here is a more thorough overhaul of GAP's error handling (for which I have some ideas, as I'm sure do you).

The reason we didn't use the original version of this patch is precisely because even it was misleading in the extent to which it purposed to fix infinite loops in GAP's error handling (it doesn't). It also wasn't thread safe in any way.

@olexandr-konovalov olexandr-konovalov added this to the GAP 4.10.1 milestone Feb 23, 2019
@olexandr-konovalov olexandr-konovalov changed the title prevent infinite recursions in echoandcheck SyWriteandcheck prevent infinite recursions in echoandcheck and SyWriteandcheck Feb 23, 2019
@olexandr-konovalov olexandr-konovalov added release notes: added PRs introducing changes that have since been mentioned in the release notes topic: libgap things related to libgap labels Feb 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-4.10-DONE release notes: added PRs introducing changes that have since been mentioned in the release notes topic: libgap things related to libgap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants