
Fix off-by-one in NextFreeFileDescriptor #202

Closed
okayzed opened this issue Oct 21, 2018 · 9 comments

@okayzed
Contributor

okayzed commented Oct 21, 2018

https://github.com/oilshell/oil/blob/master/core/process.py#L54

If I'm understanding this code correctly, the intended behavior is: loop over file descriptors until we find one that raises an error when we try to access it, then return that file descriptor.

The actual behavior: it returns that file descriptor + 1.

This could become a problem if we open two file descriptors and then close the first one. The implementation would find the first one free, return the second one instead, and then dup over the second one, which is still in use.
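To make the off-by-one concrete, here is a hypothetical reconstruction of the search pattern described above. It is not the actual code at the linked line; the name `next_free_fd_buggy` and its `start` parameter are made up for illustration. It probes each descriptor with `fcntl(fd, F_GETFD)`, which fails with `EBADF` when the descriptor is not open, but bumps the counter once more after finding the free slot:

```python
import errno
import fcntl
import os


def next_free_fd_buggy(start):
    """Hypothetical counter-based search that has the off-by-one from this issue."""
    fd = start
    done = False
    while not done:
        try:
            # F_GETFD succeeds on an open descriptor.
            fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError as e:
            if e.errno != errno.EBADF:
                raise
            done = True  # fd is free...
        fd += 1          # ...but the counter is incremented anyway
    return fd            # so this is one past the free descriptor


# The scenario above: open two descriptors, then close the first one.
r, w = os.pipe()
os.close(r)
print(next_free_fd_buggy(r) == r + 1)  # True: r is free, but r + 1 is returned
```

Pipe descriptors are usually, though not necessarily, adjacent, so `r + 1` is often `w`. A subsequent `dup2()` onto the returned descriptor would then silently close the write end that is still in use.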

Feel free to close this if it's not an issue.

@andychu
Contributor

andychu commented Oct 21, 2018

Hm, yes, good point. That explains the comment...

The bug was that other libraries in the process were using file descriptors, so this fix was wrong. But I fixed it another way, by not using those libraries (import random). But yes, this function should probably be fixed just in case.

@andychu
Contributor

andychu commented Oct 21, 2018

And that was the "hardest bug" I mentioned in the appendix here:

http://www.oilshell.org/blog/2018/10/11.html

I already hit that bug twice!

@okayzed
Contributor Author

okayzed commented Oct 26, 2018

If you can add a test case (or post one here) that reproduces the file descriptor bug, it will be easier to fix.

@andychu
Contributor

andychu commented Oct 27, 2018

I don't think this issue is observable right now. If nobody else is opening file descriptors in the process, there's no conflict.

But I'll just leave this open since it's probably a good cleanup.

andychu changed the title from "Question on NextFreeFileDescriptor impl." to "Fix off-by-one in NextFreeFileDescriptor" on Oct 27, 2018
@okayzed
Contributor Author

okayzed commented Oct 27, 2018

It might not get fixed properly without a repro or some way of checking that it's fixed, even if this line of code is changed. It's easy enough to change this seemingly wrong line, but it's unknown whether it was causing issues in the first place.

@andychu
Contributor

andychu commented Jan 26, 2019

@okayzed

Oh man, I'm in the middle of debugging a problem here, and the root cause was in this code. The observable behavior was that stdout was somehow closed when running the OSH unit tests under OSH, but not when running them under bash!

I think there are probably other conditions where it could trigger, but OSH inheriting its own file descriptors seems to be one of them.

I made the bug go away by fixing this line, but it also pointed out to me that I'm using the syscalls somewhat wrong.

If you look at man fcntl, it points out that fcntl(..., F_DUPFD, ...) is different from dup2(). I was wondering what the difference was. Basically, the former does the linear search for you: it returns the lowest-numbered free descriptor greater than or equal to its argument. I was doing the search in user space, which is suboptimal (and was wrong).
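A minimal sketch of the `F_DUPFD` approach, assuming the usual shell pattern of saving a descriptor to a high slot (10 or above here, an arbitrary choice), redirecting it, and restoring it later. The helper names `save_and_redirect` and `restore` are hypothetical, and two pipes stand in for stdout so the example has no side effects. Python's `fcntl.fcntl(fd, cmd, int_arg)` returns the integer result of the underlying call, which for `F_DUPFD` is the new descriptor:

```python
import fcntl
import os


def save_and_redirect(target_fd, source_fd, min_fd=10):
    """Point target_fd at source_fd's file; return a saved copy of the old target."""
    # The kernel picks the lowest free descriptor >= min_fd and duplicates
    # target_fd onto it in a single call, so no user-space search is needed.
    saved = fcntl.fcntl(target_fd, fcntl.F_DUPFD, min_fd)
    # dup2 closes target_fd (if open) and makes it refer to source_fd's file.
    os.dup2(source_fd, target_fd)
    return saved


def restore(target_fd, saved):
    """Undo save_and_redirect()."""
    os.dup2(saved, target_fd)
    os.close(saved)


r1, w1 = os.pipe()
r2, w2 = os.pipe()
saved = save_and_redirect(w1, w2)
os.write(w1, b"a")  # goes to the second pipe
restore(w1, saved)
os.write(w1, b"b")  # back to the first pipe
got = (os.read(r2, 1), os.read(r1, 1))
print(got)  # (b'a', b'b')
```

Unlike `dup2()`, `F_DUPFD` never overwrites a descriptor that is already open, which is exactly the failure mode in this issue. A shell usually also wants the saved copy to be close-on-exec so child processes don't inherit it; POSIX provides `F_DUPFD_CLOEXEC` for that.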

Good eye on this one!

In retrospect, it may or may not have made sense to fix this earlier. It would have saved some time, but I probably wouldn't have understood the syscall issue as deeply. Sometimes you just have to bang your head against things to understand them.

andychu pushed a commit that referenced this issue Jan 26, 2019
The symptom was that running 'test/unit.sh all' under OSH failed.  In
core/completion_test.py, we run the line:

  echo "$@" >&2

under test_lib.EvalCode(), and the two OSH interpreters conflicted
somehow, resulting in stdout being permanently closed.  Then any
remaining print() calls failed with EBADF.

I got rid of the self.next_fd counter since it's not correct.  Instead
we do a linear search from fd #10.  Now the tests pass.

However now I realize I should be using fcntl(..., F_DUPFD, ...) and
looking at the return value.  It does the linear search for you in the
kernel!  Looking at dash source code clued me into this.

Addresses issue #202.
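The interim fix described in the commit, a linear search starting at descriptor 10 with no persistent counter, might look roughly like this. This is a sketch, not the committed code; `next_free_fd` and its `start` parameter are illustrative names:

```python
import errno
import fcntl
import os


def next_free_fd(start=10):
    """Return the lowest descriptor >= start that is not currently open."""
    fd = start
    while True:
        try:
            fcntl.fcntl(fd, fcntl.F_GETFD)  # succeeds only if fd is open
        except OSError as e:
            if e.errno != errno.EBADF:
                raise
            return fd  # first free descriptor, returned as-is (no extra +1)
        fd += 1


# A slot freed below other open descriptors is found again,
# because no state is carried over between calls.
r, w = os.pipe()
os.close(r)
print(next_free_fd(r) == r)  # True
```

The result is only a hint: nothing reserves the descriptor until it is actually used, which is one reason the kernel-side `F_DUPFD` search is preferable.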
@okayzed
Contributor Author

okayzed commented Jan 27, 2019

I'm glad something came of this!

Re: fcntl: to clarify for my own understanding, it seems like using dup2 can lead to a race condition (if there is multithreading): you check whether an FD is available, and it gets taken before you use it. fcntl would prevent that situation.

Actually, I'm unsure about the above. If dup2 is what returns the error, that would mean the call to dup2 is what reserves the FD slot.

@andychu
Contributor

andychu commented Jan 27, 2019

If there were two threads doing that at the same time, it would definitely lead to a race condition. But almost all shells are single-threaded, since they date back to the time before threads!

There is one area where I might want to use threads (to make completion responsive against slow user plugins), so that is something to watch out for.

@andychu
Contributor

andychu commented Feb 4, 2019

Closing in favor of #223.

andychu closed this as completed on Feb 4, 2019