-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rlua error handling is completely broken on windows with rustc 1.24+ #71
Comments
So, this can be worked around right now by enabling the "unwind" feature, which requires nightly. If this doesn't get fixed soon, it might be wise to make this a hard cfg!(target_os = "windows") or something, which would just mean rlua is COMPLETELY unusable on windows with stable. |
Dang, I didn't know Windows used SEH to implement setjmp/longjmp. Of course that would unleash #21. |
Yeah, me neither. I'm hoping this is an excuse to get an official word from the rust devs on whether this is meant to be supported or not. I suppose it's leaning towards yes because they even went through the effort of making |
From the README:
I apparently have a pretty bad imagination :P |
I wrote.. possibly the worst hack: dec360f I know this isn't a solution, I just wanted to know if it was even possible. I'm kind of surprised build.rs even lets you get away with this? I mean.. I guess rlua now works again on stable with windows? My god, BUT AT WHAT COST?? |
I'm (obviously) not targeting windows with my usage of rlua. But that's probably not a good solution, if only because it's still a nightly feature and you shouldn't depend upon nightly features since they could change |
I don't really have another good solution other than waiting on the rust devs to respond. I mean, I could start writing a lot of C, but it's not JUST that I don't want to write a lot of C shims, it's also that writing C shims in the direct way where you simply write versions of a lot of functions that could longjmp but instead return error codes just ends up being really terrible. There are quite a lot of places where I "bunch up" several Lua C API calls together that could all longjmp to avoid the overhead of repeatedly calling lua_pcall (I don't do it everywhere, I need actually to do it in more places). I actually think that I would not have even chosen Lua at all if I thought Rust wouldn't be able to call half of its API safely. |
I guess I wish that crater would have caught this problem sooner, I guess currently the rust devs don't do crater runs on windows? What if I added rustc version checking with the This is really starting to bother me because it's looking like a casual rust stable update is causing me problems to the level of like.. rewriting I really wish the |
I think that's correct, but also there never was a crater run for rust-lang/rust#46833 |
So I've been trying to educate myself, reading the source to libpanic_unwind/[seh.rs, seh64_gnu.rs, windows.rs], and trying to understand better how windows SEH works. I didn't really learn anything dramatically new, but it helped cement my understanding of the issue and of the implementation of the rust runtime. Let me summarize the situation at this point, as I understand it (If I have anything wrong PLEASE correct me!): The Lua C API is pretty fundamentally based around setjmp / lonjmp for error handling. It should be clear from the implementation of this crate that I am not exactly a FAN of this, but I don't have much of a choice but to deal with it. On windows, setjmp and longjmp are implemented using SEH, so this means that Lua errors are as well. (I cannot find many specifics about HOW setjmp and longjmp are implemented with SEH, and my understanding of things like _CxxThrowException and __CxxFrameHandler3 are really shaky. If I had a better understanding of these I might be able to make a suggestion to how the rust compiler / runtime could be changed to ignore lonjmp SEH, but I'm not sure that such a change would even be welcome.) It's definitely clear from reading libpanic_unwind that anything invoking SEH is incompatible with rust panic handling, both in principle and in practice. The reason this has worked up to this point is because prior to this, at no point was there a lonjmp in Lua without the corresponding setjmp being "above" it. There are plenty of calls to catch_unwind in rlua, but they always either catch rust panics, or first call lua_pcall (setjmp), which then in turn calls some Lua function that might error (longjmp). The breaking change is the insertion of what is effectively the following code around all extern "C" calls:
which breaks the rules above. Now, any time we call lua_pcall (setjmp) to in turn call some Lua functions that might error (longjmp), in between is inserted an extra call to catch_unwind (effectively) which catches the lonjmp instead of lua_pcall, which then simply aborts. Adding This means that the following Lua C API functions that rlua uses are IMPOSSIBLE to call from rust in 1.24.0, unless you can be sure that they won't error (and in practice, it is often impossible to be sure they won't error):
This is not an exhaustive list of all 'm' or 'e' functions in the Lua C API, only the ones that are in the rlua ffi module. Also, functions marked 'm' can be guaranteed not to error if garbage collection is not currently enabled and allocations never error, and rlua takes advantage of this in a few places. If it turns out that rust does not intend to support calls to longjmp by rust (even indirectly), these functions will need to be wrapped in lua_pcall inside C code, either individually or as part of a set of other functionality. Doing so also means that the gcc crate now becomes a mandatory dependency, because there will need to be custom C code that is built along with the rlua library. Requiring the gcc crate may or may not be a problem for me on consoles, I need to actually check on this again. If it turns out that the requiring the gcc crate is not a problem, I will just start implementing wrappers around these functions in C and moving a lot of rlua internal code to C. Maybe it's less bad than I'm imagining? |
This is going to be awful, if I have to write all the callback entry points in C, I effectively cannot use lua_pushcfunction at all, only lua_pushcclosure with a rust fn pointer as one of the upvalues. |
Okay, it turns out I was COMPLETELY wrong, and not only does the gcc crate work for me in console builds, it has been working this entire time! I even made sure that it worked correctly when I set this up (using CC, AR, other env vars etc), because otherwise ogg-sys and vorbis-sys would have never built correctly. I don't know whether I'm smart because I was able to get all of this working, or really dumb for forgetting that I did. Edit: Also, apparently it's now called the cc crate, not gcc. |
The worst part of this change is actually the fact that, in order to trigger an error, there MUST be a C callback in between Lua and any Rust callback, to call lua_error on behalf of the Rust callback in case of an error. I have had to make my own "Rust callback protocol" that wraps the standard Lua C callback protocol, requiring use of another upvalue and more pointer chasing :( I'm not done yet, and it's not exactly looking very clean so far :( |
|
I mean, I know how to do it it's just infuriating. The function can't return a Result though, because the function must be called from C in order to then trigger a longjmp, because you cannot longjmp across rust ever. Currently I've added two extra constants to describe errors from a "rust-like lua_CFunction":
Edit: This is the same idea that @jonas-schievink had in #21, except 1) you can't use -1 because that's LUA_MULTRET, and you need an extra error code for not having enough room on the stack to put the error. It still sucks because it uses up the first upvalue, and that is not really transparent. |
why can't it return a Result? what stops it from returning a Result? have you considered writing a rust wrapper to convert the Result into a C-friendly alternative? |
I mean, I'm currently writing the wrapper, that's what I'm referring to. The wrapper is the part that's annoying. |
I have 0b76fbd now, but I'm not feeling great about it. One thing I completely forgot about was that Maybe I shouldn't continue this until I hear something back from the rust devs. |
Given that the current workaround already makes rlua work on stable, this might be a good idea until the Rust devs sort out whether longjmp is or isn't UB. |
Should I do an actual point release with that terrible compiler detection / RUSTC_BOOTSTRAP hack, or is that just too gross for an actual release? |
Small side note; this is a long weekend for Mozilla, so a bunch of people won't be back until tomorrow. I'd really, really hate to see the hack make it into a release; most people don't know the hack exists, and we'd prefer to keep it that way: it really, really breaks our whole stability story, so perpetuating it will create a bunch of ecosystem problems. Sorry about this :/ |
Hey @steveklabnik, thanks for the heads up about the long weekend. I actually forgot it was President's Day in the US! About the problem I'm having, do you think I should continue with my effort to hack C wrappers around everything, or should I just leave things until tomorrow when people can take a look at it. If I shouldn't use this hack in a release, I really have four options:
I'm pretty chill about most point release breakages actually, but I wasn't expecting one that required THIS much work (and the results are not gonna be pretty). If it's unlikely that there's going to be some magic resolution, I guess my best answer is 3. It's so much annoying work though that I'm trying to get some kind of signal whether I need to work on it ASAP, or stop because it might not be necessary. Edit: I guess there are other options like, leave the env var hack in master and just have Chucklefish use master for the time being. It really depends on the time frame around Edit 2: I've actually done that already, so it's not like this is an ACTIVE emergency situation or anything, I just feel really uneasy currently and don't like leaving this to sit. |
Also, I completely understand about not wanting to spread the unstable features hack around, I won't make a release like that. Would you even prefer I remove it from the repo entirely? Should I be trying to prevent this from showing up on a google search? |
@kyren as I wrote over on the Rust repo, my opinion is that we should find a solution for you, though longjmp will always require a lot of caution. |
I hear you. I expect we could get people to weigh in tomorrow, or at the latest, core team meeting on Wednesday. I just don't know who's around today; that's why I missed your stuff until now, as I actually took the weekend off, like I should have today, hehe. Kinda unfortuante that you're ahead of (most of) us on timezones, so that ends up being weds or thursday for you anyway. :/ I am gonna guess that 3 is probably the best answer too, for the same reasons. This isn't my area of specialty though, so I could be wrong.
It's cool in the end. I appreciate it, but I don't think that you need to go that far :) |
I'm trying to understand this a bit better -- it seems like the problem is pretty specific to SEH. That is, ordinarily, longjmp wouldn't trigger this panic, precisely because it doesn't unwind in a structured fashion, but rather just directly pops off the frames. In other words, we can't really "catch" a longjmp normally. But I guess that in windows longjmp is "exception-handling aware". That means, if I understand correctly, that ironically SEH is probably the one environment where a longjmp would actually be safe in Rust, since we would run dtors like normal. But precisely because we can catch it, it's also the one environment where longjmp will panic. Maybe I'm missing some subtlety though. @alexcrichton may remember more about the details of SEH. |
I actually live in the US, I just sometimes (really sometimes) keep UK hours and sometimes forget when US holidays are! Since there's no immediate emergency or anything, you guys can address this later (since it IS a holiday and all), mostly I was just going nuts because the #3 solution is so terrible that I was trying to find an excuse to stop working on it :P |
Hehe, I hear you, as I'm quite often not in my "native" timezone myself. Anyway, I will also calm down a bit on this issue too; glad @nikomatsakis was around to weigh in. |
I've written up some thoughts on the rustc bug here, but the gist of it is that |
If we continue the analogy, it would be more fair to say that there previously was this wrapper instead:
...which in your case happened to work by accident, given some unstable internal details of SEH, libunwind, msvc, LLVM etc. So, you call this a "breaking change", but your previous code was buggy, and that bug would only manifest itself once somebody decided to change some internal detail of SEH/msvc/etc, and then it would show up as a random crash (or worse - a security hole) at some customer of yours who happens to run a newer version of some component. Better to see the bug upfront, here and now, than to have to deal with it once it has been rolled out to tons of customers. Right? |
It's not UB unless Rust itself decides that it is UB, which is the entire issue. It's fine if you think that it SHOULD be UB, but think about what this implies... it means that it is effectively impossible for Rust to call MOST of the Lua C API. It's clear from the discussion in the bug report that many of the Rust devs actually don't want this state of affairs, because C APIs like these do exist in the real world. Are you advocating that it SHOULD be UB? I see no reason why it MUST be UB to call into a C function that longjmps, it's not necessarily UB in C, and it's not necessarily UB in C++ either. In fact, in order for it not to be UB in C++ the rules are quite similar to the situation here. In C++ this is UB if there are any... I forget the legalese around it but something like there are types with non-trivial destructors on the stack. This is exactly analogous to what I described as "there are only Copy types on the stack", aka no types that implement Drop. What is and isn't allowed with unsafe rust is still being decided afaict, and this seems like part of the decision process. Edit: also, I never even really claimed it was a bug, if you go back to the top of the bug report it's clear that I'm asking for clarification on what is and isn't allowed. If it's not allowed, fine, I have to do a lot of work and write a lot of gross C wrappers. I only wanted people to know that it caused breakage (who's fault it was is another matter), and what a decision about allowed / unallowed behavior would imply. I really LIKE the feature of aborting on panic on C API boundaries, I'm not advocating for its removal. I'm fine having to mark the function specially, I'm fine with it being a breaking change to need to do so. I'm NOT however terribly fine with Rust being fundamentally unable to interact with an entire class of C APIs (but I would begrudgingly accept it if that was the decision). Edit 2: I have really tried to be as reasonable as is humanly possible. I REALLY didn't want to come off as saying that Rust 100% has a bug, and it's something that needs to be FIXED, because I was fully aware this was not something that was fully specified. When I said "breaking change", I literally meant just that, a change that broke something, I wasn't trying to assign blame to anybody. I'm STILL not assigning blame to anybody, this just feels like a normal part of the process to me. If I failed to get across that I didn't think that this was necessarily a "bug" per se, I apologize. |
Let me phrase this in a different way. All I'm looking for, really, is an answer to the question: which of the following are true?
I'm fine with any answer, but I personally think 3 and especially 4 are very limiting. At least Lua, Ruby, and libpng ALL use setjmp / longjmp error handling, and even extremely limiting conditions on Rust code is vastly better than saying it is not supported at all. If the answer really is 4 though, I will accept that. In that case, however, I may consider attempting to write a Lua / Rust wrapper at all to be sort of a mistake. Edit: Also, for the record @jonas-schievink pointed out this bug quite a while ago (#21), I've just been sitting on it because the solutions for it are so gross. I maybe was a bit too harsh earlier about like.. considering this project a mistake, but I want to make 100% sure of what the position of the Rust team is before finishing writing such a gross and involved bugfix. |
@kyren
What you seem to have discovered (and what I didn't know until now) is that longjmp on the Windows platform seems to interact with SEH in a way that causes longjmp to become an unwind, and thus UB if you do it into Rust land. I'm not part of the Rust team so unfortunately I cannot give a good answer to your very valid question about if or when this will change in the near/far future. |
Look, I was aware of the language in the reference, but it has in big red text at the top that it's a best-effort document. At some point, I have to look at what is logical for the Rust compiler to do and what the Rust compiler actually does, and pick the best solution available to me. It makes sense (to me at least) for Rust functions that are simple C-like functions to act more or less like what you would write in C, and up until recently that's exactly what happened. That being said, since it is not spelled out as something safe to do, I'm not going to get bent out of shape about breakage. If the Rust team came back to me and said what you're saying, which is that you should not do that and we cannot support it, I would accept that answer, even if I was disappointed with the implication for wider Rust / C interop. Judging from their comments however, that doesn't currently seem to be the attitude of the Rust team. I'm really not trying to be the bad guy, I feel (maybe incorrectly) that you're trying to deflect blame towards me for a "bug", rather than just have a discussion about what Rust should / shouldn't do. Rust with this limitation means that there would be huge caveats on the advertisement of C interoperability, and the Rust team seems to acknowledge that. I don't want to put words in either your mouth or the Rust team's mouth, though, so I apologize if I'm misrepresenting you. I really think that there is a solution out there that both preserves the default abort behavior at C API boundaries, AND provides a way to call into C functions that trigger a longjmp based error, both of which seem to be worthy goals. |
@kyren With that said, I'm trying to understand what exactly the issue is here, and if cannot be worked around without having to change the 1.24 rustc. You're talking about having to write a lot of C shims, but so far I have only seen one:
In addition, to avoid abort in the cases where you have Should that not solve all abort issues or am I missing something? |
Okay, that's actually close to the solution to one half of the problem. When Lua calls into Rust, if there's a C wrapper in the way then yeah, the wrapper can lua_error on behalf of Rust. The trick, though, is how does the wrapper get rust_fn? There are two strategies, one strategy is that there are a finite number of rust_fns that might need to be called this way, so in order to keep the simplicity of a function pointer, just make a copy of call_rust for every possible rust_fn. This works, until you run into things like userdata destruction, where the rust function that needs to be called is derived from a user given type. In this case, you need the "full solution", which is something like this:
where RCALL_STACK_ERR and RCALL_ERR are constants equal to something like -2 and -3, so they don't collide with other magic Lua constants that the Rust callback might return (LUA_MULTRET). You have to add the additional special return for stack errors, because in the case that Rust is out of stack space, there is no other way of triggering the error. This solution is not SO bad, but it means that you lose the ability, in some cases, to pass in a lua_CFunction, and instead must pass a C closure. This is not SO SO bad, except for the fact that it's no longer transparent, so if you need to pass upvalues to the inner Rust function, the upvalues all start at 2 and you can have 254 instead of 255 of them, but whatever. Also, lua_pushcclosure can throw a memory error when lua_pushcfunction cannot, but again, whatever you can more or less deal with it with some effort. Okay, that's one half of the problem, which is to make a protocol so that Rust can return errors to Lua via an intermediary. Okay, the second problem is the other direction, where Rust calls into Lua and Lua directly errors:
I'm going to go through a full explanation of the situation, but you might be aware of a lot of this already. So, Lua's official way of protecting against longjmp errors is "lua_pcall". Lua supports calling lua_pcall inside another call to lua_pcall, and it does this via a sort of linked list of jmp_buf types that all live on the stack, where each layer adds a jmp_buf, and then on completion restores the previous jmp_buf. In the case where you call into a Lua function that errors and it is NOT surrounded at any level by lua_pcall, this triggers the function given to lua_atpanic and then an abort. A Lua API that aborted on script errors would be pretty terrible though, so you need to wrap calls into Lua with lua_pcall. Actually, you need to wrap a LOT of calls into Lua with lua_pcall, because Lua can call code via huge numbers of entry points via things like metamethods. Actually, it can call arbitrary code from HUGE NUMBERS of entry points because of the __gc metamethod, which rlua attempts (arguably stupidly) to support. This means any function marked as 'e' or 'm' in the Lua reference manual must be surrounded by lua_pcall. (Side Note: Even if you were okay with script errors being aborts some of the time, as soon as you add lua_pcall in one place, you more or less are stuck protecting ALL calls to erroring Lua functions. The reason for this is because at the top level such an error might be an abort, but if you use lua_pcall at all, inside a callback that error is instead potentially a longjmp over huge numbers of arbitrary Rust frames. In addition to this, Lua scripts themselves can use pcall, which means that even if you use lua_pcall nowhere, you still can accidentally trigger a longjmp if a Lua script calls pcall before calling a callback.) So, the normal way of protecting code calling into the Lua API in C is to push a lua_CFunction to the stack, and then use the same mechanism that protects Lua code to protect C code, lua_pcall. This clearly does not work from Rust, because you need to construct a function pointer to pass to lua_pcall to name what Lua functions you want to call, and this of course violates the no-unwind rule. Lua functions are generally not callable with lua_pcall directly as they don't follow the "lua_CFunction Protocol", so it's designed so that you pass in a C function you write yourself which then calls into the Lua API. Now, I'm actually intrigued by what you said... you mentioned calling setjmp in rust, and handling the error that way. This doesn't work directly because Lua has a whole system for handling the setjmp linked list, and recording the error code and stuff like that, but let's ignore that for a second. I might be wrong, but calling setjmp seems super scary! There are a bunch of like.. rules around how setjmp can be used even in something like C++ to protect against UB, I'm surprised you suggested this. I actually was under the impression that, I'm going to abuse terminology here but that longjmping over a "trivial" Rust stack frame was far less controversial than calling setjmp from Rust. That's not a criticism, I just hadn't really thought of that as a potential use case. My gut reaction is that dealing with a longjmp over Rust would be much less controversial than a longjmp into Rust, but I might very well be wrong! I can't think of a popular C API off the top of my head that requires such a thing, but it might exist? I mention this only because we might be thinking about these things very differently, and that means there's a good chance I'm thinking about it wrong. This IS mostly academic however, because Lua is not written so that you can call setjmp yourself, but it could maybe be modified to be so. Hell, that might even be easier than the actual solution, but I'm under the impression that this would also be UB currently in Rust, so let's assume all that won't work. The problem at this point is that, more or less, it is impossible to call into 'm' and 'e' functions from Rust. You can think of crazy solutions like some kind of function table and some kind of C function for doing arbitrary Lua calls, but all of that is generally more complicated than the simple solution of wrapping every single Lua API call in C with a custom lua_pcall invocation and having them all return error codes. You can do this! I did it, more or less, and the results are, you know, not great. Okay, it's not SO bad, until you remember that whoops, rlua provides a traceback for all errors, so you need to set the traceback function. That's not SO SO bad, you can just write the traceback function in C, until you realize that the traceback function needs to check types by calling into other Rust code. Okay, not SO SO SO bad, you can either export the traceback function as a C function and just have the C wrappers call into that, or you can make every wrapper function just accept a traceback function pointer so you don't have a bidirectional dependency. So, after doing ALL THAT, you can more or less make a safe Rust API to Lua that always manages to dance into and out of C at just the right times to avoid ever triggering a longjmp over Rust frames. Except, a large portion of your project is now in C, it's even MORE complicated and brittle, and you're kinda sad and frustrated about the whole thing. So, at that point I wouldn't blame you if you felt like giving up on the whole thing and going and trying to solve bug #39 instead. |
To be clear, the team is already planning on shipping a 1.24.1 reverting this change in a few days. |
if it's UB, should you be allowed to rely on EITHER behaviour? Why not make it random which functions get an abort and which functions don't, and have it change every time you recompile? Edit: Welp wrong repo. |
@kyren Yeah, maybe calling I do have another thought though. I'm guessing you're shipping your own Lua library on MSVC, and it seems fairly easy to reconfigure LUAI_TRY to do something else than |
can rlua monkeypatch the libc, and replace setjmp/longjmp with something it can keep track of? |
This should be fixed with 5d96ddc, eventually that fix will be removed in favor of the fix for rust-lang/rust#48251, whenever that hits stable. |
See rust-lang/rust#48251 for details.
The text was updated successfully, but these errors were encountered: