Prototype for GAP_Enter/Leave macros #3096

embray · 2018-12-06T16:39:23Z

Prototype for GAP_Enter/Leave macros to bracket use of libgap and stack local GAP objects in code which embeds libgap. See #3089 for discussion.

There are two parts to this:

First, the outer-most GAP_Enter() must set the StackBottom variable for GASMAN, without which objects tracked by GASMAN that are allocated on the stack are properly tracked (see #3089).

Second, the outer-most GAP_Enter() call should set a jump point for longjmps to STATE(ReadJmpError). Code within the GAP kernel may reset this, but we should
set it here in case any unexpected errors occur within the GAP kernel that are
not already handled appropriately by a TRY_IF_NO_ERROR.

For the first issue, we add GAP_EnterStack() and GAP_LeaveStack() macros which implement just the StackBottom handling without any other error handling. We also add a function to gasman.c called _MarkStackBottomBags which just updates the StackBottom variable (this could be implemented for other GC's as well, either as a no-op or whatever else is appropriate). Then the macro called MarkStackBottomBags (same name without underscore) can be used within a function to set StackBottom to somewhere at or near the beginning of that function's stack frame. This uses
GCC's __builtin_frame_address, but support is probably needed for other platforms that don't have this.

The state variable STATE(EnterStackCount) is used to track recursive calls into GAP_EnterStack(). We only want to set StackBottom on the outer-most call. The count is decremented on GAP_LeaveStack(). Some functions are provided for manipulating the counter from the API without directly exposing the GAP state, but I'm not sure if this is necessary or desirable, especially since it means EnterStackCount isn't updated atomically. My hope was to avoid exposing too many GAP internals, but it may be necessary in order to implement these as macros.

For setting the STATE(ReadJmpError) jump buffer we provide a macro called GAP_Error_Setjmp() which is fairly straightforward, except that it needs to be written in such a way that it can be used in concert correctly with GAP_EnterStack(). In particular, if returning to the site of a GAP_Error_Setjmp() call we do not want to accidentally re-increment the EnterStackCount.

Finally, the higher-level GAP_Enter() and GAP_Leave() macros are provided: The latter is just an alias for GAP_LeaveStack(), but the former carefully combines both GAP_Error_Setjmp() and GAP_EnterStack(). Although called like a function, the GAP_Enter() macro expands to a compound statement (necessary for both GAP_EnterStack() and GAP_Error_Setjmp() to work properly). The order of expansion is also deliberate so that this can be used like:

jmp_retval = GAP_Enter();

so that the return value of the setjmp call can be stored in a variable while using GAP_Enter(), and can be checked to see whether an error occurred. However, this requires some care to ensure that the following GAP_EnterStack() doesn't increment the EnterStackCount following a return to this point via a longjmp.

embray · 2018-12-06T16:43:18Z

To add: I developed this prototype against the GAP interface for Sage (which I am still planning to rework as a generic Python interface, so there is nothing particularly specific to Sage here; this should solve similar problems for other interfaces that embed GAP). I mention this though because the Sage/GAP interface tests this out quite extensively, and for that purpose at least this prototype works very well, and solves many problems I was having: In particular segfaults and other errors due to local variables being garbage-collected, and other problems related to error handling. It feels quite stable.

If it helps I can post some additional sample code and/or tests that demonstrate the kinds of problems this solves. As it is, there is certainly also a need for more documentation in this code.

codecov · 2018-12-11T13:36:46Z

Codecov Report

Merging #3096 into master will decrease coverage by <.01%.
The diff coverage is 65.21%.

@@            Coverage Diff             @@
##           master    #3096      +/-   ##
==========================================
- Coverage   84.76%   84.76%   -0.01%     
==========================================
  Files         687      688       +1     
  Lines      335864   335887      +23     
==========================================
+ Hits       284690   284705      +15     
- Misses      51174    51182       +8

Impacted Files	Coverage Δ
src/gasman.h	`92.85% <ø> (ø)`	⬆️
src/gasman.c	`87.52% <100%> (+0.04%)`	⬆️
src/libgap-api.h	`50% <50%> (ø)`
src/libgap-api.c	`25.87% <63.15%> (+5.71%)`	⬆️

src/gasman.c

src/gapstate.h

src/gasman.h

src/libgap-api.h

src/libgap-api.c

src/libgap-api.h

fingolfin

Changes requested as outlined in the previous review.

embray · 2019-01-04T12:41:37Z

Sorry I haven't had time to address the review comments: I will look at them now.

embray · 2019-01-04T15:29:13Z

Cleaned up most of the minor nitpicks, but now doing some larger refactoring surrounding management of the EnterStackCount. I have to go home soon though so I probably won't finish until Monday.

embray · 2019-01-04T16:06:19Z

I think this works, but I want to do more testing later when I have more time.

fingolfin

Thank you, this seems to be shaping up nicely!

src/gasman.c

embray · 2019-01-07T14:52:15Z

Finally finished testing this version of the patch against Sage and it looks good there too.

embray · 2019-01-07T14:54:54Z

The Julia build failed again, but in this case due to a failed

wget https://julialangnightlies-s3.julialang.org/bin/linux/x64/1.1/julia-b55b85cd99-linux64.tar.gz

fingolfin · 2019-01-08T10:32:02Z

The julia issue is resolved on master. It might be a good idea to rebase this PR on the latest master anyway.

fingolfin

Thank you for your continued work on this, the code seems to be shaping up quite well, the patch is small and clean. Still, I have a bunch of comments.

src/gasman.h

fingolfin · 2019-01-08T10:38:50Z

src/libgap-api.c

+}
+
+
+static volatile sig_atomic_t EnterStackCount = 0;


Why does this need to be sig_atomic_t? Which signal handler is calling code modifying this variable? If none, revert to int; otherwise, please add a comment explaining it (can be brief)

None specifically, but my concern here is that one could: E.g. a signal handler installed by GAP which runs and results in an error, which can in turn result in running code that does modify this variable. I could try to concoct a real example, but something like that seems plausible.

GAP never installs such a signal handler, and indeed, calling any kind of GAP code from a signal handler would be very dangerous and is unlikely to work.

The reason I am asking is not that I want to pain the bikeshed; but rather it is that using sig_atomic_t IMHO adds considerable to the complexity of this code. If I or somebody else need to edit that code, I may wonder "but why is it using sig_atomic_t?", and not knowing the answer means that I don't understand the code, and hence better not touch it; or touch it but with a bad feeling that I might be breaking something.

This is why I ask whether it is really necessary, and if so, that a comment be added explaining it. Even if that comment says something like "// It is unclear whether we need sig_atomic_t here, but just out of paranoia, we use it anyway".

That said, I'd still prefer if it was not used; I honestly can't think of any example where it might be used that is not horribly unsafe anyway. So if you can think of any, I'd be most interested to hear about it, even if it is just a rough sketch.

The argument that one "could" need it seems week to me: with that argument, perhaps we should also add a mutex protecting access to EnterStackCount (or require C11 atomics, or something like that)? After all, multiple threads might want to use the GAP API. Sure, it's unlikely that this can work, but one could try it anyway.... ?

I'm not sure on what basis you can assert that it "adds considerable...complexity". Exactly what complexity is it adding? sig_atomic_t is standard POSIX, and all it means is that it's an instruction to the compiler to only manipulate this variable with single instructions. Furthermore, according to the GCC docs, "In practice, you can assume that int is atomic...these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of."

It's really just a guarantee that if this code happens to be interrupted by a signal (with a signal handler that returns) it will leave things in a sane state. For a signal handler that doesn't return--i.e. exits the process or longjumps to somewhere arbitrary) you can't make any further guarantees, but at the very least you can safely increment and decrement it even if a signal occurs. That's all.

at the very least you can safely increment and decrement it even if a signal occurs. That's all.

Just to clarify this: the type sig_atomic_t means that it can be read from atomically and written to atomically. Any operation more complicated than that (such as an increment/decrement) may not be atomic. For example, the operations EnterStackCount = -EnterStackCount; and EnterStackCount++; may not be atomic.

So I actually agree with @fingolfin here: better not make any guarantees about atomicity.

True--in that case it's probably best avoided. I know in Cysignals itself sig_block()/unblock() just set a flag that determines what to do in the actual signal handler should a signal occur. But in the case of libgap we want to leave most details of signal handling up to the embedding application, so it would be tricky to implement something like this in a way that can be sensibly embedded in arbitrary applications. So perhaps it's best to just keep it simple.

Regarding your idea, I'm not saying it can't be done. I just don't understand exactly what you're proposing. Keep in mind these macros should be able to work properly recursively. Though the exact value of EnterStackCount is less important than just tracking whether or not we're in a recursive call.

Obvious question: do we really need to support nesting for these macros? It would certainly simplify things if we could say that nesting is not allowed.

In theory one could write code such that they never have to be nested. For example there are cases where we could refactor a function some_func_that_uses_gap() into some_func_that_uses_gap_outer() and `some_func_that_uses_gap_inner(), where the latter can be called directly in other code that is already bracketed by the Enter/Leave macros.

So technically it's possible. But it does put a higher strain on the user in exactly how they structure their code, and it's ugly (imagine if sig_on and sig_off could not be nested). You would also have to actively prevent the user from nesting them somehow if it did not work (otherwise this would lead to hard to trace bugs). I think it's better to allow nesting. Any complication that arises from that seems better and less probable than not allowing it.

For now, I propose a simple and concrete solution:

In GAP, don't try to make these macros signal-safe. Do you know about other applications using libGAP which also need to handle signals?

In SageMath, wrap GAP_Enter/GAP_Leave in sig_block/sig_unblock calls.

Which is essentially the status quo (we don't have the Sage part yet but that's easy, and I agree with it). So I'm fine with that for lack of a better solution.

Perhaps at most this should be documented, and left to individual users to worry about. E.g. for the GAP-Julia interface something might want to be done, but it would be just as Julia-specific as Cysignals is Python-specific. Other applications may not care as much about handling signals at all.

src/libgap-api.h

src/libgap-api.c

src/libgap-api.h

fingolfin · 2019-01-08T11:24:57Z

tst/testlibgap/common.c

-            if (buffer)
-                printf("%s\n", buffer);
+    ok = GAP_Enter();
+    if (ok) {


Hmmm... I wonder if the generic fallback code for GAP_EnterStack will work in this scenario? After all, we only reference the location of that particular stack variable. But we don't bother walking back to the start of the stack of the current function. So e.g. res could be outside the stack range considered by the GASMAN stack scanner, couldn't it? In which case there'd be a risk of a crash again.

To test this, I'd disable the GNU C version for it, and then change the loop below to perform some allocations and trigger garbage collections explicitly (we may need to expose a GAP_GC(int full) function for that; it would also be useful for some other uses, I believe).

Anyway, the "fix" in calling code would be to move all local variables storing GAP object references to inside the block between GAP_Enter and GAP_Leave (and to explicitly document this, too).

Yes, you might be right. Do you even know off the top of your head (I'm not sure) if there is any guarantee whatsoever that there is any relation between the order some local variables are declared in C code, and what order they are placed on the stack (of course it could use registers for some of them too in which case it's irrelevant)? AFAIK the C standard doesn't even know anything about a "stack" as that's a platform-specific implementation detail. But realistically speaking this discussion is only relevant where there is one.

So I think there's only so much we can do about this--it's just a best guess but not fool-proof.

I think we already talked about exposing CollectBags in #3030.

It's not exactly clear to me what you want me to do here though. Just manually test this case, or actually update the test to test this case in general somehow (e.g. have a copy of these tests that are compiled in such a way that forces the fallback)?

The C standard definately puts no requirements on the order of variables on the stack, and will mix them up however it likes. If you are capturing a reference to a local variable in your fallback, the old remotely safe option would be to call GAP_enter, and then call a non-inlined function to actually do the work.

I'd probably just not put in a fallback which captures a reference to a local variable.

To be clear, I'm 100% positive that fallback is going to lead to hard-to-track and horrible crashes, if there are any GAP variables in the current function scope, or in any inlined functions.

Oh, and of course we don't expect you, @embray , to provide fallbacks fo "every imaginable architecture". My stance on that is that it's better to wait for somebody who actually uses those architectures anyway.

Note that in practice most compilers do seem to allocate locals in order of definition on the stack

I have noted this as well, at least with gcc. That's why I thought it was an "acceptable" fallback in the first place.

but taking an address of it and passing it to a non-inline function should avoid that

It does, again at least in the case of gcc (and actually I can't think of any other reasonable thing for any compiler to do in that case: how are you supposed to pass a pointer to a value in a register?) And of course, for locals that the compiler puts in registers this is a moot point (GASMAN handles that case separately with its setjmp hack).

So indeed, it seems best to remove the generic fallback, and wait for people to complain that their favorite compiler does not work anymore with it.

+1 I don't like it, but it does seem like the safest option for now, alas.

FWIW Boehm GC does use a similar fallback (they also declare the dummy variable volatile, which I intended to do as well but forgot): https://github.com/ivmai/bdwgc/blob/c50de12ab045b57953e152545836c23f76e7bc24/mark_rts.c#L502
It also implements a GC_get_stack_base() for quite a number of different platforms, that I think is also supposed to work properly on per-thread stcks.

they would also break GAP by itself. If this ever happens, we'll have to go back to the drawing board anyway

I think that may be the case anyways. Is there really an advantage to continuing to maintain a separate GC when something more heavily used and robust like Boehm can be used? Or at least, maybe, a customized fork thereof if there's really some fine-tuning necessary?

gasman is much faster than Boehm, and various people's attempt to tune it haven't closed the gap.

It might be worth at least borrowing bits from, such as those linked above, where applicable (and assuming license compatibility). It would be great, longer-term, to get rid of this EnterStack/LeaveStack stuff entirely.

coveralls · 2019-01-08T13:55:06Z

Coverage increased (+0.01%) to 82.862% when pulling e19c181 on embray:libgap/enter-leave into ea00cc2 on gap-system:master.

coveralls · 2019-01-08T13:55:07Z

Coverage decreased (-0.005%) to 84.638% when pulling fef88b4 on embray:libgap/enter-leave into 9d485d6 on gap-system:master.

embray · 2019-01-10T17:29:54Z

The one failed Travis build appears to be a fluke:

+WGET='wget -N --no-check-certificate --tries=5 --waitretry=5 --retry-connrefused'
++uname
+[[ Linux == Darwin ]]
+make bootstrap-pkg-full 'WGET=wget -N --no-check-certificate --tries=5 --waitretry=5 --retry-connrefused'
--2019-01-08 16:44:23--  https://www.gap-system.org/pub/gap/gap4pkgs/packages-master.tar.gz
Resolving www.gap-system.org (www.gap-system.org)... 138.251.206.223
Connecting to www.gap-system.org (www.gap-system.org)|138.251.206.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330711621 (315M) [application/gzip]
Saving to: ‘packages-master.tar.gz’
packages-master.tar  97%[==================> ] 307.30M   336KB/s    eta 48s    
The job exceeded the maximum time limit for jobs, and has been terminated.

Not sure if a coverage decrease of -0.9% is real or not.

ChrisJefferson · 2019-01-18T09:35:41Z

I'm sorry to ask, but you could rebase on top of master? We've had some churn of tests and coverage, and it would be nice to have a clean check?

If you aren't happy with that, I can do it for you (just because we've messed things around for a while!)

fingolfin · 2019-01-18T11:22:26Z

(I'd also consider squashing it down into a single commit; but that's not a requirement, we can perform that squash when merging it, too)

embray · 2019-01-18T12:48:54Z

Okay, I'll rebase and squash.

embray · 2019-01-18T13:28:51Z

Rebased and squashed; also updated the commit message a bit to more accurately reflect the final product.

src/gasman.c

fingolfin

So from my POV, this could be merged now (ideally, I'd like to rename MarkStackBottomBags, but I can do that after this PR is merged, too).

Would be nice to hear if @ChrisJefferson and @markuspf are OK with this, too. But if I hear no objections from anybody soon, I'll just merge this :)

ChrisJefferson · 2019-01-21T11:44:42Z

I'm happy for this to be merged, we can always cycle more later

embray · 2019-01-21T16:57:22Z

Why don't I just go ahead and rename MarkStackBottomBags. I don't think it's that important, but I also understand the rationale so if you want to change it might as well do it now before merging.

…bjects in code which embeds libgap There are two parts to this: First, the outer-most GAP_Enter() must set the StackBottom variable for GASMAN, without which objects tracked by GASMAN that are allocated on the stack are properly tracked (see gap-system#3089). Second, the outer-most GAP_Enter() call should set a jump point for longjmps to STATE(ReadJmpError). Code within the GAP kernel may reset this, but we should set it here in case any unexpected errors occur within the GAP kernel that are not already handled appropriately by a TRY_IF_NO_ERROR. Or, more precisely, since TRY_IF_NO_ERROR does not currently *clear* the jump buffer it sets even when leaving the block without any errors found, a subsequent error that is not within a TRY_IF_NO_ERROR block can still result in a jump to the last TRY_IF_NO_ERROR block which was entered, which can be quite arbitrary (e.g. inside IntializeGap). For the first issue, we add GAP_EnterStack() and GAP_LeaveStack() macros which implement *just* the StackBottom handling without any other error handling. We also add a function to gasman.c called SetStackBottomBags which just updates the StackBottom variable. Then the GAP_EnterStack() macro can be used within a function to set StackBottom to somewhere at or near the beginning of that function's stack frame. This uses GCC's __builtin_frame_address, but supported is probably needed for other platforms that don't have this. The global variable EnterStackCount in libgap-api.c is used to track recursive calls into GAP_EnterStack(). We only want to set StackBottom on the outer-most call. The count is decremented on GAP_LeaveStack(). For setting the STATE(ReadJmpError) jump buffer we provide a macro called GAP_Error_Setjmp() which is fairly straightforward, except that it needs to be written in such a way that it can be used in concert correctly with GAP_EnterStack(). In particular, if returning to the site of a GAP_Error_Setjmp() call we do not want to accidentally re-increment the EnterStackCount. Finally, the higher-level GAP_Enter() and GAP_Leave() macros are provided: The latter is just an alias for GAP_LeaveStack(), but the former carefully combines *both* GAP_Error_Setjmp() and GAP_EnterStack(). Although called like a function, the GAP_Enter() macro expands to a compound statement (necessary for both GAP_EnterStack() and GAP_Error_Setjmp() to work properly). The order of expansion is also deliberate so that this can be used like: jmp_retval = GAP_Enter(); so that the return value of the setjmp call can be stored in a variable while using GAP_Enter(), and can be checked to see whether an error occurred. However, this requires some care to ensure that the following GAP_EnterStack() doesn't increment the EnterStackCount following a return to this point via a longjmp. Also updates the libgap API tests to demonstrate usage of GAP_Enter/Leave(). Note: The whole EnterStackCount manipulation is really really only in service to GASMAN and can be disabled entirely when compiled for a different GC, but it still has some use as a sanity check on GAP_Enter/Leave() so for not it is mostly but not entirely a no-op on GAP built with other GCs.

embray · 2019-01-23T11:46:11Z

Thank you! 🥂

fingolfin · 2019-01-23T19:29:13Z

Thank you

fingolfin · 2019-01-25T22:21:54Z

Backported via PR #3232

embray mentioned this pull request Dec 6, 2018

libgap API: Problems with GASMAN and StackBottom #3089

Open

fingolfin reviewed Dec 13, 2018

View reviewed changes

fingolfin requested changes Dec 19, 2018

View reviewed changes

fingolfin reviewed Jan 7, 2019

View reviewed changes

src/gasman.c Outdated Show resolved Hide resolved

fingolfin reviewed Jan 8, 2019

View reviewed changes

embray force-pushed the libgap/enter-leave branch from b0325e6 to 6005543 Compare January 8, 2019 13:04

embray force-pushed the libgap/enter-leave branch from 51a4ad8 to 4bb3063 Compare January 8, 2019 15:51

embray force-pushed the libgap/enter-leave branch from 4bb3063 to 548ff1a Compare January 18, 2019 13:27

fingolfin reviewed Jan 20, 2019

View reviewed changes

src/gasman.c Outdated Show resolved Hide resolved

fingolfin mentioned this pull request Jan 20, 2019

Julia GC: fix stack marking #3199

Merged

fingolfin approved these changes Jan 21, 2019

View reviewed changes

embray force-pushed the libgap/enter-leave branch from 548ff1a to fef88b4 Compare January 22, 2019 12:36

fingolfin added topic: kernel topic: libgap things related to libgap labels Jan 22, 2019

fingolfin merged commit e868e29 into gap-system:master Jan 22, 2019

embray deleted the libgap/enter-leave branch January 23, 2019 11:45

fingolfin mentioned this pull request Jan 25, 2019

Backport GAP_Enter/Leave macros to stable-4.10 #3232

Merged

fingolfin added the backport-to-4.10-DONE label Jan 25, 2019

olexandr-konovalov added this to the GAP 4.10.1 milestone Feb 23, 2019

olexandr-konovalov added the release notes: added PRs introducing changes that have since been mentioned in the release notes label Feb 23, 2019

nthiery mentioned this pull request Jan 13, 2023

Upgrade to GAP 4.10 sagemath/sage#22626

Closed

Prototype for GAP_Enter/Leave macros #3096

Prototype for GAP_Enter/Leave macros #3096

Conversation

embray commented Dec 6, 2018

embray commented Dec 6, 2018 • edited Loading

codecov bot commented Dec 11, 2018 • edited Loading

Codecov Report

fingolfin left a comment

Choose a reason for hiding this comment

embray commented Jan 4, 2019

embray commented Jan 4, 2019 • edited Loading

embray commented Jan 4, 2019

fingolfin left a comment

Choose a reason for hiding this comment

embray commented Jan 7, 2019

embray commented Jan 7, 2019

fingolfin commented Jan 8, 2019

fingolfin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Jan 8, 2019

coveralls commented Jan 8, 2019 • edited Loading

embray commented Jan 10, 2019

ChrisJefferson commented Jan 18, 2019

fingolfin commented Jan 18, 2019

embray commented Jan 18, 2019

embray commented Jan 18, 2019

fingolfin left a comment

Choose a reason for hiding this comment

ChrisJefferson commented Jan 21, 2019

embray commented Jan 21, 2019

embray commented Jan 23, 2019

fingolfin commented Jan 23, 2019

fingolfin commented Jan 25, 2019

embray commented Dec 6, 2018 •

edited

Loading

codecov bot commented Dec 11, 2018 •

edited

Loading

embray commented Jan 4, 2019 •

edited

Loading

coveralls commented Jan 8, 2019 •

edited

Loading