Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype for GAP_Enter/Leave macros #3096

Merged
merged 1 commit into from
Jan 22, 2019

Conversation

embray
Copy link
Contributor

@embray embray commented Dec 6, 2018

Prototype for GAP_Enter/Leave macros to bracket use of libgap and stack local GAP objects in code which embeds libgap. See #3089 for discussion.

There are two parts to this:

First, the outer-most GAP_Enter() must set the StackBottom variable for GASMAN, without which objects tracked by GASMAN that are allocated on the stack are properly tracked (see #3089).

Second, the outer-most GAP_Enter() call should set a jump point for longjmps to STATE(ReadJmpError). Code within the GAP kernel may reset this, but we should
set it here in case any unexpected errors occur within the GAP kernel that are
not already handled appropriately by a TRY_IF_NO_ERROR.

For the first issue, we add GAP_EnterStack() and GAP_LeaveStack() macros which implement just the StackBottom handling without any other error handling. We also add a function to gasman.c called _MarkStackBottomBags which just updates the StackBottom variable (this could be implemented for other GC's as well, either as a no-op or whatever else is appropriate). Then the macro called MarkStackBottomBags (same name without underscore) can be used within a function to set StackBottom to somewhere at or near the beginning of that function's stack frame. This uses
GCC's __builtin_frame_address, but support is probably needed for other platforms that don't have this.

The state variable STATE(EnterStackCount) is used to track recursive calls into GAP_EnterStack(). We only want to set StackBottom on the outer-most call. The count is decremented on GAP_LeaveStack(). Some functions are provided for manipulating the counter from the API without directly exposing the GAP state, but I'm not sure if this is necessary or desirable, especially since it means EnterStackCount isn't updated atomically. My hope was to avoid exposing too many GAP internals, but it may be necessary in order to implement these as macros.

For setting the STATE(ReadJmpError) jump buffer we provide a macro called GAP_Error_Setjmp() which is fairly straightforward, except that it needs to be written in such a way that it can be used in concert correctly with GAP_EnterStack(). In particular, if returning to the site of a GAP_Error_Setjmp() call we do not want to accidentally re-increment the EnterStackCount.

Finally, the higher-level GAP_Enter() and GAP_Leave() macros are provided: The latter is just an alias for GAP_LeaveStack(), but the former carefully combines both GAP_Error_Setjmp() and GAP_EnterStack(). Although called like a function, the GAP_Enter() macro expands to a compound statement (necessary for both GAP_EnterStack() and GAP_Error_Setjmp() to work properly). The order of expansion is also deliberate so that this can be used like:

jmp_retval = GAP_Enter();

so that the return value of the setjmp call can be stored in a variable while using GAP_Enter(), and can be checked to see whether an error occurred. However, this requires some care to ensure that the following GAP_EnterStack() doesn't increment the EnterStackCount following a return to this point via a longjmp.

@embray
Copy link
Contributor Author

embray commented Dec 6, 2018

To add: I developed this prototype against the GAP interface for Sage (which I am still planning to rework as a generic Python interface, so there is nothing particularly specific to Sage here; this should solve similar problems for other interfaces that embed GAP). I mention this though because the Sage/GAP interface tests this out quite extensively, and for that purpose at least this prototype works very well, and solves many problems I was having: In particular segfaults and other errors due to local variables being garbage-collected, and other problems related to error handling. It feels quite stable.

If it helps I can post some additional sample code and/or tests that demonstrate the kinds of problems this solves. As it is, there is certainly also a need for more documentation in this code.

@codecov
Copy link

codecov bot commented Dec 11, 2018

Codecov Report

Merging #3096 into master will decrease coverage by <.01%.
The diff coverage is 65.21%.

@@            Coverage Diff             @@
##           master    #3096      +/-   ##
==========================================
- Coverage   84.76%   84.76%   -0.01%     
==========================================
  Files         687      688       +1     
  Lines      335864   335887      +23     
==========================================
+ Hits       284690   284705      +15     
- Misses      51174    51182       +8
Impacted Files Coverage Δ
src/gasman.h 92.85% <ø> (ø) ⬆️
src/gasman.c 87.52% <100%> (+0.04%) ⬆️
src/libgap-api.h 50% <50%> (ø)
src/libgap-api.c 25.87% <63.15%> (+5.71%) ⬆️

src/gasman.c Show resolved Hide resolved
src/gasman.c Outdated Show resolved Hide resolved
src/gapstate.h Outdated Show resolved Hide resolved
src/gasman.h Outdated Show resolved Hide resolved
src/gasman.h Outdated Show resolved Hide resolved
src/libgap-api.h Outdated Show resolved Hide resolved
src/libgap-api.h Outdated Show resolved Hide resolved
src/libgap-api.h Outdated Show resolved Hide resolved
src/libgap-api.c Outdated Show resolved Hide resolved
src/libgap-api.h Outdated Show resolved Hide resolved
Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes requested as outlined in the previous review.

@embray
Copy link
Contributor Author

embray commented Jan 4, 2019

Sorry I haven't had time to address the review comments: I will look at them now.

@embray
Copy link
Contributor Author

embray commented Jan 4, 2019

Cleaned up most of the minor nitpicks, but now doing some larger refactoring surrounding management of the EnterStackCount. I have to go home soon though so I probably won't finish until Monday.

@embray
Copy link
Contributor Author

embray commented Jan 4, 2019

I think this works, but I want to do more testing later when I have more time.

Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this seems to be shaping up nicely!

src/gasman.c Outdated Show resolved Hide resolved
@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

Finally finished testing this version of the patch against Sage and it looks good there too.

@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

The Julia build failed again, but in this case due to a failed

wget https://julialangnightlies-s3.julialang.org/bin/linux/x64/1.1/julia-b55b85cd99-linux64.tar.gz

@fingolfin
Copy link
Member

The julia issue is resolved on master. It might be a good idea to rebase this PR on the latest master anyway.

Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your continued work on this, the code seems to be shaping up quite well, the patch is small and clean. Still, I have a bunch of comments.

src/gasman.h Outdated Show resolved Hide resolved
}


static volatile sig_atomic_t EnterStackCount = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be sig_atomic_t? Which signal handler is calling code modifying this variable? If none, revert to int; otherwise, please add a comment explaining it (can be brief)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None specifically, but my concern here is that one could: E.g. a signal handler installed by GAP which runs and results in an error, which can in turn result in running code that does modify this variable. I could try to concoct a real example, but something like that seems plausible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GAP never installs such a signal handler, and indeed, calling any kind of GAP code from a signal handler would be very dangerous and is unlikely to work.

The reason I am asking is not that I want to pain the bikeshed; but rather it is that using sig_atomic_t IMHO adds considerable to the complexity of this code. If I or somebody else need to edit that code, I may wonder "but why is it using sig_atomic_t?", and not knowing the answer means that I don't understand the code, and hence better not touch it; or touch it but with a bad feeling that I might be breaking something.

This is why I ask whether it is really necessary, and if so, that a comment be added explaining it. Even if that comment says something like "// It is unclear whether we need sig_atomic_t here, but just out of paranoia, we use it anyway".

That said, I'd still prefer if it was not used; I honestly can't think of any example where it might be used that is not horribly unsafe anyway. So if you can think of any, I'd be most interested to hear about it, even if it is just a rough sketch.

The argument that one "could" need it seems week to me: with that argument, perhaps we should also add a mutex protecting access to EnterStackCount (or require C11 atomics, or something like that)? After all, multiple threads might want to use the GAP API. Sure, it's unlikely that this can work, but one could try it anyway.... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure on what basis you can assert that it "adds considerable...complexity". Exactly what complexity is it adding? sig_atomic_t is standard POSIX, and all it means is that it's an instruction to the compiler to only manipulate this variable with single instructions. Furthermore, according to the GCC docs, "In practice, you can assume that int is atomic...these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of."

It's really just a guarantee that if this code happens to be interrupted by a signal (with a signal handler that returns) it will leave things in a sane state. For a signal handler that doesn't return--i.e. exits the process or longjumps to somewhere arbitrary) you can't make any further guarantees, but at the very least you can safely increment and decrement it even if a signal occurs. That's all.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the very least you can safely increment and decrement it even if a signal occurs. That's all.

Just to clarify this: the type sig_atomic_t means that it can be read from atomically and written to atomically. Any operation more complicated than that (such as an increment/decrement) may not be atomic. For example, the operations EnterStackCount = -EnterStackCount; and EnterStackCount++; may not be atomic.

So I actually agree with @fingolfin here: better not make any guarantees about atomicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True--in that case it's probably best avoided. I know in Cysignals itself sig_block()/unblock() just set a flag that determines what to do in the actual signal handler should a signal occur. But in the case of libgap we want to leave most details of signal handling up to the embedding application, so it would be tricky to implement something like this in a way that can be sensibly embedded in arbitrary applications. So perhaps it's best to just keep it simple.

Regarding your idea, I'm not saying it can't be done. I just don't understand exactly what you're proposing. Keep in mind these macros should be able to work properly recursively. Though the exact value of EnterStackCount is less important than just tracking whether or not we're in a recursive call.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obvious question: do we really need to support nesting for these macros? It would certainly simplify things if we could say that nesting is not allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory one could write code such that they never have to be nested. For example there are cases where we could refactor a function some_func_that_uses_gap() into some_func_that_uses_gap_outer() and `some_func_that_uses_gap_inner(), where the latter can be called directly in other code that is already bracketed by the Enter/Leave macros.

So technically it's possible. But it does put a higher strain on the user in exactly how they structure their code, and it's ugly (imagine if sig_on and sig_off could not be nested). You would also have to actively prevent the user from nesting them somehow if it did not work (otherwise this would lead to hard to trace bugs). I think it's better to allow nesting. Any complication that arises from that seems better and less probable than not allowing it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I propose a simple and concrete solution:

  • In GAP, don't try to make these macros signal-safe. Do you know about other applications using libGAP which also need to handle signals?
  • In SageMath, wrap GAP_Enter/GAP_Leave in sig_block/sig_unblock calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is essentially the status quo (we don't have the Sage part yet but that's easy, and I agree with it). So I'm fine with that for lack of a better solution.

Perhaps at most this should be documented, and left to individual users to worry about. E.g. for the GAP-Julia interface something might want to be done, but it would be just as Julia-specific as Cysignals is Python-specific. Other applications may not care as much about handling signals at all.

src/libgap-api.h Outdated Show resolved Hide resolved
src/libgap-api.c Outdated Show resolved Hide resolved
src/libgap-api.h Outdated Show resolved Hide resolved
src/libgap-api.h Outdated Show resolved Hide resolved
if (buffer)
printf("%s\n", buffer);
ok = GAP_Enter();
if (ok) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... I wonder if the generic fallback code for GAP_EnterStack will work in this scenario? After all, we only reference the location of that particular stack variable. But we don't bother walking back to the start of the stack of the current function. So e.g. res could be outside the stack range considered by the GASMAN stack scanner, couldn't it? In which case there'd be a risk of a crash again.

To test this, I'd disable the GNU C version for it, and then change the loop below to perform some allocations and trigger garbage collections explicitly (we may need to expose a GAP_GC(int full) function for that; it would also be useful for some other uses, I believe).

Anyway, the "fix" in calling code would be to move all local variables storing GAP object references to inside the block between GAP_Enter and GAP_Leave (and to explicitly document this, too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you might be right. Do you even know off the top of your head (I'm not sure) if there is any guarantee whatsoever that there is any relation between the order some local variables are declared in C code, and what order they are placed on the stack (of course it could use registers for some of them too in which case it's irrelevant)? AFAIK the C standard doesn't even know anything about a "stack" as that's a platform-specific implementation detail. But realistically speaking this discussion is only relevant where there is one.

So I think there's only so much we can do about this--it's just a best guess but not fool-proof.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already talked about exposing CollectBags in #3030.

It's not exactly clear to me what you want me to do here though. Just manually test this case, or actually update the test to test this case in general somehow (e.g. have a copy of these tests that are compiled in such a way that forces the fallback)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C standard definately puts no requirements on the order of variables on the stack, and will mix them up however it likes. If you are capturing a reference to a local variable in your fallback, the old remotely safe option would be to call GAP_enter, and then call a non-inlined function to actually do the work.

I'd probably just not put in a fallback which captures a reference to a local variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I'm 100% positive that fallback is going to lead to hard-to-track and horrible crashes, if there are any GAP variables in the current function scope, or in any inlined functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and of course we don't expect you, @embray , to provide fallbacks fo "every imaginable architecture". My stance on that is that it's better to wait for somebody who actually uses those architectures anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that in practice most compilers do seem to allocate locals in order of definition on the stack

I have noted this as well, at least with gcc. That's why I thought it was an "acceptable" fallback in the first place.

but taking an address of it and passing it to a non-inline function should avoid that

It does, again at least in the case of gcc (and actually I can't think of any other reasonable thing for any compiler to do in that case: how are you supposed to pass a pointer to a value in a register?) And of course, for locals that the compiler puts in registers this is a moot point (GASMAN handles that case separately with its setjmp hack).

So indeed, it seems best to remove the generic fallback, and wait for people to complain that their favorite compiler does not work anymore with it.

+1 I don't like it, but it does seem like the safest option for now, alas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW Boehm GC does use a similar fallback (they also declare the dummy variable volatile, which I intended to do as well but forgot): https://github.com/ivmai/bdwgc/blob/c50de12ab045b57953e152545836c23f76e7bc24/mark_rts.c#L502
It also implements a GC_get_stack_base() for quite a number of different platforms, that I think is also supposed to work properly on per-thread stcks.

they would also break GAP by itself. If this ever happens, we'll have to go back to the drawing board anyway

I think that may be the case anyways. Is there really an advantage to continuing to maintain a separate GC when something more heavily used and robust like Boehm can be used? Or at least, maybe, a customized fork thereof if there's really some fine-tuning necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gasman is much faster than Boehm, and various people's attempt to tune it haven't closed the gap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth at least borrowing bits from, such as those linked above, where applicable (and assuming license compatibility). It would be great, longer-term, to get rid of this EnterStack/LeaveStack stuff entirely.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 82.862% when pulling e19c181 on embray:libgap/enter-leave into ea00cc2 on gap-system:master.

@coveralls
Copy link

coveralls commented Jan 8, 2019

Coverage Status

Coverage decreased (-0.005%) to 84.638% when pulling fef88b4 on embray:libgap/enter-leave into 9d485d6 on gap-system:master.

@embray
Copy link
Contributor Author

embray commented Jan 10, 2019

The one failed Travis build appears to be a fluke:

+WGET='wget -N --no-check-certificate --tries=5 --waitretry=5 --retry-connrefused'
++uname
+[[ Linux == Darwin ]]
+make bootstrap-pkg-full 'WGET=wget -N --no-check-certificate --tries=5 --waitretry=5 --retry-connrefused'
--2019-01-08 16:44:23--  https://www.gap-system.org/pub/gap/gap4pkgs/packages-master.tar.gz
Resolving www.gap-system.org (www.gap-system.org)... 138.251.206.223
Connecting to www.gap-system.org (www.gap-system.org)|138.251.206.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 330711621 (315M) [application/gzip]
Saving to: ‘packages-master.tar.gz’
packages-master.tar  97%[==================> ] 307.30M   336KB/s    eta 48s    
The job exceeded the maximum time limit for jobs, and has been terminated.

Not sure if a coverage decrease of -0.9% is real or not.

@ChrisJefferson
Copy link
Contributor

I'm sorry to ask, but you could rebase on top of master? We've had some churn of tests and coverage, and it would be nice to have a clean check?

If you aren't happy with that, I can do it for you (just because we've messed things around for a while!)

@fingolfin
Copy link
Member

(I'd also consider squashing it down into a single commit; but that's not a requirement, we can perform that squash when merging it, too)

@embray
Copy link
Contributor Author

embray commented Jan 18, 2019

Okay, I'll rebase and squash.

@embray
Copy link
Contributor Author

embray commented Jan 18, 2019

Rebased and squashed; also updated the commit message a bit to more accurately reflect the final product.

src/gasman.c Outdated Show resolved Hide resolved
Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from my POV, this could be merged now (ideally, I'd like to rename MarkStackBottomBags, but I can do that after this PR is merged, too).

Would be nice to hear if @ChrisJefferson and @markuspf are OK with this, too. But if I hear no objections from anybody soon, I'll just merge this :)

@ChrisJefferson
Copy link
Contributor

I'm happy for this to be merged, we can always cycle more later

@embray
Copy link
Contributor Author

embray commented Jan 21, 2019

Why don't I just go ahead and rename MarkStackBottomBags. I don't think it's that important, but I also understand the rationale so if you want to change it might as well do it now before merging.

…bjects in

code which embeds libgap

There are two parts to this:

First, the outer-most GAP_Enter() must set the StackBottom variable for GASMAN,
without which objects tracked by GASMAN that are allocated on the stack are
properly tracked (see gap-system#3089).

Second, the outer-most GAP_Enter() call should set a jump point for longjmps to
STATE(ReadJmpError).  Code within the GAP kernel may reset this, but we should
set it here in case any unexpected errors occur within the GAP kernel that are
not already handled appropriately by a TRY_IF_NO_ERROR.  Or, more precisely,
since TRY_IF_NO_ERROR does not currently *clear* the jump buffer it sets even
when leaving the block without any errors found, a subsequent error that is not
within a TRY_IF_NO_ERROR block can still result in a jump to the last
TRY_IF_NO_ERROR block which was entered, which can be quite arbitrary (e.g.
inside IntializeGap).

For the first issue, we add GAP_EnterStack() and GAP_LeaveStack() macros which
implement *just* the StackBottom handling without any other error handling.  We
also add a function to gasman.c called SetStackBottomBags which just updates
the StackBottom variable.  Then the GAP_EnterStack() macro can be used within a
function to set StackBottom to somewhere at or near the beginning of that
function's stack frame.  This uses GCC's __builtin_frame_address, but supported
is probably needed for other platforms that don't have this.

The global variable EnterStackCount in libgap-api.c is used to track recursive
calls into GAP_EnterStack().  We only want to set StackBottom on the outer-most
call.  The count is decremented on GAP_LeaveStack().

For setting the STATE(ReadJmpError) jump buffer we provide a macro called
GAP_Error_Setjmp() which is fairly straightforward, except that it needs to be
written in such a way that it can be used in concert correctly with
GAP_EnterStack().  In particular, if returning to the site of a
GAP_Error_Setjmp() call we do not want to accidentally re-increment the
EnterStackCount.

Finally, the higher-level GAP_Enter() and GAP_Leave() macros are provided: The
latter is just an alias for GAP_LeaveStack(), but the former carefully combines
*both* GAP_Error_Setjmp() and GAP_EnterStack().  Although called like a
function, the GAP_Enter() macro expands to a compound statement (necessary for
both GAP_EnterStack() and GAP_Error_Setjmp() to work properly).  The order of
expansion is also deliberate so that this can be used like:

    jmp_retval = GAP_Enter();

so that the return value of the setjmp call can be stored in a variable while
using GAP_Enter(), and can be checked to see whether an error occurred.
However, this requires some care to ensure that the following GAP_EnterStack()
doesn't increment the EnterStackCount following a return to this point via a
longjmp.

Also updates the libgap API tests to demonstrate usage of GAP_Enter/Leave().

Note: The whole EnterStackCount manipulation is really really only in service
to GASMAN and can be disabled entirely when compiled for a different GC, but it
still has some use as a sanity check on GAP_Enter/Leave() so for not it is
mostly but not entirely a no-op on GAP built with other GCs.
@fingolfin fingolfin added topic: kernel topic: libgap things related to libgap labels Jan 22, 2019
@fingolfin fingolfin merged commit e868e29 into gap-system:master Jan 22, 2019
@embray embray deleted the libgap/enter-leave branch January 23, 2019 11:45
@embray
Copy link
Contributor Author

embray commented Jan 23, 2019

Thank you! 🥂

@fingolfin
Copy link
Member

Thank you

@fingolfin
Copy link
Member

Backported via PR #3232

@olexandr-konovalov olexandr-konovalov added this to the GAP 4.10.1 milestone Feb 23, 2019
@olexandr-konovalov olexandr-konovalov added the release notes: added PRs introducing changes that have since been mentioned in the release notes label Feb 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-4.10-DONE release notes: added PRs introducing changes that have since been mentioned in the release notes topic: kernel topic: libgap things related to libgap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants