Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow disabling the GAP kernel's signal handling when using the API #3072

Merged
merged 1 commit into from
Feb 1, 2019

Conversation

embray
Copy link
Contributor

@embray embray commented Nov 30, 2018

Not being able to do this easily is a problem for embedding in other applications that have their own signal handling. This can be worked around by saving/restoring the signal handling state before/after GAP_Initialize, but it would be better if we could just disable GAP from setting its own SIGINT handler completely.

@embray
Copy link
Contributor Author

embray commented Nov 30, 2018

Oops, doesn't even compile. Weird, I must have accidentally hit 'x' in vim or something.

@embray embray force-pushed the api/optional-signal-handling branch from 7417a58 to 8a9207f Compare November 30, 2018 10:56
@embray
Copy link
Contributor Author

embray commented Nov 30, 2018

Need to fix libgap-api test.

Alternatively, if we don't want to modify the signature for GAP_Initialize already, I can add a GAP_InitializeEx or something that adds the signal-handling flag. Given that this API isn't very firm yet anyways I don't know how much you want to start modifying it.

Either way some version of this fix would be nice to have backported to 4.10.x (cc @alex-konovalov)

@embray
Copy link
Contributor Author

embray commented Dec 3, 2018

I also just discovered the ChildStatusChanged SIGCHLD handler that is installed by InitKernel in src/iostreams.c. This functionality should also prevent that from happening, at least on initialization. But that might take a little more refactoring; have to think about it. The SIGCHLD handle is more necessary than the SIGINT handler, at least for code that uses CREATE_PTY_IOSTREAM, but perhaps in a way that integrates better with other systems.

Has this been handled in any way by any of the other GAP interfaces?

@ChrisJefferson
Copy link
Contributor

The IO package handles this. The appropriate function is CheckChildStatusChanged(pid,status) -- pass a process ID and the return value of waitpid, A non-zero return value means "That was one of my children".

@embray
Copy link
Contributor Author

embray commented Dec 3, 2018

I don't know what you mean "The IO package handles this". The problem is that merely initialing the GAP library should not, at least optionally (as in this PR), register any signal handlers in the first place.

For cases like calling StartChildProcess it might be nice if rather than clobber any existing SIGCHLD handler it would wrap an existing one, but I don't have strong feelings about that. I'm more concerned that it gets registered even if it's not immediately needed.

@embray
Copy link
Contributor Author

embray commented Dec 11, 2018

What I mean is, what would be the appropriate context in which to call CheckChildStatusChanged? My own SIGCHLD handler? When should I install such a handler? Should this be exposed by the libgap API?

I wonder if it would make sense if, rather that unconditionally installing this SIGCHLD handler, it only did so upon an initial call to StartChildProcess, and even then it would take care to wrap any existing SIGCHLD handler as well, and uninstall itself once all child processes started by StartChildProcess have been reaped. It would also help, in the library use case, if it avoided the "Collect up any other zombie children" loop, since it seems to cause problems in processes that have child processes having nothing to do with GAP.

@ChrisJefferson
Copy link
Contributor

CheckChildStatusChanged should be called if you install your own SIGCHLD handler, to pass information about any SIGCHLD events you receive to GAP. It could be added to the libgap-abi. We could also add some kind of global switch to stop the signal handler being installed.

Unfortunately avoiding the "collect up any other zombie children" loop, or removing the SIGCHLD once we are done, is a major change, because GAP at the moment doesn't track children it doesn't care about anymore, but just throws all information about them away.

Note that this won't solve problems with the IO package, which installs it's own signal handler. It overwrite's GAPs, and then uses CheckChildStatusChanged to pass signals to GAP (we do it that way around, which might seem strange, because the IO package does keep track of the children it cares about, so it can check all children it cares about before forwarding any remaining children onto GAP). However, this is only installed if you use the functions in the IO package which create children, like IO_fork, which I suspect will create other horrible issues in Sage anyway, as forking GAP in Sage would be, I guess, a bad idea?

@embray
Copy link
Contributor Author

embray commented Dec 14, 2018

However, this is only installed if you use the functions in the IO package which create children, like IO_fork, which I suspect will create other horrible issues in Sage anyway, as forking GAP in Sage would be, I guess, a bad idea?

I'm not too worried about the IO package right now since it's not part of the standard GAP install in Sage. But it is important to deal with sooner or later.

I don't think "forking GAP" is such a problem. When GAP is just being used as a library you're not "forking GAP", you're just forking the whole process, which happens all the time for various reasons and generally works fine.

@olexandr-konovalov
Copy link
Member

This has to be rebased.

@embray embray force-pushed the api/optional-signal-handling branch from 8a9207f to 99fe35c Compare January 7, 2019 15:02
@codecov
Copy link

codecov bot commented Jan 7, 2019

Codecov Report

Merging #3072 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3072      +/-   ##
==========================================
+ Coverage   85.16%   85.16%   +<.01%     
==========================================
  Files         696      696              
  Lines      344234   344235       +1     
==========================================
+ Hits       293159   293160       +1     
  Misses      51075    51075
Impacted Files Coverage Δ
src/libgap-api.h 50% <ø> (ø) ⬆️
src/gap.c 81.52% <100%> (ø) ⬆️
src/libgap-api.c 25.87% <100%> (ø) ⬆️
src/system.c 73.81% <100%> (+0.08%) ⬆️

@embray
Copy link
Contributor Author

embray commented Jan 7, 2019

Rebased. I still think handleSignals=0 should also prevent the iostreams module from installing its SIGCHLD handler, but I'm not sure the best way to go about that.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 83.116% when pulling 99fe35c on embray:api/optional-signal-handling into 823bd43 on gap-system:master.

@coveralls
Copy link

coveralls commented Jan 7, 2019

Coverage Status

Coverage increased (+4.0e-05%) to 85.001% when pulling 1ba3127 on embray:api/optional-signal-handling into 2013da1 on gap-system:master.

@olexandr-konovalov
Copy link
Member

What's the status of this? This has to be rebased (again)...

@embray
Copy link
Contributor Author

embray commented Jan 31, 2019

Mostly just waiting on approval...

My question in #3072 (comment) could use some response if anyone has any ideas. I don't know what the right way is--if any--to pass flags set via InitializeGap(...) down to other kernel modules.

But resolution of that issue can still come separately, as this already partially fixes the issue.

@olexandr-konovalov
Copy link
Member

@embray it can not be approved without rebasing and then analysing up-to-date CI tests.

@embray embray force-pushed the api/optional-signal-handling branch from 99fe35c to 74ff0a5 Compare January 31, 2019 15:23
@embray
Copy link
Contributor Author

embray commented Jan 31, 2019

Rebased, anyways.

@olexandr-konovalov
Copy link
Member

olexandr-konovalov commented Jan 31, 2019

Now test fails:

make testlibgap
   C       tst/testlibgap/basic.c => obj/tst/testlibgap/basic.lo
tst/testlibgap/basic.c: In function ‘main’:
tst/testlibgap/basic.c:8:5: error: too few arguments to function ‘GAP_Initialize’
     GAP_Initialize(argc, argv, 0, 0);
     ^
In file included from tst/testlibgap/common.h:4:0,
                 from tst/testlibgap/basic.c:4:
./src/libgap-api.h:154:6: note: declared here
 void GAP_Initialize(int              argc,
      ^
tst/testlibgap/basic.c: At top level:

@embray
Copy link
Contributor Author

embray commented Jan 31, 2019

Yes. That's weird. I thought I fixed those.

@embray
Copy link
Contributor Author

embray commented Jan 31, 2019

I see. I have a local copy of this branch on a different computer where I did fix that.

@embray embray force-pushed the api/optional-signal-handling branch from 74ff0a5 to 1ba3127 Compare January 31, 2019 18:13
@fingolfin fingolfin merged commit eab7111 into gap-system:master Feb 1, 2019
@fingolfin fingolfin added topic: kernel release notes: not needed PRs introducing changes that are wholly irrelevant to the release notes backport-to-4.10-DONE topic: libgap things related to libgap and removed backport-to-4.10 labels Feb 1, 2019
@fingolfin
Copy link
Member

Backported to 4.10 in 9b73f71

@jdemeyer
Copy link

There is still the SIGCHLD handler which needs to be dealt with.

@embray embray deleted the api/optional-signal-handling branch March 25, 2019 13:24
@embray
Copy link
Contributor Author

embray commented Mar 25, 2019

@jdemeyer Yep, I never got a satisfactory answer on what to do about that.

@jdemeyer
Copy link

Note that the problem with GAP's SIGCHLD handler is different: Sage doesn't need any particular SIGCHLD handler, so the problem is not that GAP overrides a pre-existing SIGCHLD handler. The problem is instead with the handler itself: it calls waitpid() on all processes, even those that have nothing to do with GAP. This is a pretty severe problem and breaks most use cases of child processes.

@ChrisJefferson
Copy link
Contributor

Can you explain the problem with calling waitpid on all processes?

@ChrisJefferson
Copy link
Contributor

Unfortunately, a very long-standing GAP issue (which is, I believe, unfixable at this point, as too much code relies on the current behaviour) is that GAP spawns child processes which it simply "forgets" about. Obviously when those processes end, they become zombies. Therefore at some point those processes have to be cleaned up with waitpid, else we will run out of slots.

Does sage require nothing ever calls waitpid(-1,...)? I'm afraid if that's the case, we may have a very difficult to fix problem.

@jdemeyer
Copy link

Does sage require nothing ever calls waitpid(-1,...)?

Absolutely. It is quite reasonable to waitpid() for a process that you created, for example to wait (literally) until it is finished or to get its exit status. But you can't wait twice for a process: if GAP does that, then Sage can no longer do that.

@jdemeyer
Copy link

GAP spawns child processes which it simply "forgets" about.

Is that really so difficult to fix? Are there that many places where GAP spawns child processes?

@embray
Copy link
Contributor Author

embray commented Mar 25, 2019

Note: This is not just about "sage" but any software that integrates GAP and that happens to spawn processes. E.g. a generic python-gap interface would have severe problems when trying to use Python's multiprocessing module, because of this.

This will be a problem for gap-julia as well.

@ChrisJefferson
Copy link
Contributor

One could in principle find all the places where the kernel (and kernel extensions) stop keeping track of processes, add all those PIDs to a list, then go through that list calling waitpid on every element. However, that would obviously be a lot more work than using waitpid(-1) when GAP is being used as normal.

@jdemeyer
Copy link

Where are those places where a child process is created and then forgotten? I tried to look for that, but couldn't find anything in the GAP sources (excluding HPC-GAP).

There is only one place where fork() is called: in StartChildProcess in src/iostream.c. There, the PID is stored in PtyIOStreams which is handled separately in the SIGCHLD handler (so that's OK):

static void ChildStatusChanged(int whichsig)
{
    UInt i;
    int  status;
    int  retcode;
    assert(whichsig == SIGCHLD);
    HashLock(PtyIOStreams);
    for (i = 0; i < MAX_PTYS; i++) {
        if (PtyIOStreams[i].inuse) {
            retcode = waitpid(PtyIOStreams[i].childPID, &status,
                              WNOHANG | WUNTRACED);
            if (retcode != -1 && retcode != 0 &&
                (WIFEXITED(status) || WIFSIGNALED(status))) {
                PtyIOStreams[i].changed = 1;
                PtyIOStreams[i].status = status;
                PtyIOStreams[i].blocked = 0;
            }
        }
    }
    HashUnlock(PtyIOStreams);

#if !defined(HPCGAP)
    /* Collect up any other zombie children */
    do {
        retcode = waitpid(-1, &status, WNOHANG);
        if (retcode == -1 && errno != ECHILD)
            Pr("#E Unexpected waitpid error %d\n", errno, 0);
    } while (retcode != 0 && retcode != -1);

    signal(SIGCHLD, ChildStatusChanged);
#endif
}

There is also one vfork() call: in SyExecuteProcess in src/sysfiles.c and that actually needs to work around the SIGCHLD problem:

    /* turn off the SIGCHLD handling, so that we can be sure to collect this child
       `After that, we call the old signal handler, in case any other children have died in the
       meantime. This resets the handler */

    func2 = signal( SIGCHLD, SIG_DFL );

I don't find any calls of posix_spawn. Are there any other fork()-like system calls?

@ChrisJefferson
Copy link
Contributor

ChrisJefferson commented Mar 25, 2019

There have been moves in the last few versions of GAP to peck away at these problems. Having a look, we might actually have about cleaned most of them up.

We do use popen in the profiling code, but I think it cleans up after itself.

The big remaining problem is IO_fork in the IO package. The IO package calls waitpid in a list, and makes a list of the PIDs it finds. GAP needs to "clean up" the PIDs of the IO package makes if it's signal handler is uninstalled as well.

@fingolfin
Copy link
Member

Folks, discussing these things on a PR that was closed almost two month ago is a great way to make sure few people will see it or participate...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-4.10-DONE release notes: not needed PRs introducing changes that are wholly irrelevant to the release notes topic: kernel topic: libgap things related to libgap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants