conservative stack scanning? #11714
Comments
The only thing I know about conservative vs. precise stack scanning is based on this, and I quote.
I know that we are probably even further from compacting but I'd like to know what the experts on these think... |
Our large amount of C interop makes it fairly unlikely that we'll ever be able to do moving GC. |
It's not clear that we need to move objects. If we did, it would already require a lot of work, since the current rooting API assumes non-moving. That article seems to assume that moving is required for generational GC, but that isn't strictly true. You only need it for compacting or for a copying implementation of generational gc. For us it's possible the conservative stack scan would be a bigger win than moving objects, especially given how important C interop is. |
To the pros I would add that we would be able to do many optimizations we do for bits types on immutable types with no additional work. That's huge.

On the moving debate: we will never be able to do a fully relocating collector, because of C interop, as Stefan said. However, it would be perfectly OK to do a so-called "mostly copying" scheme, and we would need conservative stack scanning (to pin "C" objects) anyway.

The thing I don't agree on is that the chance of something on the stack falsely looking like a pointer is slim. The problem is that the stack is never cleaned up, so as it grows back you can end up overwriting parts (because of padding and alignment) of what was a pointer before. If you overwrite only the LSB part, you are, on the contrary, almost guaranteed to still point inside a (different) Julia object.

On the implementation side, the part taking care of pooled objects is already "done" in my PR (which is where I encountered the partial-pointer issue, by the way). It's the easy part, though, because pooled objects are fixed size. To do this for "big" objects we need a data structure to query whether a pointer falls into a range of memory. |
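For illustration, such a query can be as simple as keeping the big-object allocations in a sorted array of half-open address ranges and binary-searching each candidate pointer. A minimal sketch with purely illustrative names (this is not how Julia's collector actually tracks big objects):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t start, end; } range_t;   /* one big object: [start, end) */

typedef struct {
    range_t *ranges;   /* non-overlapping, kept sorted by start address */
    size_t   count;
} bigobj_index_t;

/* Returns the range containing addr, or NULL if addr is not inside any big object. */
static const range_t *bigobj_find(const bigobj_index_t *idx, uintptr_t addr) {
    size_t lo = 0, hi = idx->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < idx->ranges[mid].start)
            hi = mid;                       /* addr is below this range */
        else if (addr >= idx->ranges[mid].end)
            lo = mid + 1;                   /* addr is above this range */
        else
            return &idx->ranges[mid];       /* addr falls inside: a possible root */
    }
    return NULL;
}
```

In practice the array would be updated on allocation and free of big objects; anything fancier (a radix tree over pages, say) is an optimization of the same query.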
We can only properly evaluate false retention once we have a full implementation, and measure memory use in real workloads. Until then I'm not worried. |
As we already discussed, I'm not worried too much about the memory usage of false retention, but about the angry "my finalizer was not run after gc()" crowd. Anyway, you're right, we would need to implement big object conservative scanning first to see it in real life. |
Well, you know how I feel about that. The only real solution is to use finalizers less and rely on them less. |
@JeffBezanson If you don't have reliable finalization, how do you handle externally managed memory reliably? Since C interop is so important, you need some way to be able to call C release code before something is GCed. Or am I missing something? |
@carnaval I agree that there'd be a high probability of seeing false positives of things looking like valid pointers still on the stack. |
"high probability" of seeing at least one when the programs run, sure. Now the thing which is hard to evaluate is : how long would false positive remain on the stack (after all if it is freed at the next call who cares), and with what density. |
Depends a lot on the application... if you have processes whose lifetime is measured in months or years, with the amount of stack fluctuating a lot, I think it might be more of an issue... |
How reliable would it be to insert some mark on the stack to distinguish managed code and unmanaged code? @carnaval Is the "almost moving" scheme a compacting collector that pins everything on the stack? |
You can mix conservative and precise stack frames, but the point of this is to get rid of the gcframe entirely, both in the runtime and in the generated code. We can also have conservative scanning of C frames and precise scanning of Julia frames using the LLVM statepoint machinery (more work).

The mostly copying stuff is a general technique: you just have a part of the roots you are not sure about. It can be the whole stack or only a part of it, or even some heap objects you wouldn't know the layout of (although we don't need this). The canonical mostly copying collector (that I know of) is http://www.cs.cornell.edu/home/fms/mcc/bartlett.html if you want to use copying as a marking segregation ("replacing" the marked bit / mark queue). Copying has other uses: compaction, generation segregation, ... |
@carnaval Thanks for the explanation (and the pointer). I didn't know that the rooting could be classified at such a fine grain before. My feeling was that compacting and the growing ability of LLVM to identify GC roots were the two things incompatible with conservative stack scanning. Good to know that both of them are still possible. (Although they are still as hard to implement as they currently are...) |
It's important that if we choose to do conservative stack scanning now, we get all of the benefits immediately and can do more work later to get back a partial ability to do moving GC – which we're not even doing right now. |
I totally agree, and I found it very promising when @carnaval first mentioned it to me. It's just that when I want to find a more detailed study about it, the internet is flooded with moving collectors and there are not many places even mentioning non-moving generational collectors..... |
I'm trying to get a better understanding of this concern. Is the issue that you might have a pointer stored in a register or stack frame and then reuse the low byte or something like that and end up with the same register holding something that still looks like a pointer but points at something else entirely? I should mention that at some point I had talked with @carnaval about wanting the ability to have interior pointers into an object pin the whole object (particularly useful with arrays). That would obviously make it much more likely that some accidentally pointer-like value would pin an object since you don't have to point exactly at the head of the object – pointing anywhere inside of it pins it. But I think the benefits of conservative GC outweigh that, so I'd be willing to forgo interior pointers for conservative stack scanning. |
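To make the interior-pointer case concrete, here is a small illustrative C sketch (the gc_array_t layout and gc_alloc_array are hypothetical): only an address pointing into the middle of the array reaches the callee's frame, so a conservative scan that honours interior pointers must map it back to the containing object and pin the whole array.

```c
#include <stddef.h>

typedef struct {           /* a hypothetical GC-managed array object */
    size_t length;
    double data[];
} gc_array_t;

extern gc_array_t *gc_alloc_array(size_t length);   /* hypothetical allocator */

static double sum(const double *p, size_t n) {
    /* Only an interior pointer is live in this frame; with interior-pointer
     * support the conservative scan still keeps (and pins) the whole array. */
    double s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

double example(void) {
    /* The array itself is never stored in a named local of this frame. */
    return sum(&gc_alloc_array(1000)->data[500], 500);
}
```

The flip side, as noted above, is that any stack word landing anywhere inside any object would then pin it, not just words that equal an object's header address.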
In that case, are we sure that compiler optimizations will not optimize out a store of the original pointer if we do not explicitly root it? |
Right, that's another concern – can we force LLVM to store pointers in a recognizable form on the stack? |
But then this doesn't feel so much different from the current approach. I would expect a naive implementation to have a performance hit similar to the current rooting. It is probably a different story with LLVM GC support, but I guess that's not even on the table at this point.... And we would also need to do the same with our C/C++ code. |
You have to conservatively scan registers too. |
jsc (WebKit's JS engine) actually uses a Bartlett-style collector with LLVM, so it can be done. Regarding false retention, their blog post says (under "Integrating LLVM with garbage collection"):
But Julia workloads are probably a bit different from JS workloads. |
Thank you, though obviously I also understand that your priority has to be the long-term health of Julia. I don't actually need a solution right now (we're on what is essentially a 12-year project) and GAP is just one part of the whole thing where interoperability is a concern (which is why I'd like to focus on more general issues of interoperability), but we're at a stage where we're developing a roadmap so we need to figure out our options and their respective viability.
This is something that I had actually thought about. A minimal set of hooks that permitted this would be the following (I think):
The first three would be generally good ideas to have for better C interop, I think – and I'm planning on writing up a separate feature request for those – but the last two would expose internals of Julia and would at a minimum have to come with a big "can be deprecated at any time" warning label (any such external library would have to depend on some implementation details of the Julia GC, anyway). |
What other alternatives do you need? There's
You also need to patch your code to issue write barriers. |
The current safepoint mechanism (again, correct me if I'm wrong) only allows for one point in time to be marked as a safepoint, not an interval. Like JIT-compiled code in the JVM, it will do a forced load from a specific page that triggers a segfault and the signal handler will then proceed to report to the GC thread that it is no longer working on any GCed memory. However, this does not account for cases such as long-running computations in third-party libraries or I/O where you cannot inject regular calls to
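For reference, the page-protection safepoint mechanism described above works roughly like the following minimal sketch (illustrative only, not Julia's actual implementation): mutator threads periodically load from a dedicated page, and the collector revokes read access to that page when it wants every thread to stop.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static volatile char *safepoint_page;   /* readable, except while a stop is requested */

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    if (info->si_addr == (void *)safepoint_page) {
        /* A real implementation would report "this thread is parked", block here
         * until the collector restores PROT_READ, and only then return so the
         * faulting load is retried and succeeds. */
        return;
    }
    /* Otherwise it is a genuine crash: restore the default handler and re-raise. */
    signal(SIGSEGV, SIG_DFL);
    raise(SIGSEGV);
}

static void safepoint_init(void) {
    safepoint_page = mmap(NULL, 4096, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);
}

/* Emitted at regular intervals in managed code. */
static inline void gc_safepoint(void) {
    (void)*safepoint_page;               /* faults only while a stop is requested */
}

/* Called by the collector around a stop-the-world phase. */
static void gc_request_stop(void) { mprotect((void *)safepoint_page, 4096, PROT_NONE); }
static void gc_release_stop(void) { mprotect((void *)safepoint_page, 4096, PROT_READ); }
```

The point raised above stands: code that never executes such a poll, for example a long-running foreign library call, has to be bracketed by explicit GC state transitions instead.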
Obviously. I was talking about what was needed on the Julia side of things, and |
That's a GC state transition, and we do have that with |
What I mean is that you cannot do this without patching the C code. Not even with conservative stack scanning. |
Ah, thanks for the correction.
Yes, I am well aware of that. |
It's unclear who you're replying to here. |
The comment right above it. In particular, the first sentence, which is what the comment is suggesting, AFAICT
|
OK, so this is why I don't think that's a good reason for doing a conservative stack scan. FWIW, there's usually many more heap mutations than local variables (obviously there's more stack mutations but the
Other requests all more or less make sense,
This is what a simulated ref counting interface is used for (a rough sketch of such an interface follows below, after this comment). A way to add callbacks to supply roots is fine too, but I think a ref counting API is easier to use.
So we already do. Incidentally, this is also why I don't want to have an API for the next one right now, since this is the part that still needs a lot of tweaking.
This actually hooks much deeper into the GC than any other ones. It's in principle doable but certainly not via a stable API anytime soon.
Actually not sure what this one means.
I assume this is either (1) or conservative scanning. So this would be either a no, or already covered.... |
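As referenced a few replies up, a "simulated ref counting" rooting interface can be sketched as follows (purely illustrative, not an existing Julia API): foreign code bumps a count to keep an object treated as an extra GC root and drops the count when it is done with the object.

```c
#include <stddef.h>

#define MAX_EXTERNAL_ROOTS 1024

typedef struct { void *obj; size_t count; } ext_root_t;
static ext_root_t ext_roots[MAX_EXTERNAL_ROOTS];   /* scanned as extra roots at every GC */

void gc_preserve(void *obj) {           /* callable from foreign code at any time */
    for (size_t i = 0; i < MAX_EXTERNAL_ROOTS; i++) {
        if (ext_roots[i].obj == obj || ext_roots[i].obj == NULL) {
            ext_roots[i].obj = obj;
            ext_roots[i].count++;
            return;
        }
    }
    /* a real implementation would grow the table instead of silently dropping */
}

void gc_release(void *obj) {
    for (size_t i = 0; i < MAX_EXTERNAL_ROOTS; i++) {
        if (ext_roots[i].obj == obj) {
            if (--ext_roots[i].count == 0)
                ext_roots[i].obj = NULL;    /* no longer keeps the object alive */
            return;
        }
    }
}
```

The callback variant discussed above would instead have the collector call back into the foreign library at mark time and let it enumerate its current roots.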
For what it is worth, GAP already has issues with GMP (a dependency). Their solution at present is to reallocate GMP integers before each call to the mpz interface so that no reallocation occurs within calls to GMP. It's probably technically feasible to do a similar thing with MPFR, but it is not a general solution for all C libraries, obviously. |
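The workaround described above amounts to something like the following hedged sketch (a hypothetical wrapper; it peeks at GMP's internal _mp_alloc field, and the limb bound shown is specific to addition): grow the destination before the call so that GMP never allocates internally.

```c
#include <gmp.h>

/* Hypothetical GC-aware wrapper around mpz_add: make sure the destination
 * already has enough limbs so that the mpz call below cannot reallocate. */
static void gc_aware_mpz_add(mpz_ptr dest, mpz_srcptr a, mpz_srcptr b)
{
    /* an addition needs at most max(size(a), size(b)) + 1 limbs */
    mp_size_t need = (mp_size_t)(mpz_size(a) > mpz_size(b) ? mpz_size(a)
                                                           : mpz_size(b)) + 1;
    if (dest->_mp_alloc < need)
        _mpz_realloc(dest, need);   /* reallocation happens here, under our control */
    mpz_add(dest, a, b);            /* now guaranteed not to allocate inside GMP */
}
```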
I have to disagree here, based on my own experience with working with code that uses manual insertion of write barriers. Importantly, the proper use of write barriers is amenable to bespoke static verification even in C code (bespoke, because you may have to adjust your tools for the specifics of how a particular piece of C code represents data), which incurs no runtime penalty other than the write barrier itself. This is different from explicitly exposing local variables.
Both approaches have their pros and cons, obviously, but a callback has the advantage that the roots can be computed lazily, which helps if they get updated more frequently than collections occur. In any event, I think this may be a discussion for a different place, once I get to write up a feature request for it. |
Runtime penalty is not what I'm talking about. I'm only talking about
which is the only correctness issue that doesn't already have a clear solution not requiring conservative scanning, and I'm just saying that conservative scanning can't do that either, so that's not a good reason to do it. For performance, I'll repeat that we should not do conservative scanning unless we've tried all other solutions. |
Very well. There are three distinct issues here:
As for performance, conservative stack scanning avoids separate GC frames entirely.

Regarding verification, the correctness of write barriers is largely a local property (and can, in the absence of tools, also be ascertained manually through code review). In order to ensure that a sufficient subset of local variables is put on the GC stack along all possible program paths, you need global analysis, probably combined with interprocedural dataflow analysis. This would have to work across multiple source files and with function pointers.

I believe that ease of adaptation is also better (by which I don't mean it's trivial). Recall that we only need a write barrier when we're writing a pointer to a heap object. Assuming we can supply GC roots, we do not need a write barrier for assigning to a global or thread-local variable; this already reduces the amount of code that we have to deal with. Also, the write barrier only has to be performed before the next safepoint, which gives us a fair amount of leeway.

For local pointer variables we have to do more; recall that I'm not just talking about calling C code from Julia, where Julia references only occur as arguments at defined entry points of a module, but about C -> Julia -> C -> Julia call chains, where foreign objects or global variables can contain Julia references. This means I cannot just guard the entry points to a C module, but – assuming that I don't want to use
I don't disagree insofar as Julia code itself is concerned. But I am talking about the performance of foreign code here, which is a separate issue. |
Local variable rooting is also pretty local, since local variables have defined behavior at function call sites. Write barriers are also not as local, since there are cases where we remove a write barrier due to known old/young objects.
This is not going to be the case.
From local information, you can do only a subset of it. |
When it comes to GMP and MPFR, we've considered having these types implemented as native Julia objects instead and only calling out to the libraries for complex operations. And of course, there's a ton of performance to be gained by immediate representation of smaller values – which presumably is what you're using the C-union between pointer and immediate values with tagging for. We should really work towards having efficient Julia-native support for this and only call out to GMP/MPFR when necessary. That would eliminate a lot of these problems in a much cleaner and more efficient way. |
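For readers unfamiliar with the tagging scheme alluded to above, here is a rough sketch of a pointer/immediate word (illustrative only; not GMP's, GAP's, or Julia's actual representation, and it assumes an arithmetic right shift): small integers live directly in the word with the low bit set, and anything with the low bit clear is an aligned pointer to a heap-allocated bignum.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t num_t;   /* either a tagged small integer or an aligned pointer */

static inline bool     num_is_small(num_t v)      { return (v & 1) != 0; }
static inline num_t    num_from_small(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
static inline intptr_t num_small_value(num_t v)   { return (intptr_t)v >> 1; }
static inline void    *num_as_pointer(num_t v)    { return (void *)v; }  /* low bit is 0 */
```

An arithmetic routine then checks the tag first and only falls back to the bignum library when an operand is a pointer or the result overflows the immediate range.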
Unfortunately, that is often insufficient. Without looking at callers, any argument to a function can be the sole live reference to an object, which means that if you're using purely local analysis, at a minimum all pointer arguments need to end up in GC frames. A simple example is: f(pop(&global_stack)); Generally, we can eliminate
Write barriers are also relatively rare, as they are only needed for pointer writes to the heap (for example, GAP, which has its own generational GC with manually inserted write barriers, has about 700 write barriers in some 160 KLOC worth of C code). They are also less of a constraint for the optimizer, as a write barrier can be just another read+write to the same part of the heap that was already updated (often the same cache line), whereas GC frames interfere with register allocation and live variable analysis. And if there's a performance problem along a hot path with write barriers, that can still be manually tuned.

I also note again that precise stack scanning doesn't help with write barriers, which are an orthogonal issue that arises on top of finding GC roots stored in local variables. Even if we adopted precise stack scanning for foreign code, the insertion of write barriers would still be necessary. |
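As a concrete picture of the kind of barrier being discussed, here is a generic generational write-barrier sketch (illustrative names; this is neither Julia's jl_gc_wb nor GAP's actual macro): a cheap flag check plus, rarely, an append to a remembered set, which can be statically elided when the generations of both objects are already known (the old/young special cases mentioned earlier).

```c
#include <stddef.h>

typedef struct gcobj {
    unsigned old       : 1;   /* promoted to the old generation */
    unsigned in_remset : 1;   /* already recorded in the remembered set */
    /* ... object payload ... */
} gcobj_t;

#define REMSET_CAP 4096
static gcobj_t *remset[REMSET_CAP];   /* old objects that may point to young ones */
static size_t   remset_len;

/* Executed after every store of a heap pointer `child` into a field of `parent`. */
static inline void write_barrier(gcobj_t *parent, gcobj_t *child) {
    if (parent->old && !child->old && !parent->in_remset) {
        parent->in_remset = 1;
        remset[remset_len++] = parent;   /* rescanned as a root at the next minor GC */
    }
}
```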
This is invalid.
Right. And even if you implement conservative scanning, the insertion of write barriers would still be necessary too so
Isn't true. |
Hmm. I believe that we may be simply having some miscommunication here.
By rewrite I literally meant rewrite, as in wholesale alterations; I don't mean that no alterations are necessary, just that they are kept at a manageable level. Not needing any changes at all would be unrealistic: even dropping in a fully conservative GC (such as the Boehm GC) cannot generally be done entirely without changes; you may have to worry about zeroing a free list to prevent leaks, for example. The rationale behind this is the software engineering cost involved, which need not be zero, just reasonable. Of particular interest is the case where one controls the upstream version, too, and Julia integration is essentially a configuration flag. I can see how this could be read differently, but that's not what I was getting at. |
OK. Still. I think #11714 (comment) is a better way forward since it is way too early to expose any part of the julia GC to a public API. |
@rbehrends: my understanding from what you said previously was that for at least some of your dependencies, literally no changes could be made – down to the shared binary library level. If that's not the case, that certainly opens up many more options. |
On Wed, May 17, 2017 at 11:45:19AM -0700, Stefan Karpinski wrote:
"When it comes to GMP and MPFR, we've considered having these types implemented as native Julia objects instead and only calling out to the libraries for complex operations. [...]"
That is what GHC does in its integer-gmp library. |
@StefanKarpinski It's not that we're dealing with a single situation. We will likely have to deal with both dependencies where we indeed can't change a thing (GMP has been mentioned) and where we have to work around this limitation, but we'll also have code that's custom-written for interacting with Julia where we can do whatever we want; for yet other dependencies we may have different stakeholders with different (and possibly conflicting) goals involved. But that's why I wanted to keep focused on the technical aspects of FFIs, with a look towards having some flexibility in how to deal with interoperability. Having more or more flexible options generally means that one is less constrained by backwards compatibility concerns, software engineering constraints, or (in the worst case) project politics. |
Having to provide more flexible options also means that Julia itself will have to be constrained by all of the above. As a recent example, I couldn't have done #21590 without breaking public API, or without having to emulate it, if any part of the GC marking internals were exposed.

For interfacing with C, I believe the best option is to use C libraries the way C does. The C libraries that are already written are usually written with C memory management in mind and are meant to be used that way, so I believe we'll get the largest benefit if we can provide better support for that, and that's exactly #11714 (comment). C libraries that can be modified and adapted to Julia should do that by incrementally porting the C code to Julia. Work is being done to make that easier (e.g. #21912, which will make it much easier to declare and manipulate nested C structures). Even for libraries that can't link with Julia, if they have a reasonable low-level API we should be able to port away from their high-level APIs and implement some of those in Julia. This is effectively what we want to do with BigInt.

Neither of these two approaches relies on deep changes to the language, and both can benefit all users instead of one particular use case. On the other hand, if we try to efficiently support the memory management of one particular C library, it'll put a lot of constraints on what we can do/change in Julia. Note that this is talking about literally the lowest-level implementation in Julia, since the GC is the only part of the runtime that doesn't call into other major parts (other than true external dependencies). There's also usually no standard way to do all kinds of advanced tricks (e.g. tagged pointers) in C, so having direct support for them would not be useful for FFI in general unless we decide to also make use of them in Julia itself for other reasons.

We are quite different from many other scripting languages like Python in that you should write glue code in Julia and not C. This is possible thanks to the JIT and is also why PyPy prefers cffi over cpyext. If there's something missing that makes it harder to wrap C code efficiently, the preferred way to fix it is to support that in Julia rather than exposing more C APIs (i.e. the two ways mentioned above). |
@yuyichao I think our project has quite different requirements than interfacing with a C library. Gap is written in C, but it is itself a garbage-collected (interpreted) programming language with additional dependencies. A large amount of effort has already been expended by @rbehrends to write bespoke tooling to insert write barriers in the Gap codebase (@rbehrends is the principal architect of HPC Gap). You've indicated that it may be too early to formalise any Julia GC API. But my gut feeling is that we may still be better off offering to contribute to the design and implementation of an experimental/unsupported interface in a branch, rather than abandon entirely the approach @rbehrends has been steering us towards.

@StefanKarpinski Our project is a collaboration between existing projects, including those of us who are already implementing in Julia. We don't have control over the repository of Gap or its dependencies. Some of their key developers are excited about Gap<->Julia interop and visited us for a week to hack on some potential ways to do that. |
So it cannot be used in C? |
@yuyichao That is one of the things we investigated with the Gap developers when they visited recently. It turns out that there is a project called libgap, which will make it possible to call into Gap from C, essentially. We prototyped calling this from Julia (without GC considerations for now), and calling Julia from Gap (again, in a simple way, without GC considerations). Most of the functionality in Gap is actually implemented in the Gap language, not the C kernel of Gap. This is one of the things that makes Julia so attractive for them. Gap is not JIT compiled, but interpreted, in a very dynamic way. Julia would be fantastic as a Gap library language for serious developers. |
I should probably add that Gap is only one of a number of languages that we will make use of in our collaboration. Another is Singular, where @rbehrends has again invested much expertise and time into concurrency for us. There is also Polymake, which is basically a hacked version of Perl 5. So Gap should be seen as an example of something more general, not a singularity. We do also have a number of large C libraries that we are also using, so we of course care about good C interop too. So the suggestions you and @StefanKarpinski are making with regard to C are also of interest to us. |
This might be a bit out there, but have you considered writing a Gap interpreter in julia? |
@Keno Yes, in fact a Gap developer actually started working on such a thing, but unfortunately abandoned it. At present we haven't allocated funding to do such a thing for Gap, but we have allocated funding to do such a thing for Singular, which is a much simpler language. The real problem with Gap is the exceptionally dynamic nature of its type system. It doesn't map 1-1 with Julia. The much more interesting problem, for them and us, appears to be using Julia as a library language. Having said that, there is great interest in rewriting a mathematical language, called Cap, which is currently implemented in Gap, in Julia. That will likely go ahead, from what I heard most recently. Sorry to get right off topic here. |
This and related approaches (such as JIT-compiling GAP code or having a statically typed and compiled extension of the GAP language) have been considered in the past, but so far, this has gone nowhere. Challenges would be:
I'll also note that if it were just about GAP <-> Julia interoperability, we could simply ship a patched Julia with GAP (enabled to support conservative stack scanning). But the Oscar platform is intended to be more modular and extensible than that, so we'd like to get interop right. (Separately, there's still the current issue of performance related to GC frames within Julia proper, which is an independent concern, but my understanding is now that these issues will be addressed through LLVM-supported precise stack scanning.) |
Instead of maintaining our own GC stack frames – which is slow and easy to mess up, likely accounting for a large portion of our tough-to-track-down core Julia bugs in the past years – we've occasionally talked about doing conservative stack scanning. That is, scanning the stack and conservatively assuming that anything that happens to look like a pointer to a valid heap-allocated Julia object actually is.
Pros:
- throw could take arguments without creating a GC frame (WIP: Exceptions with arguments #11508, Throw exceptions with arguments without creating a GC frame #11284)

Cons:
- random values on the stack that happen to look like pointers to valid heap objects will keep those objects alive
- C code can stash the only reference to a Julia object somewhere the stack scan cannot see it
The second con seems like it can be avoided with the same kind of rooting we do now, except only needed at the ccall entry point. The chances of the first con happening – i.e. random values on the stack pointing at valid heap objects – seem pretty slim. There might be situations like when you take a pointer to an array and then use it – but in that case you probably want the array to remain rooted while someone has a pointer to it. Are there other potential cons that I'm not seeing?
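For concreteness, the scan itself amounts to something like the following minimal sketch (heap_lookup and gc_mark_and_pin are hypothetical helpers standing in for the pool/big-object queries and the marking machinery discussed in the comments above):

```c
#include <stdint.h>

/* Hypothetical helpers: map an address to the object containing it (or NULL),
 * and mark an object while also pinning it so it can never be moved. */
extern void *heap_lookup(uintptr_t addr);
extern void  gc_mark_and_pin(void *obj);

/* Scan one thread's stack, treating anything that might be a pointer as a root. */
void conservative_scan(const uintptr_t *stack_lo, const uintptr_t *stack_hi) {
    for (const uintptr_t *p = stack_lo; p < stack_hi; p++) {
        void *obj = heap_lookup(*p);   /* exact-header or interior match, per policy */
        if (obj)
            gc_mark_and_pin(obj);      /* a false positive merely keeps obj alive */
    }
}
```

Registers would be scanned the same way after spilling them to memory, as noted in the comments.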
Related: #10725, #10317.
cc: @carnaval, @vtjnash, @JeffBezanson