Alternate precise GC implementation #1057
Conversation
Do you have the
No, the RTInfo is only generated for bare structs and classes. But "shared" could be available in the type info passed to the GC in most invocations.
With the current way of working with shared (i.e. casting it away while being protected by some mutex), I don't have a clue how this can work with thread-local operations. Also note that immutable is implicitly shared.
It seems like the only important part would be when using
immutable objects would be unable to carry thread-local instances (as per the D syntax)... which seems to be the case already. Of course, there doesn't seem to be much to do other than branch off the allocate function and change the Gcx to show that the local GC requires neither locks, signals, nor "stop the world"...
How is this going to work? Consider a shared array shared(T[]) locked by a mutex, then some factory function newing T without being aware of sharedness. The thread-local T is added to the array cast to T[]. A thread-local GC would not find that reference and would free the instance of T.
That absolutely can't be permitted by the compiler. The compiler has to force all underlying types to be implicitly shared, or they're invalid, i.e.
@etcimon This compiles with no errors, and is not a bug.

```d
void main()
{
    shared x = new Object();
}
```
It should be one. Leaving this semantic check out makes no sense. Storing a thread-local pointer in shared storage is like sending a local string pointer to a shared DB for later. You can jump through many hoops to make it dereference properly, but you'll need some really tight coupling to keep it working. This is exactly why the GC is currently shared and locks/stops the world even for local allocations: someone believed local pointers should be allowed to be in shared storage.
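The escape hazard described in this exchange can be sketched roughly as follows. The names (`registry`, `makeLocal`, `publish`) are invented for illustration and are not from the PR:

```d
// A minimal sketch of the hazard: a thread-local reference escaping into
// shared storage via the usual cast-away-shared idiom.
shared Object[] registry;   // shared storage, assumed guarded by a mutex elsewhere

Object makeLocal()          // factory unaware of sharedness
{
    return new Object();    // a thread-local GC would allocate this locally
}

void publish()
{
    // casting away shared lets the thread-local reference escape into
    // shared storage; a GC that scans only this thread's heap and roots
    // could free the object while other threads still reach it via registry
    auto arr = cast(Object[]) registry;
    arr ~= makeLocal();
    registry = cast(shared Object[]) arr;
}
```

Nothing here uses explicitly unsafe features beyond the cast, which is exactly the idiom the language currently sanctions for working with shared data.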
It was designed for the single global GC that the language currently requires. Another example is this:

```d
Object maker() pure
{
    return new Object();
}

void main()
{
    immutable x = maker();
}
```

The function that calls `maker` may implicitly convert the result to immutable because `maker` is pure. Neither of these examples involves casting or other explicitly 'unsafe' features. I don't see how a proposal for thread-local heaps that simply declares these use cases invalid and breaks existing code would ever be accepted.
It doesn't need to break existing code. The dual tls/shared support in the GC can be handled with something like:

```d
Object maker() pure
{
    return new Object();
}

void main()
{
    immutable x = maker().idup();
    shared y = maker().sdup();
}
```

There are countless performance issues that can be solved simply by avoiding the locking. D could beat Java in the benchmarks with this change alone. http://forum.dlang.org/thread/jhbogjnxmcpjmemgaigs@forum.dlang.org
The tests fail because of https://issues.dlang.org/show_bug.cgi?id=8262
I think most of the locking for allocations can be removed with a few CAS operations or just thread-local memory pools that are still scanned by the "global" garbage collection. |
Scanning involves locking as well if the garbage collection isn't thread-local.
Only seldom, and for a short period of time, if scanning is done concurrently. I agree that a thread-local GC would be nice to have, but I don't see how it fits with the current language.
The best way to achieve a thread-local GC would be to improve and enforce this. If even a 30% speedup is possible down the line, it's worth it. The more threads, the bigger the improvement. I have over 90k lines of D code to maintain and I'm ready to put in the work for this on druntime as well.
I actually implemented a bloom filter at some point and tried to use it to eliminate false pointers; it was much slower and only found very few false pointers in the GC benchmark suite. Almost all false pointers were already caught by the earlier checks. The other thing you can do with a precise GC is to skip non-pointer fields in an allocated object.
That's also my experience, but it depends a lot on how much memory is actually used and how much random data is scanned.
Considering that 32-bit programs are quite a bit faster (e.g. the benchmarks take 43 sec instead of 52 sec for 64-bit on my system), people might still prefer it sometimes.
Knowing the exact size of an object or array can also help a lot. It can avoid scanning up to half of the allocation block. We might also skip zeroing that area during allocation.
Yeah, it's pretty hard to predict which version is better. It's also system-dependent. The TypeInfo/RTInfo is probably used quite often, so it should usually be in the L1/L2 cache.
It will be crucial to store the bitmaps very efficiently.
It might also be interesting to add a statistics option to the GC that reports what kind of data is allocated (currently classes, NO_SCAN and the rest; later also structs and arrays) and the mean and standard deviation of the percentage of pointers in each allocation (grouped by kind and size).
I'm not so sure. Most of the bitmaps will fit into a cache line anyway. Maybe it's even better to store a series of bytes with alternating "number of 1s" and "number of 0s" so you can easily jump over areas without pointers. Alternatively, we might just check the bitmap after the
That could also be the task of the linker (identical comdat folding on the RTInfo). |
Right, but compressed they will fit in fewer cache lines, so you'll cause less cache pollution during GC scanning.
Coming up with an efficient RLE will be tricky; anyhow, that's fairly advanced and should only be done if necessary, or as an optional optimization step.
The deduplication is mandatory IMO, and ICF only works with Microsoft's linker (and is currently disabled due to Issue 10664), so please handle this explicitly.
See pp. 8-10 of http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them; that's fairly trivial, efficient, and should be close to optimal (though we don't know the distribution).
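The alternating run-length idea floated above could look roughly like this (`runLengthEncode` is a hypothetical name; the byte stream alternates 0-run and 1-run lengths, starting with the count of leading 0s):

```d
// Hypothetical sketch: encode a pointer bitmap as alternating run lengths
// of 0s and 1s. A run longer than 255 is split by emitting 255 followed by
// a zero-length run of the other value.
ubyte[] runLengthEncode(const(bool)[] bits)
{
    ubyte[] runs;
    bool cur = false;           // encoding always starts with a run of 0s
    ubyte len = 0;
    foreach (b; bits)
    {
        if (b == cur && len < ubyte.max)
            ++len;
        else if (b == cur)      // run overflowed one byte
        {
            runs ~= len;        // emit 255
            runs ~= 0;          // zero-length run of the other value
            len = 1;
        }
        else                    // value flipped: close the current run
        {
            runs ~= len;
            cur = b;
            len = 1;
        }
    }
    runs ~= len;
    return runs;
}
```

During marking, the decoder would advance the scan pointer by whole 0-runs without touching memory, which is the point of the scheme.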
32 bytes for each type? The bitmap will only be 1 byte in many cases.
Why is this size_t and not ubyte? How many different mark functions do you have?
Maybe it's better to turn this into a ubyte* pointing to [MarkType, length (varint), bitmap bytes] or [MarkType, bitmap bytes, sentinel].
We currently have about 8000 types in the unittest libphobos and 2000 in the release one; that's quite a number of relocations to add. It would be great to use non-global symbols here (hidden visibility) to avoid relocations during loading, but that's currently not possible in D.
32 bytes for each type? The bitmap will only be 1 byte in many cases.
Even if the bitmap needs only 1 byte, alignment will make it 16 bytes. I chose size_t[] because of the handling in markPrecise, but that could be changed depending on what's faster.
Why is this size_t and not ubyte? How many different mark functions do you have?
There are currently 3: default, dynamic array, associative array. But as there is no "emplace" in this version of the GC, I'd like to allow the user to specify a mark delegate to scan void[N] fields precisely according to the emplaced types.
Ubyte has an alignment of 1 byte.
Well, then you could have a fourth type meaning "user callback"; that would be the only one requiring alignment.
One other option would be to generate a mark function per type by string mixins. That's an idea that @andralex wanted to pursue. It will wreck compilation speed, though. I currently suspect most of the performance in the benchmark is lost to a lot of indirections when marking arrays of strings, which iterate over small types (char[] = { length, ptr }).
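The per-type mark function idea can be sketched as compile-time reflection over a type's fields. Everything here is hypothetical (`markPrecise` and the `markPtr` delegate are stand-ins, not the PR's API; a real implementation would likely be generated code rather than a template):

```d
// Hypothetical sketch of a per-type precise mark function; markPtr stands
// in for the GC's "enqueue this candidate pointer" operation.
void markPrecise(T)(ref T obj, scope void delegate(void*) markPtr)
{
    foreach (ref field; obj.tupleof)
    {
        alias F = typeof(field);
        static if (is(F == class))
            markPtr(cast(void*) field);           // class reference
        else static if (is(F == U*, U))
            markPtr(cast(void*) field);           // raw pointer
        else static if (is(F == E[], E))
        {
            markPtr(cast(void*) field.ptr);       // array payload
            static if (is(E == struct))
                foreach (ref e; field)
                    markPrecise(e, markPtr);      // recurse into elements
            else static if (is(E == D[], D))      // array of arrays, e.g. string[]
                foreach (e; field)
                    markPtr(cast(void*) e.ptr);
        }
        else static if (is(F == struct))
            markPrecise(field, markPtr);          // recurse into nested struct
        // purely non-pointer fields (int, char[8], ...) are skipped entirely
    }
}
```

The string[] branch illustrates the indirection problem mentioned above: each element is a tiny { length, ptr } pair, so the per-element work is dominated by call overhead unless the loop is flattened.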
While working on the (array of) struct destructors I realized that the array marking needs at least 2 versions: one for the array layout in lifetime.d, and one for plain arrays as created by std.array.Appender. Maybe RTInfo!T should be evaluated in the scope of the type to allow specialization. That could open some options to set up specific mark functions without implementing everything in the runtime.
Why should the runtime be concerned about a Phobos type's misuse of the runtime?
If it is a common idiom (e.g. Appender) to allocate arrays of objects like this, we need a way to specify how to scan them. It might also be C-malloc'ed memory added as a GC range. Even if I wouldn't care if Appender were dumped, I still think that if precise scanning is done through the TypeInfo, a struct/class must have the ability to specify how a class emplaced into a field has to be scanned. The other precise GC does this by modifying the pointer bitmap stored by the GC; this version needs to specify a mark callback function.
OK, as long as the marking function is defined and maintained in Phobos, that sounds correct.
As far as bitmap compression goes, I will say that there is nothing, as far as I know, in the spec that prevents the compiler from re-ordering the fields of a non-decorated (non-
Because the discussion came up again, a summary of my points:
I think there are lower-hanging fruit to speed up the GC, e.g. background/parallel finalization, or a vectorized (SIMD) loop in mark.
A few answers interspersed with as many questions (I'm not very familiar with the current implementation).
Indeed I could not find much information about that. Searching for quoted "false pointers" yields little data (a subset of which points to discussions within the D community), because so few languages have them! I fear this might be a political issue - we're the only language that sticks with conservative GC when all other languages and systems use precise GCs and build upon them.
A more interesting case is structures with pointers on one cache line and no pointers on the next one. Consider:

```d
struct Customer {
    string lastname;
    string firstname;
    string address;
    string zipcode;
    int id;
    char[8] ssn;
    ...
}
```

An array of such structures alternates between cache lines that contain pointers and cache lines that don't. Another question: what is the cost per word to decide whether it is a real pointer or a false pointer?
I agree bit patterns are not the best way to go about characterizing data shapes. We've discussed in the past the variant of generating a mark function per allocated type. So instead of bit patterns we have specialized code that knows how to scan types (and arrays thereof) transitively.
In the scheme sketched above, the only indirect calls are for roots. All others are direct calls because the functions are generated with full type information.
At some point surely prefetching will become slower than loading a fraction of the pages. I think we'd need to test this. Until then I think it's difficult to base decisions on it.
Historically, framing things as education issues is problematic. Also, data schemata often grow by accretion: one member here, one member there, and each member grows in size as features are added to the code. I don't think we can count this as a strong argument.
That'd be great indeed, but it has a hard limit on what you can do. One theme in precise GC is that there are more levers for improving it than for a conservative one.
For the most part these don't compete with precision and would improve either approach.
Erm, the point about false pointers is factually incorrect. While yes, in theory, a 64-bit address space should "solve" the false pointer problem, the reality is, as always, a little more complicated. There was a recent thread on the forum by someone doing big-data work who ran into a false pointer problem in a 64-bit app: http://forum.dlang.org/thread/mafpzsuiuhzlriixymvy@forum.dlang.org My guess is that while the OS allocator could randomly map addresses inside the virtual address space, it does not. I strongly suspect the behavior is more along the lines of: find free space in physical memory, then map the address to the next available block of contiguous free space in the virtual address space. This would necessarily involve mostly filling up the 32-bit-sized space before spilling over into 64-bit space. As we move into an era of larger data sets and the need to accurately work on them, I fear that the mantra "64-bit solves false pointers!" is going to be proven false at an ever-increasing rate.
A MAJOR win for precise or mostly-precise GC is the ability to compact all (resp. most) of the heap. This eliminates (resp. dramatically reduces) fragmentation.
Also, the precise GC has already been written – it is probably a good idea to use it!
I don't really buy this. Just adding std.datetime to the parsed files of the vdparser test shows a process memory usage varying from 400 to 900 MB for the x64 benchmark under Win64 (address randomization is likely to cause more trouble there than the usual addresses under Linux). That doesn't happen with the precise GC. I agree with most of what Andrei said, but not this part:
I don't think transitivity works in D: it moves the memory type description from the allocation to the referrer, but that information is incomplete (a class reference doesn't know about derived classes) and ambiguous (multiple referrers might use different pointer types, especially in system code). You have to keep the type information for every block anyway (in case it becomes a root), so tracking whether a transitive scan covered the actual allocation type seems like duplicated effort. Please note that the implementation in this PR is rather slow due to the number of indirections. Converting the scan function into CTFE-generated code (making compilation even slower than the bitmap generation used to be before it was converted into a trait) just reduces the number of indirections by one. I still prefer #1022, but it might need some optimizations for saving the pointer bitmap during allocation. Tuning these with dmd's weak optimizations was frustrating, though. Large arrays might use a simple scheme to save the pattern just once.
Interesting, re class references - so for classes you'd need an indirect call (including a hidden virtual method). Regarding storing different pointer types, what cases do you have in mind? For
Couldn't that be significant?
The void* pointer was my main concern, too. Unions of different pointer types are similar. Most other pointer casts are just temporary and might only become a problem if the stack is scanned more precisely. Other problems:
I'm not sure. IIRC it was also painful to have to call back into the GC to follow references. It could help if some of this could be inlined, but that would disallow swapping the GC without recompilation. If there is interest, I can try to rebase this PR for some experiments.
There is a huge difference between false pointers (as in a random number pinning an array) and stale pointers on the stack or in registers. The latter is a real problem even for 64-bit, but this PR doesn't solve it.
Mostly a single comparison; see https://github.com/D-Programming-Language/druntime/blob/a83323439398a31126a33043bdd6c578bbc46433/src/gc/gc.d#L2022-L2027.
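The check being linked can be illustrated with the classic unsigned range trick (function and parameter names here are invented, not druntime's):

```d
// Illustrative only: rejecting a candidate word against the GC's pooled
// address range costs a subtraction and one unsigned comparison.
bool mightBePointer(size_t word, size_t heapLo, size_t heapHi)
{
    // equivalent to: heapLo <= word && word < heapHi
    return word - heapLo < heapHi - heapLo;
}
```

Only words that pass this cheap filter proceed to the per-pool lookup, which is why the per-word cost of conservative scanning stays low.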
Well, code will require even more memory than bit patterns, and requires the same deduplication.
Yes, but we have limited resources, with which this does compete.
Let's please keep guesswork out of discussions; it's just wasting the time of everyone involved.
Yes, compaction is interesting. And fully solving the problem of precise register/stack marking would make this much more interesting. Just precise heap/root marking alone isn't that interesting.
We certainly can't, but you can easily craft a benchmark to test this (https://github.com/D-Programming-Language/druntime/blob/a83323439398a31126a33043bdd6c578bbc46433/benchmark/gcbench/tree1.d). Also, the question is not only what you win for some types, but how much you'd lose for others.
@MartinNowak JavaScriptCore and Steel Bank Common Lisp use a mostly-copying collector, where objects that are pointed to only by other heap objects (the vast majority) can be moved, while objects that are pointed to by registers or the stack are pinned. This provides good performance in practice. So precise heap/root marking is already a big win, even if the stack and roots are still marked conservatively. This is also shown by the dramatic reduction in memory leaks on 32-bit systems. Also, I think that fully precise GC might actually be feasible, at least for LDC. LLVM already supports precise, moving collectors thanks to work done at Azul Systems, which is being used by an LLVM backend for CoreCLR that is intended to become production-quality at some point. GDC and DMD will need more work, though, unless one wants to use a shadow stack (don't – slow).
Outdated and unlikely to perform reasonably fast, closing. |
This version of the precise GC saves type info in allocated memory.
The memory footprint of this version differs from that of the original precise GC (#1022), which allocated a pointer bitmap alongside every memory pool (one bit per size_t):
Some more notes:
The druntime benchmark suite runs a few percent slower than the non-precise version, unless it hits false pointers. That doesn't seem to happen very often for the test suite, though.
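A rough footprint comparison of the two approaches can be sketched from the descriptions above. The numbers and helper names are illustrative only, derived from "one bit per size_t" (#1022) versus "type info stored in the allocation" (this PR):

```d
// Illustrative back-of-the-envelope comparison, not actual druntime code.
//
// #1022: pool-wide bitmap, one bit per size_t word
//   => on 64-bit, 1/64 of the pool, roughly 1.6 % regardless of block size
size_t bitmapOverheadBytes(size_t poolBytes)
{
    return poolBytes / (8 * size_t.sizeof);   // one bit per word
}

// this PR: one hidden type-info reference per allocation
//   => 8 bytes per block on 64-bit: 25 % for a 32-byte block,
//      under 1 % for a 1 KiB block
size_t perBlockOverheadBytes(size_t numBlocks)
{
    return numBlocks * size_t.sizeof;         // one hidden slot per block
}
```

So the in-allocation scheme favors workloads with large blocks, while the pool bitmap has a flat, size-independent cost.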