This repository was archived by the owner on Oct 12, 2022. It is now read-only.

Conversation

@rainers
Member

@rainers rainers commented Dec 5, 2014

This version of the precise GC saves type info in allocated memory.

The memory footprint of this version is different from the original GC (#1022) which allocated a pointer bitmap alongside every memory pool (one bit per size_t):

  • one bit per possible allocation block (i.e. per 16 bytes for standard pools, per 4kB for large object pools)
  • if the allocation block is neither NO_SCAN nor scanned conservatively, one pointer at the end of the allocation block; i.e. this only adds additional memory if the allocation size is just below a power of 2, but then the block size doubles.

Some more notes:

  • The runtime overhead for allocation is less because no pointer bitmap has to be filled.
  • The mark phase has to go through an indirection to get the pointer bitmap of a type, but it now has the ability to scan only the memory actually requested and to skip padded memory.
  • The RTInfo can contain a delegate to call instead of standard marking. This is used for arrays to only scan the currently used area of the allocation block. It also makes it easy to change the memory interpretation for large/small arrays, which caused trouble in the other GC. For AAs this currently only skips the hash value; there is too little type information available to scan key and value precisely.
  • compilation does not yet use the optimized trait (Enhancement: add trait getPointerBitmap to help precise scanning dmd#4192).

The druntime benchmark suite runs a few percent slower than the non-precise version, unless it hits false pointers. That doesn't seem to happen very often for the test suite, though.

@rainers rainers force-pushed the gc_precise_nov14_ti branch from 60ba003 to 08a647b on December 5, 2014 14:46
@etcimon
Contributor

etcimon commented Dec 5, 2014

Do you have the shared type qualifier in RTInfo? I've been planning for a while to have an instance of the GC that is locked for shared allocations. I'd put every other type through a thread-local, lockless instance where only one thread is blocked during marking. Would that seem like a good direction for the precise GC? It's very nice to see this implementation going forward, thanks :-p

@rainers
Member Author

rainers commented Dec 5, 2014

Do you have the shared type qualifier in RTInfo?

No, the RTInfo is only generated for bare structs and classes. But "shared" could be available in the type info passed to the GC in most invocations.

would that seem like a good direction for the precise GC?

With the current way to work with shared (i.e. to cast it away while being protected by some mutex), I don't have a clue how this can work with thread local operations. Also note that immutable is implicitly shared.

@etcimon
Contributor

etcimon commented Dec 5, 2014

With the current way to work with shared (i.e. to cast it away while being protected by some mutex), I don't have a clue how this can work with thread local operations.

It seems like the only important part would be when using new shared T, sending it to a shared Gcx shared_gc, whereas new T would send it to Gcx local_gc.

Also note that immutable is implicitly shared.

immutable objects would be unable to carry thread-local instances (as per the D syntax)... which seems to be the case already

Of course, there doesn't seem to be much to do other than branch off the allocate function, and change the Gcx to show that the local GC doesn't require locks, signals nor "stop the world"...

@rainers
Member Author

rainers commented Dec 5, 2014

It seems like the only important part would be when using new shared T, sending it to a shared Gcx shared_gc, whereas new T would send it to Gcx local_gc.

How is this going to work? Consider a shared array shared(T[]) locked by a mutex, then some factory function newing T without being aware of sharedness. The thread local T is added to the array cast to T[]. A thread local GC would not find that reference and free the instance of T.
It would require write barriers on every pointer write to transfer objects between thread local and global heaps.

@etcimon
Contributor

etcimon commented Dec 5, 2014

How is this going to work? Consider a shared array shared(T[]) locked by a mutex, then some factory function newing T without being aware of sharedness. The thread local T is added to the array cast to T[]. A thread local GC would not find that reference and free the instance of T.

That absolutely can't be permitted by the compiler. The compiler has to force all underlying types to be implicitly shared, or they're invalid, i.e. typeid(shared(T[])) is typeid(shared(shared(T)[])). The shared type qualifier absolutely has to apply recursively and implicitly. The only way you could move a T[] obj into shared T[] would be through a convenience function that copies the data into shared space, like obj.sdup() or makeShared(obj) for shared duplication. This would work somewhat like Isolated! and makeIsolated! from vibe.core.concurrency, but duplicating the data to the shared GC.

@yebblies
Contributor

yebblies commented Dec 6, 2014

@etcimon This compiles with no errors, and is not a bug.

void main()
{
    shared x = new Object();
}

@etcimon
Contributor

etcimon commented Dec 6, 2014

@etcimon This compiles with no errors, and is not a bug.

void main()
{
    shared x = new Object();
}

It should be one. Leaving this semantic check out makes no sense. Storing a thread-local pointer on shared storage is like sending a local string pointer to a shared DB for later.
e.g.
    string str = "hey";
    db.save("str", &str); // stores "0xffff0e12"

You can jump through many hoops to make it dereference properly, but you'll need some really tight coupling to keep it working. This is exactly why the GC is currently shared and locks/stops the world even for local allocations: someone believed local pointers should be allowed to live in shared storage.

@yebblies
Contributor

yebblies commented Dec 6, 2014

It was designed for the single global GC that the language currently requires. Another example is this:

Object maker() pure
{
    return new Object();
}
void main()
{
    immutable x = maker();
}

The function that calls new has no idea that the allocation will become shared.

Neither of these examples involve casting or other explicitly 'unsafe' features. I don't see how a proposal for thread-local heaps that simply declares these use cases invalid and breaks existing code would ever be accepted.

@etcimon
Contributor

etcimon commented Dec 6, 2014

I don't see how a proposal for thread-local heaps that simply declares these use cases invalid and breaks existing code would ever be accepted.

It doesn't need to break existing code. The dual tls/shared support in GC can be versioned or linked externally, and the libraries get an optional warning from the compiler to become progressively compatible. You should (ideally) know about your storage when you create your instance, or at least copy it there...

Object maker() pure
{
    return new Object();
}
void main()
{
    immutable x = maker().idup();
    shared y = maker().sdup();
}

There are countless performance issues that can be solved simply by avoiding the locking. D could beat Java in the benchmarks with this change alone.

http://forum.dlang.org/thread/jhbogjnxmcpjmemgaigs@forum.dlang.org

@rainers
Member Author

rainers commented Dec 6, 2014

The tests fail because of https://issues.dlang.org/show_bug.cgi?id=8262

@rainers
Member Author

rainers commented Dec 6, 2014

There are countless performance issues that can be solved simply by avoiding the locking.

I think most of the locking for allocations can be removed with a few CAS operations or just thread-local memory pools that are still scanned by the "global" garbage collection.

@etcimon
Contributor

etcimon commented Dec 6, 2014

that are still scanned by the "global" garbage collection.

scanning involves locking as well if the garbage collection isn't thread local

@rainers
Member Author

rainers commented Dec 6, 2014

scanning involves locking as well if the garbage collection isn't thread local

Only seldom, and for a short period of time, if scanning is done concurrently.

I agree that a thread local GC would be nice to have, but I don't see how it fits with the current language.

@etcimon
Contributor

etcimon commented Dec 6, 2014

I agree that a thread local GC would be nice to have, but I don't see how it fits with the current language.

The best way to achieve a thread-local GC would be to improve and enforce shared-correctness in Phobos/druntime (at first). We need to start considering shared as a heap storage attribute as well, for consistency. An optional compiler warning (through a flag) would be a start.

If even a 30% speedup is possible down the line, it's worth it. The more threads, the more improvements. I have over 90k lines of D code to maintain and I'm ready to put in the work for this on druntime as well.

@MartinNowak
Member

I actually implemented a bloom filter at some point and tried to use it to eliminate false pointers; it was much slower and only found very few false pointers in the GC benchmark suite. Almost all false pointers were already caught by the p >= minAddr && p < maxAddr guard. Sure, false pointers are a slightly bigger problem on 32-bit systems, but now that even mobile phones are rolling out with 64-bit CPUs I wonder how important that still is.

The other thing you can do with a precise GC is to skip non-pointer fields in an allocated object.
But once you have already loaded the cacheline, each L1 lookup only costs 4 cycles, so it boils down to whether processing the bitmaps is faster than simply loading and checking all the 8-byte aligned pointers. That might indeed be the case when you have many objects that contain only a few pointers (thus can't be NO_SCAN) but also store a lot of POD data, e.g. an RBTree with big values.
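The tradeoff can be sketched in C for illustration (hypothetical helper names; druntime's actual mark loop differs): a conservative scan checks every word of a cache line against the heap bounds, while a precise scan consults a per-word pointer bitmap first.

```c
#include <stdint.h>
#include <stddef.h>

enum { WORDS_PER_LINE = 8 }; /* one 64-byte cache line = 8 words */

/* Conservative: every word is a potential pointer. */
size_t conservative_scan(const uintptr_t *line,
                         uintptr_t min_addr, uintptr_t max_addr)
{
    size_t candidates = 0;
    for (int i = 0; i < WORDS_PER_LINE; ++i)
        if (line[i] >= min_addr && line[i] < max_addr)
            ++candidates; /* would be pushed onto the mark stack */
    return candidates;
}

/* Precise: only words whose bitmap bit is set are considered. */
size_t precise_scan(const uintptr_t *line, uint8_t bitmap,
                    uintptr_t min_addr, uintptr_t max_addr)
{
    size_t candidates = 0;
    for (int i = 0; i < WORDS_PER_LINE; ++i)
        if (((bitmap >> i) & 1) && line[i] >= min_addr && line[i] < max_addr)
            ++candidates;
    return candidates;
}
```

Both loops touch the same cache line, so the question above is exactly whether the extra bitmap test outweighs the candidates it filters out.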

@rainers
Member Author

rainers commented Dec 7, 2014

Almost all false pointers were already caught by the p >= minAddr && p < maxAddr guard.

That's also my experience, but it depends a lot on how much memory is actually used and how much random data is scanned.

Sure false pointers are a slightly bigger problem on 32-bit systems, but now that even mobile phones start to roll out with 64-bit CPUs I wonder how important that still is.

Considering that 32-bit programs are quite a bit faster (e.g. the benchmarks take 43 sec instead of 52 sec for 64-bit on my system), people might still prefer it sometimes.

The other thing you can do with a precise GC is to skip non-pointer field in an allocated object.

Knowing the exact size of an object or array can also help a lot. That can avoid scanning up to half of the allocation block. We might also skip zeroing that area during allocation.

But once you already loaded the cacheline each L1 lookup only costs 4 cycles, so it boils down to whether processing the bitmaps is faster than simply loading and checking all the 8-byte aligned pointers.

Yeah, it's pretty hard to predict what version is better. It's also system dependent. The TypeInfo/RTInfo is probably used quite often so it should also be in L1/L2 cache usually.

@MartinNowak
Member

Yeah, it's pretty hard to predict what version is better. It's also system dependent. The TypeInfo/RTInfo is probably used quite often so it should also be in L1/L2 cache usually.

It will be crucial to store the bitmaps very efficiently.
To deduplicate them, you could mangle them as _d_bitmap_hexorbase64 or somehow pass them as value parameter through a template.
There exist many "real-time" compression algorithms, where real-time here means it's faster to decompress than to read the uncompressed data from memory (see https://github.com/MartinNowak/d-blosc). They wouldn't really work for random access patterns, though; we should try simple stuff like varints.
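For reference, the "simple stuff like varints" could be a plain LEB128-style encoding, sketched here in C (an illustrative format, not anything druntime actually uses): small run lengths or bitmap sizes cost a single byte, larger values grow one byte per 7 bits.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode v as a little-endian base-128 varint; returns bytes written. */
size_t varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0); /* high bit = "more bytes follow" */
    } while (v);
    return n;
}

/* Decode a varint; *len receives the number of bytes consumed. */
uint64_t varint_decode(const uint8_t *in, size_t *len)
{
    uint64_t v = 0;
    int shift = 0;
    size_t n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint64_t)(b & 0x7f) << shift;
        shift += 7;
    } while (b & 0x80);
    *len = n;
    return v;
}
```

Decoding is a tight dependent loop, which is why it only pays off when it saves enough memory traffic.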

@MartinNowak
Member

It might also be interesting to add a statistics option to the GC that reports what kind of data is allocated (currently classes, NO_SCAN and the rest, later also structs and arrays) and the mean and standard deviation of the percentage of pointers in each allocation (grouped by kind and size).
Then we could ask people to run this with their D projects and hopefully collect some relevant data for making better optimization decisions.

@rainers
Member Author

rainers commented Dec 7, 2014

It will be crucial to store the bitmaps very efficiently.

I'm not so sure. Most of the bitmaps will fit into a cache line anyway. Maybe it's even better to store a series of bytes with alternating counts of 1s and 0s so you can easily jump over areas without pointers.

Alternatively, we might just check the bitmap after the if (p >= minAddr && p < maxAddr) check succeeds.
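The alternating-count idea can be sketched in C (an illustrative layout, not this PR's): each byte holds a run length, starting with a run of pointer words, then non-pointer words, and so on, so a scanner can skip whole "0" runs in one step.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand alternating run lengths (first run = pointer words) back
 * into a word-granular bitmap (bit i set = word i is a pointer).
 * Returns the number of words described; assumes <= 64 words. */
size_t decode_runs(const uint8_t *runs, size_t nruns, uint64_t *bitmap)
{
    size_t word = 0;
    int is_ptr_run = 1; /* encoding starts with the "1" run */
    *bitmap = 0;
    for (size_t r = 0; r < nruns; ++r) {
        for (uint8_t i = 0; i < runs[r]; ++i, ++word)
            if (is_ptr_run)
                *bitmap |= (uint64_t)1 << word;
        is_ptr_run = !is_ptr_run;
    }
    return word;
}
```

A mark loop using such runs directly (instead of expanding them) would advance its scan pointer by a whole "0" run at once, which is the "jump easily over areas without pointers" part.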

To deduplicate them, you could mangle them as _d_bitmap_hexorbase64 or somehow pass them as value parameter through a template.

That could also be the task of the linker (identical comdat folding on the RTInfo).

@MartinNowak
Member

I'm not so sure. Most of the bitmaps will fit into a cache line anyway.

Right, but compressed they will fit in fewer cachelines, so you'll cause less cache pollution during GC scanning.

Maybe it's even better to store a series of bytes with alternating "number of 1" and "number of 0" so you can jump easily over areas without pointers.

It will be tricky to come up with an efficient RLE; anyhow, that's fairly advanced and should only be done if necessary, or as an optional optimization step.

To deduplicate them, you could mangle them as _d_bitmap_hexorbase64 or somehow pass them as value parameter through a template.

That could also be the task of the linker (identical comdat folding on the RTInfo).

The deduplication is mandatory IMO and ICF only works with Microsoft's linker (currently disabled due to Issue 10664), so please handle this explicitly.

@MartinNowak
Member

I'm not so sure. Most of the bitmaps will fit into a cache line anyway. Maybe it's even better to store a series of bytes with alternating "number of 1" and "number of 0" so you can jump easily over areas without pointers.

See pp. 8-10 of http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them; that's fairly trivial, efficient, and should be close to optimal (though we don't know the distribution).
I think we should implement this, maybe first getting rid of the related CTFE allocations in the compiler.
Only needed when the bitmap would get bigger than 1 byte though.

Member

32 bytes for each type? The bitmap will only be 1 byte in many cases.

Member

Why is this size_t and not ubyte? How many different mark functions do you have?
Maybe it's better to turn this into a ubyte* to [MarkType, length (varint), bitmap bytes] or [MarkType, bitmap bytes, sentinel].
We currently have about 8000 types in the unittest build of libphobos and 2000 in the release one; that's quite a number of relocations to add. It would be great to use non-global symbols here (visibility hidden) to avoid relocations during loading, but that's currently not possible in D.

Member Author

32 bytes for each type? The bitmap will only be 1 byte in many cases.

Even if the bitmap needs only 1 byte, alignment will make it 16 bytes. I chose size_t[] because of the handling in markPrecise, but that could be changed depending on what's faster.

Member Author

Why is this size_t and not ubyte? How many different mark functions do you have?

There are currently 3: default, dynamic array, assoc array. But as there is no "emplace" in this version of the GC, I'd like to allow the user to specify a mark delegate to scan void[N] fields precisely according to emplaced types.

Member

ubyte has an alignment of 1 byte.
Well, then you could have a fourth type meaning user callback; that would be the only one requiring alignment.

@rainers
Member Author

rainers commented Dec 7, 2014

See pp. 8-10 of http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them; that's fairly trivial, efficient, and should be close to optimal (though we don't know the distribution).
I think we should implement this, maybe first getting rid of the related CTFE allocations in the compiler.

One other option would be to generate a mark function per type via string mixins. That's an idea that @andralex wanted to pursue. It will wreck compilation speed, though.

I currently suspect that most of the performance in the benchmark is lost to the many indirections when marking arrays of strings, which iterate over small types (char[] = { length, ptr }).

@rainers
Member Author

rainers commented Dec 12, 2014

While working on the (array of) struct destructors I realized that the array marking needs at least 2 versions: one for the array layout in lifetime.d, and one for plain arrays as created by std.array.Appender.
I'm not sure how this can be achieved by a single type info for T[], though.

Maybe RTInfo!T should be evaluated in the scope of the type to allow specialization. That could open some options to setup specific mark functions without implementing everything in the runtime.

@schveiguy
Member

why should the runtime be concerned about a phobos type's misuse of the runtime?

@rainers
Member Author

rainers commented Dec 12, 2014

why should the runtime be concerned about a phobos type's misuse of the runtime?

If it is a common idiom (e.g. Appender) to allocate arrays of objects like this, we need a way to specify how to scan it. It might also be C malloced memory added as a gc-range.

Even if I wouldn't care if Appender is dumped, I still think that if precise scanning is done through the TypeInfo, a struct/class must have the ability to specify how a class emplaced into a field has to be scanned. The other precise GC does this by modifying the pointer bitmap stored by the GC, this version needs to specify a mark callback function.

@schveiguy
Member

OK, as long as the marking function is defined and maintained in Phobos, that sounds correct.

@Orvid
Contributor

Orvid commented Jan 13, 2015

As far as bitmap compression goes, I will say that there is nothing, as far as I know, in the spec that prevents the compiler from re-ordering the fields of a non-decorated (non-extern(C|C++|Windows|whatever)) type.

@MartinNowak MartinNowak added the GC garbage collector label Mar 7, 2015
@MartinNowak
Member

Because the discussion came up again, a summary of my points:

  • False pointers are not an issue on 64-bit platforms.
  • Precise GC isn't per se useful unless you figure out how to use it to improve marking speed.
    Take ubyte[][] as an extreme example: even though you only have to look at every 2nd word, you'll have a hard time reaching the same speed as a linear scan when encoding the mark pattern as metadata.
  • Bit patterns must be deduplicated and stored in a dense format, b/c marking is already completely bound by the speed of the memory bus.
  • Indirect function calls (type-specific mark functions) will cause plenty of code cache misses (b/c the jumps aren't predictable) and increase mark latency.
  • Even w/ bigger types (more than the 64 bytes of one cacheline) the prefetcher might get the next cacheline before you are able to figure out that you don't need to scan it.
  • For types w/ very big internal buffers (ubyte[4096] buf) it'd be better to teach people that using an external ubyte[] buf = new ubyte[](4096) is faster for the GC to mark.
  • We should still invest some more time to separate data with pointers from NO_SCAN data, e.g. by separating data sections (see https://trello.com/c/Dsdxd1r3/56-ptr-data-section).

I think there is lower-hanging fruit to speed up the GC, e.g. background/parallel finalization, a vectorized (SIMD) loop in mark.

@andralex
Member

andralex commented Apr 3, 2016

A few answers interspersed with as many questions (I'm not very familiar with the current implementation).

Because the discussion came up again, a summary of my points:

  • False pointers are not an issue on 64-bit platforms.

Indeed I could not find much information about that. Searching for quoted "false pointers" yields little data (a subset of which points to discussions within the D community), because so few languages have them! I fear this might be a political issue - we're the only language that sticks with conservative GC when all other languages and systems use precise GCs and build upon them.

  • Precise GC isn't per se useful unless you figure out how to use it to improve marking speed.
    Take ubyte[][] as an extreme example: even though you only have to look at every 2nd word, you'll have a hard time reaching the same speed as a linear scan when encoding the mark pattern as metadata.

A more interesting case is structures with pointers on one cache line and no pointers on the next one. Consider:

struct Customer {
    string lastname;
    string firstname;
    string address;
    string zipcode;
    int id;   
    char[8] ssn;
    ...
}

An array of Customer has 64 bytes worth of pointers per item and then some non-pointer data, say 64 bytes or more. In that case only a fraction of the cache lines need to be looked at. Would this save on scanning? How does that interact with prefetching? I'm not sure we can just make the blanket assumption that in all cases brute linear scanning wins.
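This saving can be sketched in C under an assumed layout of one bitmap byte per 64-byte cache line (purely illustrative, not druntime's format): for Customer, the four string fields give the first line the bitmap 0xAA (pointers at the odd words, since a D string is { length, ptr }), while the id/ssn line gets bitmap 0x00 and is skipped without its memory being read.

```c
#include <stdint.h>
#include <stddef.h>

/* obj: the object's words; line_bitmaps[l]: one byte per 64-byte cache
 * line, bit w set iff word l*8+w holds a pointer. */
size_t scan_object(const uintptr_t *obj, const uint8_t *line_bitmaps,
                   size_t nlines, uintptr_t min_addr, uintptr_t max_addr)
{
    size_t candidates = 0;
    for (size_t l = 0; l < nlines; ++l) {
        uint8_t bits = line_bitmaps[l];
        if (bits == 0)
            continue; /* pointer-free cache line: its memory is never loaded */
        for (int w = 0; w < 8; ++w)
            if ((bits >> w) & 1) {
                uintptr_t p = obj[l * 8 + w];
                if (p >= min_addr && p < max_addr)
                    ++candidates;
            }
    }
    return candidates;
}
```

Whether the skipped loads actually stay skipped is exactly the prefetching question raised above.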

Another question: what is the cost per word to decide whether it is a real pointer or a false pointer?

  • Bitpatterns must be deduplicated and stored in a dense format, b/c marking is already completely bound by the speed of the memory bus.

I agree bit patterns are not the best way to go about characterizing data shapes. We've discussed in the past the variant of generating a mark function per allocated type. So instead of bit patterns we have specialized code that knows how to scan types (and arrays thereof) transitively.

  • Indirect functions calls (type-specific mark functions) will cause plenty of code cache misses (b/c the jumps aren't predictable) and increase mark latency.

In the scheme sketched above, the only indirect calls are for roots. All others are direct calls because the functions are generated with full type information.

  • Even w/ bigger types (more than the 64bytes of one cacheline) the prefetcher might get the next cacheline before you are able to figure out that you don't need to scan it.

At some point surely prefetching will become slower than loading a fraction of the pages. I think we'd need to test this. Until then I think it's difficult to base decisions on it.

  • For types w/ very big internal buffers (ubyte[4096] buf) it'd be better to teach people, that using an external ubyte[] buf = new ubyte[](4096) is faster for the GC to mark.

Historically, framing things as education issues is problematic. Also, data schemata often grow by accretion: one member here, one member there, and each member grows in size as features are added to the code. I don't think we can count this as a strong argument.

That'd be great indeed, but has a hard limit on what you can do. One theme in precise GC is that there are more levers to improving it than conservative.

I think there are lower hanging fruits to speed up the GC, e.g. background/parallel finalization, vectorized (SIMD) loop in mark.

For the most part these don't compete with precision and would improve either approach.

@LightBender
Contributor

Erm, the point about false pointers is factually incorrect. While yes, in theory, a 64-bit address space should "solve" the false pointer problem, the reality is, as always, a little more complicated. There was a recent thread on the forum by someone doing big-data work. He ran into a false pointer problem in a 64-bit app. http://forum.dlang.org/thread/mafpzsuiuhzlriixymvy@forum.dlang.org

My guess is that while the OS allocator could randomly map addresses inside the virtual address space, it does not. I strongly suspect that the behavior is more along the lines of: find free space in physical memory, then map the address to the next available block of contiguous free space in the virtual address space. This would necessarily involve mostly filling up the 32-bit sized space before spilling over into 64-bit space.

As we move into an era of larger data-sets and the need to accurately work on them, I fear that the mantra of "64-bit solves false pointers!" is going to be proven false at an ever increasing rate.

@DemiMarie

A MAJOR win for precise or mostly-precise GC is the ability to compact all (resp. most) of the heap. This eliminates (resp. dramatically reduces) fragmentation.

@DemiMarie

Also, the precise GC has already been written – it is probably a good idea to use it!

@rainers
Member Author

rainers commented Apr 3, 2016

False pointers are not an issue on 64-bit platforms.

I don't really buy this. Just adding std.datetime to the parsed files of the vdparser test shows a varying process memory usage of 400 to 900 MB for the x64 benchmark under win64 (address randomization is likely to cause more trouble than the usual addresses under linux). That doesn't happen with the precise GC.

I agree with most of what Andrei said, but this part:

So instead of bit patterns we have specialized code that knows how to scan types (and arrays thereof) transitively. [...]
In the scheme sketched above, the only indirect calls are for roots. All others are direct calls because the functions are generated with full type information.

I don't think transitivity works in D: it moves the memory type description from the allocation to the referrer, but that information is incomplete (a class reference doesn't know about derived classes) and ambiguous (multiple referrers might use different pointer types, especially in system code). You have to keep the type information for every block anyway (in case it becomes a root), but tracking whether a transitive scan covered the actual allocation type seems like duplicate effort.

Please note that the implementation in this PR is rather slow due to the number of indirections. Converting the scan function into CTFE generated code (making compilation even slower than the bitmap generation used to be before converting it into a trait) just reduces the number of indirections by one.

I still prefer #1022, but it might need some optimizations for saving the pointer bitmap during allocation. Tuning these with dmd's weak optimizations was frustrating, though. Large arrays might use a simple scheme to save the pattern just once.

@andralex
Member

andralex commented Apr 3, 2016

I don't think transitivity works in D: it moves the memory type description from the allocation to the referrer, but that information is incomplete (a class reference doesn't know about derived classes) and ambiguous (multiple referrers might use different pointer types, especially in system code).

Interesting re class references - so for classes you'd need an indirect call (including a hidden virtual method). Regarding storing different pointer types, what cases do you have in mind? For void* an indirect call would be needed, but otherwise I think we can assume fairly strict typing rules.

You have to keep the type information for every block anyway (in case it becomes a root), but tracking whether a transitive scan covered the actual allocation type seems duplicate effort.

Please note that the implementation in this PR is rather slow due to the number of indirections. Converting the scan function into CTFE generated code (making compilation even slower than the bitmap generation used to be before converting it into a trait) just reduces the number of indirections by one.

Couldn't that be significant?

@rainers
Member Author

rainers commented Apr 4, 2016

Regarding storing different pointer types, what cases do you have in mind? For void* an indirect call would be needed, but otherwise I think we can assume fairly strict typing rules.

The void* pointer was my main concern, too. Unions of different pointer types are similar. Most other pointer casts are just temporary and might only become a problem if the stack is scanned more precisely.

Other problems:

  • how do you deal with internal pointers? With pointers to fields you cannot be sure that the referring type actually describes the full object.
  • similar: slices might only refer to a part of an array, how do you keep track of which parts are already scanned?
  • the AA implementation stores two objects (key + value) into one allocation. Currently, there is no proper type info for this pair, that's why it has to be dealt with at runtime.

@rainers
Member Author

rainers commented Apr 4, 2016

just reduces the number of indirections by one.
Couldn't that be significant?

I'm not sure. IIRC it was also painful to have to call back into the GC to follow references. It could help if some of this can be inlined, but it would disallow swapping the GC without recompilation.

If there is interest, I can try to rebase this PR for some experiments.

@MartinNowak
Member

There is a huge difference between false pointers (as in a random number pinning an array) and stale pointers on the stack/in registers. The latter is a real problem even on 64-bit, but this PR doesn't solve it.

Another question: what is the cost per word to decide whether it is a real pointer or a false pointer?

Mostly a single comparison, https://github.com/D-Programming-Language/druntime/blob/a83323439398a31126a33043bdd6c578bbc46433/src/gc/gc.d#L2022-L2027.
The percentage of false pointers passing p >= minAddr && p < maxAddr was way below 1% when I tried to add a bloom filter to reduce false positives.
The 2nd hurdle (findPool) takes just a few cycles to find the exact pool.
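These two filters can be sketched in C (hypothetical data layout; the real range check and findPool live in gc.d): a cheap bounds test rejects almost every false pointer, and a short binary search over sorted pool ranges settles the rest.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uintptr_t base, top; } Pool; /* [base, top) */

/* Returns the pool containing p, or NULL if p can't be a heap pointer.
 * pools must be sorted by base address and non-overlapping. */
const Pool *find_pool(const Pool *pools, size_t npools, uintptr_t p,
                      uintptr_t min_addr, uintptr_t max_addr)
{
    if (p < min_addr || p >= max_addr)
        return NULL;               /* filters out almost all false pointers */
    size_t lo = 0, hi = npools;
    while (lo < hi) {              /* ~log2(npools) iterations */
        size_t mid = lo + (hi - lo) / 2;
        if (p < pools[mid].base)
            hi = mid;
        else if (p >= pools[mid].top)
            lo = mid + 1;
        else
            return &pools[mid];
    }
    return NULL;                   /* within bounds but between pools */
}
```

The remaining false positives are values that land inside a live pool, which is where precise type information would still help.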

I agree bit patterns are not the best way to go about characterizing data shapes. We've discussed in the past the variant with generating a mark function per allocated type. So instead of bit patterns we have specialized code that knows how to scans types (and arrays thereof) transitively.

Well, code will require even more memory than bit patterns, and requires the same deduplication.
Given that pointers are 8-byte aligned, a bit pattern requires only 1 byte per 64-byte cacheline.

For the most part these don't compete with precision and would improve either approach.

Yes, but we have limited resources, with which this does compete.

My guess is ... necessarily involve mostly filling up the 32-bit sized space before spilling over into 64-bit space.

Let's please keep guesswork out of discussions, it's just wasting the time of everyone involved.

@MartinNowak
Member

A MAJOR win for precise or mostly-precise GC is the ability to compact all (resp. most) of the heap. This eliminates (resp. dramatically reduces) fragmentation.

Yes, compaction is interesting. And fully solving the problem of precise register/stack marking makes this much more interesting. Just precise heap/root marking alone isn't that interesting.

@MartinNowak
Member

An array of Customer has 64 bytes worth of pointers per item and then some non-pointer data, say 64 bytes or more. In that case only a fraction of the cache lines need to be looked at. Would this save on scanning? How does that interact with prefetching? I'm not sure we can just make the blanket assumption that in all cases brute linear scanning wins.

We certainly can't, but you can easily craft a benchmark to test this (https://github.com/D-Programming-Language/druntime/blob/a83323439398a31126a33043bdd6c578bbc46433/benchmark/gcbench/tree1.d). Also the question is not only what you win for some types, but how much you'd lose for others.

@DemiMarie

@MartinNowak JavaScriptCore and Steel Bank Common Lisp use a mostly-copying collector, where objects that are pointed to only by other heap objects (the vast majority) can be moved, while objects that are pointed to by registers or the stack are pinned. This provides good performance in practice. So precise heap/root marking is already a big win, even if the stack and roots are still marked conservatively. This is also shown by the dramatic reduction in memory leaks on 32-bit systems.

Also, I think that fully precise GC might actually be feasible, at least for LDC. LLVM already supports precise, moving collectors thanks to work done at Azul Systems, which is being used by an LLVM backend for CoreCLR that is intended to become production-quality at some point. GDC and DMD will need more work, though, unless one wants to use a shadow stack (don't – slow).

@dlang-bot dlang-bot added Needs Rebase needs a `git rebase` performed stalled Needs Work labels Jan 1, 2018
@rainers
Member Author

rainers commented Sep 28, 2019

Outdated and unlikely to perform reasonably fast, closing.

@rainers rainers closed this Sep 28, 2019

Labels

GC (garbage collector), Needs Rebase (needs a `git rebase` performed), Needs Work, stalled


10 participants