This repository was archived by the owner on Oct 12, 2022. It is now read-only.

Conversation

@rainers (Member) commented Nov 14, 2014

This is the precise GC as presented at DConf 2013. It generates a pointer bitmap for each type T via RTInfo!T and stores the bitmap alongside the pool memory, so that during a collection only words known to hold pointers are followed.

RTInfo generation isn't working properly at the moment (it needs dlang/dmd#3958 or dlang/dmd#2480), but falling back to the compiler default just means some types are scanned conservatively instead of precisely.

Some notes:

  • only the heap is scanned precisely, not stacks or the data/TLS segments
  • compilation is slower: I measured 3:46min for the "quick" test suite on Win32 instead of 3:26 for master (we might need a special trait to avoid a lot of CTFE)
  • execution is slower: the allocation-heavy druntime benchmarks take 58 s instead of 51 s (unfortunately there are two list benchmarks which don't give reliable timings).
  • associative arrays need a dirty hack to support precise scanning of the key/value pairs in Entry: the TypeInfo for that is generated at runtime and stored in the Impl struct of each instance.
  • dynamic arrays need to "re-emplace" their type-info pointer bitmaps for large arrays, because data is "sometimes" placed at an offset of 16 bytes.

I'm opening this as a pull request to verify building on other platforms, as well as to make it more publicly visible and open for review/discussion.

I plan to create another version that stores the type info per allocation block and uses it during scanning. This avoids the troubles with arrays, but will not allow reassigning new types to parts of a memory block (e.g. when using std.conv.emplace on fields of a class).

@rainers (Member, Author) commented Nov 16, 2014

@rainers (Member, Author) commented Dec 1, 2014

I've tried to reduce the additional compilation time by replacing some CTFE code with cacheable template instances. For the dmd test suite this brought the gap down from 20 seconds to about 15 seconds.

Considering that the test suite spends most of its time not compiling, I benchmarked building Visual D: the debug version takes about 16% longer (22 seconds instead of 19); the release version is even worse at 23% (43 seconds instead of 35). Please note that Visual D imports most of the Windows SDK and the full Visual Studio SDK, so it contains a lot of struct definitions and interfaces.

I guess for this to be acceptable we must make the pointer bitmap a compiler trait.

The binary size overhead of the release version was about 0.5% (2491kB instead of 2480kB).

@rainers (Member, Author) commented Dec 5, 2014

I guess for this to be acceptable we must make the pointer bitmap a compiler trait.

I have implemented the trait here: dlang/dmd#4192
My tests don't show any measurable loss of compilation speed when using the trait, plus it avoids troubles when using Unqual on shared(T[N]).

@rainers (Member, Author) commented Dec 5, 2014

I plan to create another version that stores type info per allocation block and use this during scanning.

This is now available here: #1057

@MartinNowak (Member):

unfortunately there are 2 list-benchmark which don't give reliable timings

Use avgtime and take the minima of 10 or 100 runs, that works because noise in benchmarks is mostly additive.

@rainers (Member, Author) commented Dec 7, 2014

Use avgtime and take the minima of 10 or 100 runs, that works because noise in benchmarks is mostly additive.

I use it sometimes, but I'm unsure whether a sole minimum is very representative (on a mobile processor, where a lot of clocking magic goes on). If there are false pointers, they sometimes don't go away by just rerunning the same program. When comparing precise to imprecise scanning, you might even want to see the effects of false pointers ;-)

At the moment, I don't see these shaky results, though.

@rainers (Member, Author) commented Jan 22, 2015

Updated to recent changes of the GC. With the getPointerBitmap trait, there is no longer a compile time penalty, but it's still 0-20% slower in the benchmarks, mostly during allocation. The benchmarks don't suffer much from false pointers, though.

Member

Twice as big and can't be returned in registers, that looks expensive.

Member

Might be better to use a second stack for the pools in parallel.

Member Author

I was hoping for any function calls dealing with ScanRange to be inlined, though I didn't check yet.

Member

Or you might try to pack the struct and change passing and return to Range + Pool*, using ref return for pop.

Member Author

I checked the disassembly: it's all inlined AFAICT. So the struct needs 50% more memory (there is no padding), but there is probably not a lot we can do here. The non-precise version might use the Range struct though, so we could templatize ToScanStack on that type.
Indexing might be a little more expensive than with a struct size of 8/16 bytes, so we could also walk the stack with a pointer.

@MartinNowak (Member):

I'll have a look at the benchmarks, we definitely need to get this fast enough.

src/gc/gc.d Outdated
Member

You should move this test to before p1 is dereferenced. If something isn't a pointer we should perform as little work as possible. This loop is key to make this thing fast.

Member Author

The previous versions had some optimizations in this respect, but I wanted to start with minimal changes. The benchmarks don't show a large increase in scanning time; the performance decrease is caused by setting the pointer bitmaps.

Member

But it might be possible to make up for this by improving the marking time.

Member Author

I just tried it: without further optimizations, moving the code above p = *p1 slows down marking by up to 10%. (Win32 on i7 mobile)

src/rt/aaA.d Outdated
Member Author

This change is unrelated, but the current version causes a range error when building druntime without -release.
The return value of _d_newarrayU is confusing, as it actually returns the length for T[], not void[]. It should rather just return void*.

@MartinNowak (Member):

Delay until we have an idea how to get this fast, e.g. 2.068?

@rainers (Member, Author) commented Feb 11, 2015

Delay until we have an idea how to get this fast, e.g. 2.068?

Yes. I'm also waiting for either dlang/dmd#2480 or dlang/dmd#3958 to make some progress. This GC implementation still works with the incomplete RTInfo by making conservative assumptions, though.

@MartinNowak MartinNowak modified the milestones: 2.068, 2.067 Feb 12, 2015
@MartinNowak MartinNowak removed this from the 2.068 milestone Jun 20, 2015
@rainers rainers force-pushed the gc_precise_nov14 branch from 8b4cc04 to 7603ff8 on July 14, 2015
@rainers (Member, Author) commented Jul 14, 2015

Rebased.

@rainers rainers force-pushed the gc_precise_nov14 branch 3 times, most recently from b4cf9fe to 24de144 on July 19, 2015
@DemiMarie:

What is the status of this? Will this be merged?

@rainers (Member, Author) commented Nov 8, 2015

Rebased in preparation of the PR's first anniversary.

@PetarKirov (Member):

@rainers what's the status of this PR? Are there any serious showstoppers before this can be merged, or is it only the performance that you want to improve?

@rainers (Member, Author) commented Nov 8, 2015

Are there any serious showstoppers before this can be merged, or is it only the performance that you want to improve?

AFAICT this is good enough to be included. @MartinNowak has some reservations regarding performance, but I don't see a faster solution in the near future. I tried some microtuning in the past, but that is rather frustrating when done against the dmd backend.

Before actually merging I'd make it opt-in rather than opt-out. Precise scanning is enabled in this PR so that the auto-tester actually runs the new code.

What's missing is some documentation (some notes on what has to be done to get your manually managed memory scanned precisely), but I'd rather postpone writing it until this PR has a good chance of being merged.

@rainers (Member, Author) commented Mar 23, 2016

Rebased. Needs dlang/dmd#5566 to build phobos unittests, though.

@DemiMarie:

I would love to see this. Next step is incremental concurrent collection.

@MartinNowak (Member):

Next step is incremental concurrent collection.

It is unlikely that we'll add write barriers, so we can't use classical techniques (e.g. generational) for incremental collection.

@rainers (Member, Author) commented Aug 18, 2017

Closing in favor of #1603


Labels

GC garbage collector
