[WIP] precise GC as presented at DConf 2013 #1022
Conversation
The tests are currently not passing due to https://issues.dlang.org/show_bug.cgi?id=13738 (dlang/dmd#4147) and https://issues.dlang.org/show_bug.cgi?id=13736 (dlang/dmd#4143).
I've tried to reduce the additional compilation time by replacing some CTFE code with cacheable template instances. For the dmd test suite this brought the gap down from 20 seconds to about 15 seconds. Considering that the test suite spends most of its time not compiling, I benchmarked building Visual D: the debug version takes about 16% longer (22 seconds instead of 19), and the release version is even worse at 23% (43 seconds instead of 35). Please note that Visual D imports most of the Windows SDK and the full Visual Studio SDK, so it contains a lot of struct definitions and interfaces. I guess for this to be acceptable we must make the pointer bitmap a compiler trait. The binary size overhead of the release version was about 0.5% (2491 kB instead of 2480 kB).
I have implemented the trait here: dlang/dmd#4192
This is now available here: #1057
Use avgtime and take the minimum of 10 or 100 runs; that works because noise in benchmarks is mostly additive.
I use it sometimes, but I'm unsure about a sole minimum being very representative (on a mobile processor, where a lot of magic goes on). If there are false pointers, they sometimes don't go away by just rerunning the same program. When comparing precise to imprecise scanning, you might even want to see the effects of false pointers ;-) At the moment, I don't see these shaky results, though.
Updated to recent changes of the GC. With the getPointerBitmap trait, there is no longer a compile time penalty, but it's still 0-20% slower in the benchmarks, mostly during allocation. The benchmarks don't suffer much from false pointers, though.
Twice as big and can't be returned in registers, that looks expensive.
Might be better to use a second stack for the pools in parallel.
I was hoping for any function calls dealing with ScanRange to be inlined, though I didn't check yet.
Or you might try to pack the struct and change passing and return to Range + Pool*, using ref return for pop.
I checked the disassembly: it's all inlined AFAICT. So the struct needs 50% more memory (there is no padding), but there is probably not a lot we can do here. The non-precise version might use the Range struct though, so we could templatize ToScanStack on that type.
Indexing might be a little more expensive than with a struct size of 8/16, so we could also use a pointer to walk the stack.
I'll have a look at the benchmarks; we definitely need to get this fast enough.
src/gc/gc.d
Outdated
You should move this test to before p1 is dereferenced. If something isn't a pointer we should perform as little work as possible. This loop is key to make this thing fast.
The previous versions had some optimization in this respect, but I wanted to make minimal changes to begin with. The benchmarks don't show a large increase in scanning time; the performance decrease is caused by setting the pointer bitmaps.
But it might be possible to make up for this by improving the marking time.
I just tried it: without further optimizations, moving the code above p = *p1 slows down marking by up to 10%. (Win32 on i7 mobile)
src/rt/aaA.d
Outdated
This change is unrelated, but the current version causes a range error when building druntime without -release.
The return value of _d_newarrayU is confusing, as it actually returns the length for T[], not void[]. It should rather just return void*.
Delay until we have an idea how to get this fast, e.g. 2.068?
Yes. I'm also waiting for either dlang/dmd#2480 or dlang/dmd#3958 to make some progress. This GC implementation still works with the incomplete RTInfo by making conservative assumptions, though.
Rebased.
What is the status of this? Will this be merged?
Rebased in preparation for the PR's first anniversary.
@rainers what's the status of this PR? Are there any serious showstoppers before this can be merged, or is it only the performance that you want to improve?
AFAICT this is good enough to be included. @MartinNowak has some reservations regarding performance, but I don't see a faster solution in the near future. I tried some microtuning in the past, but that is rather frustrating when done against the dmd backend. Before actually merging I'd make it opt-in rather than opt-out. Precise scanning is enabled in this PR so that the auto-tester actually runs the new code. What's missing is some documentation (some notes on what has to be done to get your manually managed memory scanned precisely), but I'd rather postpone writing it until this PR has a good chance of being merged.
Rebased. Needs dlang/dmd#5566 to build phobos unittests, though.
fix copyRangeRepeated optimization
I would love to see this. Next step is incremental concurrent collection.
It is unlikely that we'll add write barriers, so we can't use classical techniques (e.g. generational) for incremental collection. |
Closing in favor of #1603
This is the precise GC as presented at DConf 2013. It generates pointer bitmaps for each type T using RTInfo!T and stores a bitmap alongside pool memory to use for following only pointer references.
RTInfo generation isn't working properly at the moment (needs dlang/dmd#3958 or dlang/dmd#2480), but using the compiler default will just scan some types conservatively instead of precisely.
Some notes:
I'm opening this as a pull request to verify building on other platforms as well as to make it more publicly visible and open for review/discussion.
I plan to create another version that stores type info per allocation block and uses it during scanning. This can avoid the trouble with arrays, but will not allow reassigning new types to parts of memory blocks (e.g. when using std.emplace on fields of a class).