Skip to content
This repository has been archived by the owner on Oct 12, 2022. It is now read-only.

Rework of -profile=gc calculations #1846

Conversation

mihails-strasuns-sociomantic
Copy link
Contributor

Another attempt for #1806, now using
thread-local counter inside GC to circumvent sync issues.

@mihails-strasuns-sociomantic
Copy link
Contributor Author

(going to abuse auto-tester a bit)

@mihails-strasuns-sociomantic mihails-strasuns-sociomantic force-pushed the new-gc-profile branch 8 times, most recently from 68324f6 to 91b7fb9 Compare June 19, 2017 17:04
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the small comments and the non-in-depth review of the code generating bits. LGTM. It is much more sensible than the previous implementation.

@@ -529,7 +529,7 @@ class ConservativeGC : GC
}

gcx.log_malloc(p, size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some distant future it might be good to gcx.log_malloc alloc_size too...

src/rt/tracegc.d Outdated
extern (C) void* _d_arrayliteralTX(const TypeInfo ti, size_t length);
extern (C) void* _d_assocarrayliteralTX(const TypeInfo_AssociativeArray ti, void[] keys, void[] vals);
mixin(generateTraceWrappers());
//pragma(msg, generateTraceWrappers());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess commented pragmas should be gone.

src/rt/tracegc.d Outdated
accumulate(file, line, funcname, "closure", sz);
return _d_allocmemory(sz);
void foo(int x, double y) { }
static assert (Arguments!foo == "x, y, ");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I got completely lost with all the code generation. I will let someone else more used to black magic to review it.

@leandro-lucarella-sociomantic
Copy link
Contributor

Anyone interested in having a look at this? Maybe @MartinNowak ?

@mihails-strasuns-sociomantic
Copy link
Contributor Author

FYI: current state is "works like a charm for me but I haven't found time to rework how test suite runs so that it can pass on all platforms with more extensive tests added"

@leandro-lucarella-sociomantic
Copy link
Contributor

I think this is quite important, the current profiling information given by the GC is pretty much garbage, and giving wrong information could be more dangerous than not giving any information at all. So any help here is very appreciated.

Also, besides the currently not ideal testing situation and failure on some platforms, it would be nice to know if the general solution looks good for other druntime maintainers.

ping @dlang/team-druntime

@ibuclaw
Copy link
Member

ibuclaw commented Sep 15, 2017

High level design seems to reasonable to me.

@leandro-lucarella-sociomantic
Copy link
Contributor

I just had a look at the results for FreeBSD 32 and it's just an ordering issue. I guess something is wrong (or incomplete) with b64ac1f.

According to the auto-tester, the BSD 32 results are:

bytes allocated, allocations, type, function, file:line
            176	              1	immutable(char)[][int] D main src/profilegc.d:34
            128	              1	float[][] D main src/profilegc.d:18
            128	              1	int[][] D main src/profilegc.d:15
             64	              1	float[] D main src/profilegc.d:53
             64	              1	int[] D main src/profilegc.d:52
             16	              1	char[] D main src/profilegc.d:47
             16	              1	int[] D main src/profilegc.d:14
             16	              1	int[] D main src/profilegc.d:48
             16	              1	char[] D main src/profilegc.d:45
             16	              1	float D main src/profilegc.d:16
             16	              1	int D main src/profilegc.d:13
             16	              1	float[] D main src/profilegc.d:17
             16	              1	closure profilegc.main.foo src/profilegc.d:56
             16	              1	int[] D main src/profilegc.d:33
             16	              1	profilegc.main.C D main src/profilegc.d:12
             16	              1	wchar[] D main src/profilegc.d:46

And the one in the test case:

bytes allocated, allocations, type, function, file:line
            176	              1	immutable(char)[][int] D main src/profilegc.d:34
            128	              1	float[][] D main src/profilegc.d:18
            128	              1	int[][] D main src/profilegc.d:15
             64	              1	float[] D main src/profilegc.d:53
             64	              1	int[] D main src/profilegc.d:52
             16	              1	profilegc.main.C D main src/profilegc.d:12
             16	              1	char[] D main src/profilegc.d:45
             16	              1	char[] D main src/profilegc.d:47
             16	              1	closure profilegc.main.foo src/profilegc.d:56
             16	              1	float D main src/profilegc.d:16
             16	              1	float[] D main src/profilegc.d:17
             16	              1	int D main src/profilegc.d:13
             16	              1	int[] D main src/profilegc.d:14
             16	              1	int[] D main src/profilegc.d:33
             16	              1	int[] D main src/profilegc.d:48
             16	              1	wchar[] D main src/profilegc.d:46

Neither seem to be sorted by "name" for the 16 "allocated" values.

@leandro-lucarella-sociomantic
Copy link
Contributor

Just for the records, the correct order should be:

bytes allocated, allocations, type, function, file:line
            176	              1	immutable(char)[][int] D main src/profilegc.d:34
            128	              1	float[][] D main src/profilegc.d:18
            128	              1	int[][] D main src/profilegc.d:15
             64	              1	float[] D main src/profilegc.d:53
             64	              1	int[] D main src/profilegc.d:52
             16	              1	char[] D main src/profilegc.d:45
             16	              1	char[] D main src/profilegc.d:47
             16	              1	closure profilegc.main.foo src/profilegc.d:56
             16	              1	float D main src/profilegc.d:16
             16	              1	float[] D main src/profilegc.d:17
             16	              1	int D main src/profilegc.d:13
             16	              1	int[] D main src/profilegc.d:14
             16	              1	int[] D main src/profilegc.d:33
             16	              1	int[] D main src/profilegc.d:48
             16	              1	profilegc.main.C D main src/profilegc.d:12
             16	              1	wchar[] D main src/profilegc.d:46

@leandro-lucarella-sociomantic
Copy link
Contributor

Having a quick look, test/profile/myprofilegc.log.linux.32.exp, test/profile/myprofilegc.log.osx.32.exp results are also broken (bad ordering), so it seems like everything 32bit is broken.

But then I saw a pattern, so I think we have 2 problems here:

  1. All 32 bit results have the 16 1 profilegc.main.C D main src/profilegc.d:12 misplaced, I guess you fabricated these test results by changing the bytes in the 64 bit results without re-sorting, because in 64 bit that is 32 bytes, so it's above the 16 bytes records.

  2. Well, the FreeBSD actual results are wrongly sorted too, so even fixing the expected results won't fix the problem, I don't know why the sorting is not working for FreeBSD.

@leandro-lucarella-sociomantic
Copy link
Contributor

OK, checking:

  1. Linux_64_32 fails because the linux.32.exp file is wrong (as I described before, probably because of a copy&paste error).
  2. Linux_32 fails because the linux.32.exp file is wrong (as I described before, probably because of a copy&paste error).
  3. FreeBSD_32 fails because the freebsd.32.exp file is wrong AND because the actual results gotten by the test run are also wrong (bad ordering).
  4. FreeBSD_64_64 fails because the actual results gotten by the test run are also wrong (bad ordering). The line 16 1 closure profilegc.main.foo src/profilegc.d:56 is misplaced and the 16 1 float D main src/profilegc.d:16 too.
  5. Darwin_64_32 I don't know because I can't see the logs.
  6. Darwin_64_64 I don't know because I can't see the logs.

@rainers
Copy link
Member

rainers commented Oct 31, 2017

Instead of relying on the rather inaccurate callbacks by the compiler (e.g. anything allocated in the precompiled runtime will not be detected), I recently used a proxy GC that just forwards the interface calls to the actual GC. It records allocation information in malloc/calloc/qalloc calls (including the call stack which can then be used to determine the call location). This also avoids modifing the GC interface and implementation.

@leandro-lucarella-sociomantic
Copy link
Contributor

Instead of relying on the rather inaccurate callbacks by the compiler (e.g. anything allocated in the precompiled runtime will not be detected), I recently used a proxy GC that just forwards the interface calls to the actual GC. It records allocation information in malloc/calloc/qalloc calls (including the call stack which can then be used to determine the call location). This also avoids modifing the GC interface and implementation.

But can you really know exactly how much you allocated at that level of abstraction? How would you do that, just querying the GC to know what's the real capacity of the reserved block?

@rainers
Copy link
Member

rainers commented Nov 2, 2017

But can you really know exactly how much you allocated at that level of abstraction? How would you do that, just querying the GC to know what's the real capacity of the reserved block?

I guess you are referring to the difference between requested allocation size, returned size and actually used size, e.g. in array code in rt.lifetime. With intercepting the GC interface, you know the requested and allocated memory precisely (using qalloc), but can only estimate how much of it is in actual use.

The array interface functions yield some other information, but without knowing the exact implementation in rt.lifetime, calculating memory usage is quite some guesswork. For example, how do you interpret setlength? Will it allocate new memory, extend existing memory without allocation, or will it do nothing? How much memory will a new allocation actually use?

@mihails-strasuns-sociomantic
Copy link
Contributor Author

For our purpose it does not matter how much exactly is used - only impact it makes on total memory usage by GC. With current implementation setlength reallocation will be reported as full size allocation and extending in place - only for the extra size.

Uncoincidentally, setlengh case was the primary motivating reason why we had to reimplement this feature because we reset lengths of arrays to 0 all the time to reuse memory.

@mihails-strasuns-sociomantic
Copy link
Contributor Author

I kind of like the idea of using proxy GC instead of compiler support but don't want to lose type information current approach provides. Having to investigate backtrace for each logged allocation is a lot of work.

@rainers
Copy link
Member

rainers commented Nov 4, 2017

I kind of like the idea of using proxy GC instead of compiler support but don't want to lose type information current approach provides.

The TypeInfo parameter is supposed to give you that information, though it might not always be passed correctly now, e.g. when extending a block. The precise GC fixes that.

@dlang-bot dlang-bot added Needs Rebase needs a `git rebase` performed Needs Work and removed Needs Rebase needs a `git rebase` performed labels Jan 1, 2018
@mihails-strasuns-sociomantic
Copy link
Contributor Author

Huh, green

@mihails-strasuns-sociomantic
Copy link
Contributor Author

Ping. Any review/requests for this one? It is good to go from my side in its basic form.

Intermediate step for generating tracegc handlers automatically intended
to simplify reviewing by reducing size of each individual diff.
Replaces imprecise/wrong calculation based exclusively on checking
TypeInfo - now the data is retrieved directly from the GC and represents real
amount of bytes allocated.

Fix issue https://issues.dlang.org/show_bug.cgi?id=17294
Fix issue https://issues.dlang.org/show_bug.cgi?id=16280
Fix issue https://issues.dlang.org/show_bug.cgi?id=15481
Ordering by name as last measure means order of lines in the generate trace
will be reproducible, allowing for easy test cases.
Moves code snippet which refers to various language features that allocate to
actual test to ensure better functionality coverage.
Causes closure allocation despite being marked as scope
@mihails-strasuns-sociomantic
Copy link
Contributor Author

Right now biggest performance problem comes from how crazy expensive calling GC.stats is (because it calculates free/used data on the fly). Looking if it can be changed to pre-calculate stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM despite the few comments.

@@ -41,12 +42,11 @@ __gshared

extern (C) void profilegc_setlogfilename(string name)
{
logfilename = name;
logfilename = name ~ "\0";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unrelated to this commit.

@@ -143,8 +152,7 @@ shared static ~this()
if (counts.length)
{
qsort(counts.ptr, counts.length, Result.sizeof, &Result.qsort_cmp);

FILE* fp = logfilename.length == 0 ? stdout : fopen((logfilename ~ '\0').ptr, "w");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, is because of this. I would maybe make the commit message a bit more verbose that this commit is about avoid using language features that indirectly use the GC, not about removing some explicit recursive usage.

@@ -48,6 +48,9 @@ extern (C) void profilegc_setlogfilename(string name)
public void accumulate(string file, uint line, string funcname, string type,
size_t sz) @nogc
{
if (sz == 0)
return;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an optimization, right? Also some more verbose commit message could help.

@leandro-lucarella-sociomantic
Copy link
Contributor

Right now biggest performance problem comes from how crazy expensive calling GC.stats is (because it calculates free/used data on the fly). Looking if it can be changed to pre-calculate stuff.

Maybe do that in a separate PR to avoid this one to get stalled? Specially because that change has to be done very carefully, as one mistake might trigger wrong results and it's very hard to track/find out (speaking from experience).

@mihails-strasuns-sociomantic
Copy link
Contributor Author

#2164

@leandro-lucarella-sociomantic
Copy link
Contributor

ping @DmitryOlshansky @MartinNowak @andralex @rainers @wilzbach

Can you merge or give us some feedback if anything needs to be changed?

@leandro-lucarella-sociomantic
Copy link
Contributor

I would like to expand this PR much further to get a lot of other stats that are already kept by the GG (at least when profiling is enabled).

Since keeping some stats might be a bit expensive, in particular timing stats (we need to overcome not having GC.monitor()), which could do up to 5 syscalls per collection, I would also like to add a generic way to select which stats should be collected.

Conceptually, I think Stats/stats() should be renamed to Status/status(), as what you get from them are not really stats, but the current state of the GC (free/used memory). I know is probably not practical to make this name change, but I wanted to mention it anyway :)

New Stats/stats() should expose real stats, things that can be accumulated (and reset), so it should get an API with at least 4 basic operations: get(), reset(), enable() and disable(). The GC currently already get track of these:

__gshared Duration prepTime;
__gshared Duration markTime;
__gshared Duration sweepTime;
__gshared Duration recoverTime;
__gshared Duration maxPauseTime;
__gshared size_t numCollections;
__gshared size_t maxPoolMemory;

__gshared long numMallocs;
__gshared long numFrees;
__gshared long numReallocs;
__gshared long numExtends;
__gshared long numOthers;
__gshared long mallocTime;
__gshared long freeTime;
__gshared long reallocTime;
__gshared long extendTime;
__gshared long otherTime;
__gshared long lockTime;

(and this PR introduces size_t bytesAllocated)

Ideally we should be able to enable, disable and reset individual stats (and reset to a specific value rather than zero might be useful in some cases), so this can be done in a "only pay for what you need" style. But if it's too complicated, I guess we can also have a couple of categories (like "allocation" or "collection" or "counters" or "timing") or levels of detail (like "shallow" to only get the very basic stuff, like total collection time, and "deep" to get more fine-grained stats like "markTime" and other intermediate timings).

It might be useful to discuss this briefly at DConf to reduce the back and forth.

@MartinNowak @andralex @DmitryOlshansky @rainers

Copy link
Member

@andralex andralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leandro-lucarella-sociomantic LMK when this is ready - we can pull it now and expand it later.

A simple way to enable/disable collection of specific stats is to write -1 or some invalid value to the values that are not of interest, i.e. if numReallocs is set to -1 the reallocations won't be counted. This is of course of interest mostly for the more costly stats.

@@ -122,6 +122,9 @@ __gshared long extendTime;
__gshared long otherTime;
__gshared long lockTime;

// thread-local counter
size_t bytesAllocated;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ulong is better here; size_t is for sizes of objects in memory, whereas bytesAllocated is a tally. It could go over 4 GB on a 32-bit system.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copy&pasted what the GC is already currently gathering (when some profiling options are included). But yeah, we could review how these are being stored too (there is also the inconsistency of some timing being long and some Duration).

@RazvanN7
Copy link
Contributor

RazvanN7 commented Nov 2, 2021

This has been superseded by #2607

@RazvanN7 RazvanN7 closed this Nov 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Needs Rebase needs a `git rebase` performed stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants