Add rudimentary garbage collection to graph #19513

jriddy · 2023-07-23T16:44:02Z

Attempt to address #7675 by adding some rudimentary garbage collection to the graph.

Re-opening of #19070 on a new branch to take advantage of caching in CI.

Pushing this now to know if this is directionally or generally correct. This is an attempt at solving #7675. I'm aware there needs to be tests on the graphs, but I'm still trying to load the data structure semantics into my head to know how to construct a test. In the meanwhile, I'd like some feedback to see if this is going in the right direction or if I'm totally off base here. Also feel free to correct me or push me to better practices on my Rust.

jriddy · 2023-07-23T17:50:14Z

@stuhood Modified this a bit, but I'm worried I may have some logic totally wrong here.

To get faster results, locally, I've added a much more aggressive threshold, but this results in tons of rule retries (which I think makes sense), since nodes are being removed from the graph quite quickly. And it only seems to result in a very paltry amount of memory being relinquished (on the order of 1-3 MB per cycle, if any at all), which is disheartening.

Filesystem changed during run: retrying `@rule(pants.backend.python.dependency_inference.rules.infer_python_init_dependencies())`

I guess it might be too much to chew off here to try to do GC within a short time frame, even though I think that's going to be necessary to get Pants to be performant on moderate-to-large size repos. But I'm okay with starting with a smaller goal if that gets us there.

I think I need to find a way to test this at a smaller level (and load a better idea of the graph into my head). So this is my prospective TODO here:

Test that the walk removes nodes
Test that Entry::clear actually clears node memory
Test that we update the age of the correct nodes
Parametrize the age threshold and cycle time (make global config parameters?)

Any guidance on where I could start with these? I'm a bit lost

huonw · 2023-08-05T05:19:03Z

(Just acknowledging the review request: I don't yet have the domain knowledge to provide guidance on your questions. Sorry!)

stuhood · 2023-08-15T21:09:31Z

but this results in tons of rule retries (which I think makes sense), since nodes are being removed from the graph quite quickly
Filesystem changed during run: retrying `@rule(pants.backend.python.dependency_inference.rules.infer_python_init_dependencies())`

Looking at the code that you have, what we're actually doing is clearing the node, rather than invalidating it. If a node is not currently running, that will not actually cause the dependents of the node to re-run (because it is an operation on an individual node, unlike the invalidate operation).

So I suspect that what is happening is that you are clearing nodes while they are running. You could add a check that avoids invalidating nodes while they are running, but I expect that as mentioned in #19070 (comment) , that would go hand in hand with adjusting the GC calculation to be based on the size of the node value (since there is no point clearing something that we don't actually have a value/size for).

I think I need to find a way to test this at a smaller level (and load a better idea of the graph into my head). So this is my prospective TODO here:

Test that the walk removes nodes

Test that Entry::clear actually clears node memory

Test that we update the age of the correct nodes

Parametrize the age threshold and cycle time (make global config parameters?)

I think that the guidance from step 6 here is probably still accurate, in that returning a count of cleared nodes from the method which implements the meat of garbage collection, and then testing a few different situations like "expecting nothing to be cleared" and "expecting 1 thing to be cleared", etc ... would be sufficient. To accomplish that, you will likely also need to go ahead and do the parametrization (then you can sleep between creating one set of nodes and another so that they end up with different access times).

jriddy · 2023-08-15T21:36:59Z

So I suspect that what is happening is that you are clearing nodes while they are running. You could add a check that avoids invalidating nodes while they are running, but I expect that as mentioned in #19070 (comment) , that would go hand in hand with adjusting the GC calculation to be based on the size of the node value (since there is no point clearing something that we don't actually have a value/size for).

Thanks for the feedback.

This is more in line what I've thought after spending more time diving into how the graph works (mostly by just going through its existing tests).

Based on this do you think this time-based variant is even worth pursuing? Or can we just add some additional check to see if the node has a value without worrying about its size?

stuhood · 2023-08-16T23:56:13Z

Hm, actually: thinking about this a bit more, it's unusual that anything that is Running would actually not be reachable from a root. During a run, everything should be reachable from the current roots.

So it could be that you actually want to be storing last access time on all nodes, and then rather than worrying about roots at all, you would garbage collect nodes which hadn't been accessed in a while (...and which weren't running). That would allow you to safely garbage collect data that was idle, even during a run.

jriddy · 2023-08-17T00:08:02Z

That makes sense. I need to figure out why I was clearing anything at all in this case. Like I said, I think I need to write more granular test cases to validate my assumptions if I want to tackle this. A question, as I’m still trying to fit this crate in my head: I think that roots are the nodes that are requested externally from outside the graph, and their dependents are what they add to be the graph while running. Is this correct? In a typical pants run, what roots might get added?

…

On Wed, Aug 16, 2023 at 19:56 Stu Hood ***@***.***> wrote: Hm, actually: thinking about this a bit more, it's unusual that anything that is Running would actually *not* be reachable from a root. During a run, everything should be reachable from the current roots. So it could be that you actually want to be storing last access time on *all* nodes, and then rather than worrying about roots at all, you would garbage collect nodes which hadn't been accessed in a while (...and which weren't running). That would allow you to safely garbage collect data that was idle, even during a run. — Reply to this email directly, view it on GitHub <#19513 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAH5UH6V7SBTX2J6EUOWHRTXVVM2RANCNFSM6AAAAAA2UTZO7E> . You are receiving this because you authored the thread.Message ID: ***@***.***>

jriddy added 3 commits July 23, 2023 11:45

engine: change interface of Entry.clear

c59ebe1

graph: various pr fixes

d101b71

jriddy mentioned this pull request Jul 23, 2023

graph: first pass attempt at garbage collection #19070

Closed

jriddy changed the title ~~Jriddy gc graph 7675~~ Add rudimentary garbage collection to graph Jul 23, 2023

jriddy requested a review from stuhood July 23, 2023 17:26

jriddy requested a review from huonw July 25, 2023 20:20

jriddy added the category:performance label Sep 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add rudimentary garbage collection to graph #19513

Add rudimentary garbage collection to graph #19513

jriddy commented Jul 23, 2023 •

edited

Loading

jriddy commented Jul 23, 2023

huonw commented Aug 5, 2023

stuhood commented Aug 15, 2023

jriddy commented Aug 15, 2023

stuhood commented Aug 16, 2023

jriddy commented Aug 17, 2023 via email

Add rudimentary garbage collection to graph #19513

Are you sure you want to change the base?

Add rudimentary garbage collection to graph #19513

Conversation

jriddy commented Jul 23, 2023 • edited Loading

jriddy commented Jul 23, 2023

huonw commented Aug 5, 2023

stuhood commented Aug 15, 2023

jriddy commented Aug 15, 2023

stuhood commented Aug 16, 2023

jriddy commented Aug 17, 2023 via email

jriddy commented Jul 23, 2023 •

edited

Loading