Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rudimentary garbage collection to graph #19513

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Conversation

jriddy
Copy link
Contributor

@jriddy jriddy commented Jul 23, 2023

Attempt to address #7675 by adding some rudimentary garbage collection to the graph.

Re-opening of #19070 on a new branch to take advantage of caching in CI.

Pushing this now to know if this is directionally or generally correct.

This is an attempt at solving #7675.  I'm aware there needs to be tests
on the graphs, but I'm still trying to load the data structure semantics
into my head to know how to construct a test.  In the meanwhile, I'd like
some feedback to see if this is going in the right direction or if I'm
totally off base here.

Also feel free to correct me or push me to better practices on my Rust.
@jriddy jriddy changed the title Jriddy gc graph 7675 Add rudimentary garbage collection to graph Jul 23, 2023
@jriddy jriddy requested a review from stuhood July 23, 2023 17:26
@jriddy
Copy link
Contributor Author

jriddy commented Jul 23, 2023

@stuhood Modified this a bit, but I'm worried I may have some logic totally wrong here.

To get faster results, locally, I've added a much more aggressive threshold, but this results in tons of rule retries (which I think makes sense), since nodes are being removed from the graph quite quickly. And it only seems to result in a very paltry amount of memory being relinquished (on the order of 1-3 MB per cycle, if any at all), which is disheartening.

Filesystem changed during run: retrying `@rule(pants.backend.python.dependency_inference.rules.infer_python_init_dependencies())`

I guess it might be too much to chew off here to try to do GC within a short time frame, even though I think that's going to be necessary to get Pants to be performant on moderate-to-large size repos. But I'm okay with starting with a smaller goal if that gets us there.

I think I need to find a way to test this at a smaller level (and load a better idea of the graph into my head). So this is my prospective TODO here:

  • Test that the walk removes nodes
  • Test that Entry::clear actually clears node memory
  • Test that we update the age of the correct nodes
  • Parametrize the age threshold and cycle time (make global config parameters?)

Any guidance on where I could start with these? I'm a bit lost

@jriddy jriddy requested a review from huonw July 25, 2023 20:20
@huonw
Copy link
Contributor

huonw commented Aug 5, 2023

(Just acknowledging the review request: I don't yet have the domain knowledge to provide guidance on your questions. Sorry!)

@stuhood
Copy link
Member

stuhood commented Aug 15, 2023

but this results in tons of rule retries (which I think makes sense), since nodes are being removed from the graph quite quickly

Filesystem changed during run: retrying `@rule(pants.backend.python.dependency_inference.rules.infer_python_init_dependencies())`

Looking at the code that you have, what we're actually doing is clearing the node, rather than invalidating it. If a node is not currently running, that will not actually cause the dependents of the node to re-run (because it is an operation on an individual node, unlike the invalidate operation).

So I suspect that what is happening is that you are clearing nodes while they are running. You could add a check that avoids invalidating nodes while they are running, but I expect that as mentioned in #19070 (comment) , that would go hand in hand with adjusting the GC calculation to be based on the size of the node value (since there is no point clearing something that we don't actually have a value/size for).

I think I need to find a way to test this at a smaller level (and load a better idea of the graph into my head). So this is my prospective TODO here:

  • Test that the walk removes nodes
  • Test that Entry::clear actually clears node memory
  • Test that we update the age of the correct nodes
  • Parametrize the age threshold and cycle time (make global config parameters?)

I think that the guidance from step 6 here is probably still accurate, in that returning a count of cleared nodes from the method which implements the meat of garbage collection, and then testing a few different situations like "expecting nothing to be cleared" and "expecting 1 thing to be cleared", etc ... would be sufficient. To accomplish that, you will likely also need to go ahead and do the parametrization (then you can sleep between creating one set of nodes and another so that they end up with different access times).

@jriddy
Copy link
Contributor Author

jriddy commented Aug 15, 2023

So I suspect that what is happening is that you are clearing nodes while they are running. You could add a check that avoids invalidating nodes while they are running, but I expect that as mentioned in #19070 (comment) , that would go hand in hand with adjusting the GC calculation to be based on the size of the node value (since there is no point clearing something that we don't actually have a value/size for).

Thanks for the feedback.

This is more in line what I've thought after spending more time diving into how the graph works (mostly by just going through its existing tests).

Based on this do you think this time-based variant is even worth pursuing? Or can we just add some additional check to see if the node has a value without worrying about its size?

@stuhood
Copy link
Member

stuhood commented Aug 16, 2023

Hm, actually: thinking about this a bit more, it's unusual that anything that is Running would actually not be reachable from a root. During a run, everything should be reachable from the current roots.

So it could be that you actually want to be storing last access time on all nodes, and then rather than worrying about roots at all, you would garbage collect nodes which hadn't been accessed in a while (...and which weren't running). That would allow you to safely garbage collect data that was idle, even during a run.

@jriddy
Copy link
Contributor Author

jriddy commented Aug 17, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants