Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collect the v2 engine Graph #7675

Open
stuhood opened this issue May 8, 2019 · 6 comments
Open

Garbage collect the v2 engine Graph #7675

stuhood opened this issue May 8, 2019 · 6 comments
Assignees

Comments

@stuhood
Copy link
Member

stuhood commented May 8, 2019

The v2 Graph (implemented in rust) does not implement garbage collection, although it is definitely feasible.

As we fix other issues and pantsd instances are able to stay up longer and longer, we should consider implementing garbage collection based on walking from recently requested roots in the Session.

@stuhood stuhood added the engine label May 8, 2019
@stuhood stuhood added the pantsd label May 22, 2020
stuhood added a commit that referenced this issue Jun 9, 2020
### Problem

`pantsd` does not implement garbage collection of the `Graph` (see #7675), but additionally, there are likely a few Python-level reference cycles beyond those that we have already discovered.

### Solution

We will eventually implement #7675, and it will need a config value to control how much it collects. But in the meantime, having a configurable built-in cap on total memory usage is generally useful, and can consume the same flag that our eventual collection will.

### Result

`pantsd` will restart itself when it uses more than the configured amount of memory (defaulting to 4GB). As mentioned in the test comment, until #8200 is fixed, the message rendered when we restart will not be particularly friendly, so we should likely not cherry-pick this to 1.29.x, which will not receive #8200.

This is not a complete fix for #9999, but I'm going to resolve it in favor of tracking followup in #7675.

[ci skip-rust-tests]
[ci skip-jvm-tests]
stuhood added a commit that referenced this issue Jul 7, 2020
### Problem

The default max memory usage from #10003 was chosen with larger monorepos in mind, and didn't match the expectation of consumers in smaller repos. 

### Solution

Very large repos will have folks who are able/willing to adjust limits like this to optimize for their repo size, so adjust the default down in favor of having good defaults.  This relates to #7675, but it isn't the time to dive in there.

### Result

Fixes #10264.

[ci skip-rust-tests]
@benjyw benjyw removed engine labels Sep 9, 2021
@stuhood
Copy link
Member Author

stuhood commented Oct 17, 2022

One place in which we could cheaply start to do more garbage collection would be to convert the interning Keys that the engine currently uses to a WeakRef map of some sort. Currently, Get inputs are held forever: see

///
/// A struct that encapsulates interning of python `Value`s as comparable `Key`s.
///
/// To minimize the total amount of time spent in python code comparing objects (represented on
/// the rust side of the FFI boundary as `Value` instances) to one another, this API supports
/// memoizing `Value`s as `Key`s.
///
/// Creating a `Key` involves interning a `Value` under a (private) `InternKey` struct which
/// implements `Hash` and `Eq` using the precomputed python `__hash__` for the `Value` and
/// delegating to python's `__eq__`, respectively.
///
/// Currently `Value`s are interned indefinitely as `Key`s, meaning that they can never
/// be collected: it's possible that this can eventually be improved by either:
///
/// 1) switching to directly linking-against or embedding python, such that the `Value`
/// type goes away in favor of direct usage of a python object wrapper struct.
/// 2) This structure might begin storing weak-references to `Key`s and/or `Value`s, which
/// would allow the associated `Value` handles to be dropped when they were no longer used.
/// The challenge to this approach is that it would make it more difficult to pass
/// `Key`/`Value` instances across the FFI boundary.
/// 3) `Value` could implement `Eq`/`Hash` directly via extern calls to python (although we've
/// avoided doing this so far because it would hide a relatively expensive operation behind
/// those usually-inexpensive traits).
///
/// To avoid deadlocks, methods of Interns require that the GIL is held, and then explicitly release
/// it before acquiring inner locks. That way we can guarantee that these locks are always acquired
/// before the GIL (Value equality in particular might re-acquire it).
///
pub struct Interns {
// A mapping between Python objects and integer ids.
keys: Py<PyDict>,
id_generator: atomic::AtomicU64,
}

@stuhood
Copy link
Member Author

stuhood commented May 6, 2023

To accomplish garbage collection of Node outputs/values (but not of Nodes themselves, which effectively act as their own keys), we could probably:

  1. Add a collection of roots with the time when they were requested to InnerGraph:
    roots_by_age: HashMap<N, std::time::Instant>,
  2. (confirm that your changes compile by running ./cargo check -p graph)
  3. Adjust Graph::create and Graph::poll to record when roots were requested.
  4. Adjust Entry::clear to optionally discard the previous_result. Currently the method name is a bit of a misnomer: it forces a Node to be recomputed, but still keeps the previous value in order to try and compute a generation value. But in this case, we want to free the memory and not worry about its dependents needing to re-run.
  5. Add a method to InnerGraph that will walk the graph from "relevant" roots, and will then call Entry::clear on nodes which weren't reachable during the walk.
    • How to define "relevant" is unclear: right now the graph crate isn't aware of Node sizes, so that will probably need to be a followup. But one approach might be to only consider roots_by_age which are newer than some window and/or account for some minimum number of roots that we want to keep.
  6. Add a test that calls your collection method in tests.rs: probably have the method return a value that indicates how many nodes were cleared (for the purposes of the test). Use ./cargo test -p graph to check that it passes.
  7. Add "periodic" calls to your collection method in cycle_check_task (and rename it to something more generic to maintenance).
    • "Periodic" needs defining. But depending how efficiently the method runs when there is nothing to do, you could run it more or less frequently.
  8. Confirm that the tests still pass by running ./cargo test -p graph.

@jriddy
Copy link
Contributor

jriddy commented May 6, 2023

One place in which we could cheaply start to do more garbage collection would be to convert the interning Keys that the engine currently uses to a WeakRef map of some sort. Currently, Get inputs are held forever: see

///
/// A struct that encapsulates interning of python `Value`s as comparable `Key`s.
///
/// To minimize the total amount of time spent in python code comparing objects (represented on
/// the rust side of the FFI boundary as `Value` instances) to one another, this API supports
/// memoizing `Value`s as `Key`s.
///
/// Creating a `Key` involves interning a `Value` under a (private) `InternKey` struct which
/// implements `Hash` and `Eq` using the precomputed python `__hash__` for the `Value` and
/// delegating to python's `__eq__`, respectively.
///
/// Currently `Value`s are interned indefinitely as `Key`s, meaning that they can never
/// be collected: it's possible that this can eventually be improved by either:
///
/// 1) switching to directly linking-against or embedding python, such that the `Value`
/// type goes away in favor of direct usage of a python object wrapper struct.
/// 2) This structure might begin storing weak-references to `Key`s and/or `Value`s, which
/// would allow the associated `Value` handles to be dropped when they were no longer used.
/// The challenge to this approach is that it would make it more difficult to pass
/// `Key`/`Value` instances across the FFI boundary.
/// 3) `Value` could implement `Eq`/`Hash` directly via extern calls to python (although we've
/// avoided doing this so far because it would hide a relatively expensive operation behind
/// those usually-inexpensive traits).
///
/// To avoid deadlocks, methods of Interns require that the GIL is held, and then explicitly release
/// it before acquiring inner locks. That way we can guarantee that these locks are always acquired
/// before the GIL (Value equality in particular might re-acquire it).
///
pub struct Interns {
// A mapping between Python objects and integer ids.
keys: Py<PyDict>,
id_generator: atomic::AtomicU64,
}

Is it worth going down this route? From my naive point of view it looks like you could do this as long as you could create a weakref to the object and implement a remove key callback on the interns struct.

@jriddy
Copy link
Contributor

jriddy commented May 7, 2023

  • Add a collection of roots with the time when they were requested to InnerGraph:
    roots_by_age: HashMap<N, std::time::Instant>,
  • (confirm that your changes compile by running ./pants check -p graph)
  • Adjust Graph::create and Graph::poll to record when roots were requested.

Okay Rust newb question: it looks like I can't do HashMap<N, Instant> in there because that leads to two fields in the struct owning the same data. And I don't think structs can reference self-owned data? So IIUC that means we need to just expand the value of the original nodes map to HashMap<N, (EntryId, Interval)>.

Also is there are particular reason you're suggesting age as the discriminant? Simplicity of implementation? Seems like access time might be a better discriminant long term, which could turn this into something resembling an LRU cache. But I guess that could be a follow-up

@stuhood
Copy link
Member Author

stuhood commented May 7, 2023

Okay Rust newb question: it looks like I can't do HashMap<N, Instant> in there because that leads to two fields in the struct owning the same data. And I don't think structs can reference self-owned data? So IIUC that means we need to just expand the value of the original nodes map to HashMap<N, (EntryId, Interval)>.

When you run into cases like this early on, the answer will be to use Clone: i.e. let node2 = node.clone(). Types which are relatively cheaply copyable generally implement Clone (if they are very cheaply/simply copyable they implement Copy, which allows them to be copied automatically).

Also is there are particular reason you're suggesting age as the discriminant? Simplicity of implementation? Seems like access time might be a better discriminant long term, which could turn this into something resembling an LRU cache. But I guess that could be a follow-up

The idea behind using a HashMap was for it actually to be access time: when you overwrite an entry in the hashmap, it gets a newer access time.

jriddy added a commit to jriddy/pants that referenced this issue May 20, 2023
Pushing this now to know if this is directionally or generally correct.

This is an attempt at solving pantsbuild#7675.  I'm aware there needs to be tests
on the graphs, but I'm still trying to load the data structure semantics
into my head to know how to construct a test.  In the meanwhile, I'd like
some feedback to see if this is going in the right direction or if I'm
totally off base here.

Also feel free to correct me or push me to better practices on my Rust.
jriddy added a commit to jriddy/pants that referenced this issue May 21, 2023
Pushing this now to know if this is directionally or generally correct.

This is an attempt at solving pantsbuild#7675.  I'm aware there needs to be tests
on the graphs, but I'm still trying to load the data structure semantics
into my head to know how to construct a test.  In the meanwhile, I'd like
some feedback to see if this is going in the right direction or if I'm
totally off base here.

Also feel free to correct me or push me to better practices on my Rust.
jriddy added a commit to jriddy/pants that referenced this issue Jul 23, 2023
Pushing this now to know if this is directionally or generally correct.

This is an attempt at solving pantsbuild#7675.  I'm aware there needs to be tests
on the graphs, but I'm still trying to load the data structure semantics
into my head to know how to construct a test.  In the meanwhile, I'd like
some feedback to see if this is going in the right direction or if I'm
totally off base here.

Also feel free to correct me or push me to better practices on my Rust.
@stuhood
Copy link
Member Author

stuhood commented Sep 29, 2023

Commented on #14676 (comment): actually allowing Keys to be garbage collected would require that we fully delete Nodes in the graph crate: described there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants