Garbage collect the v2 engine Graph #7675

stuhood · 2019-05-08T01:17:15Z

The v2 Graph (implemented in rust) does not implement garbage collection, although it is definitely feasible.

As we fix other issues and pantsd instances are able to stay up longer and longer, we should consider implementing garbage collection based on walking from recently requested roots in the Session.


### Problem

`pantsd` does not implement garbage collection of the `Graph` (see #7675), but additionally, there are likely a few Python-level reference cycles beyond those that we have already discovered.

### Solution

We will eventually implement #7675, and it will need a config value to control how much it collects. But in the meantime, having a configurable built-in cap on total memory usage is generally useful, and can consume the same flag that our eventual collection will.

### Result

`pantsd` will restart itself when it uses more than the configured amount of memory (defaulting to 4GB). As mentioned in the test comment, until #8200 is fixed, the message rendered when we restart will not be particularly friendly, so we should likely not cherry-pick this to 1.29.x, which will not receive #8200.

This is not a complete fix for #9999, but I'm going to resolve it in favor of tracking followup in #7675.

[ci skip-rust-tests] [ci skip-jvm-tests]

### Problem

The default max memory usage from #10003 was chosen with larger monorepos in mind, and didn't match the expectation of consumers in smaller repos.

### Solution

Very large repos will have folks who are able/willing to adjust limits like this to optimize for their repo size, so adjust the default down in favor of having good defaults. This relates to #7675, but it isn't the time to dive in there.

### Result

Fixes #10264.

[ci skip-rust-tests]

stuhood · 2022-10-17T20:46:32Z

One place in which we could cheaply start to do more garbage collection would be to convert the interning Keys that the engine currently uses to a WeakRef map of some sort. Currently, Get inputs are held forever: see

pants/src/rust/engine/src/interning.rs

Lines 11 to 43 in 5aebe76

    
    ///
    /// A struct that encapsulates interning of python `Value`s as comparable `Key`s.
    ///
    /// To minimize the total amount of time spent in python code comparing objects (represented on
    /// the rust side of the FFI boundary as `Value` instances) to one another, this API supports
    /// memoizing `Value`s as `Key`s.
    ///
    /// Creating a `Key` involves interning a `Value` under a (private) `InternKey` struct which
    /// implements `Hash` and `Eq` using the precomputed python `__hash__` for the `Value` and
    /// delegating to python's `__eq__`, respectively.
    ///
    /// Currently `Value`s are interned indefinitely as `Key`s, meaning that they can never
    /// be collected: it's possible that this can eventually be improved by either:
    ///
    ///   1) switching to directly linking-against or embedding python, such that the `Value`
    ///      type goes away in favor of direct usage of a python object wrapper struct.
    ///   2) This structure might begin storing weak-references to `Key`s and/or `Value`s, which
    ///      would allow the associated `Value` handles to be dropped when they were no longer used.
    ///      The challenge to this approach is that it would make it more difficult to pass
    ///      `Key`/`Value` instances across the FFI boundary.
    ///   3) `Value` could implement `Eq`/`Hash` directly via extern calls to python (although we've
    ///      avoided doing this so far because it would hide a relatively expensive operation behind
    ///      those usually-inexpensive traits).
    ///
    /// To avoid deadlocks, methods of Interns require that the GIL is held, and then explicitly release
    /// it before acquiring inner locks. That way we can guarantee that these locks are always acquired
    /// before the GIL (Value equality in particular might re-acquire it).
    ///
    pub struct Interns {
      // A mapping between Python objects and integer ids.
      keys: Py<PyDict>,
      id_generator: atomic::AtomicU64,
    }
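To make the weak-reference idea concrete, here is a minimal pure-Rust sketch of an interner that holds only `Weak` handles. It is not the engine's `Interns` (which wraps a Python dict and deals with the GIL): `WeakInterns`, `intern`, `purge`, and `len` are hypothetical names, and `Arc<T>` stands in for a Python object handle.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Weak};

/// Hypothetical interner: assigns an integer id to each distinct value, but
/// keeps only a `Weak` handle so that unused values can be dropped.
pub struct WeakInterns<T> {
    // Precomputed hash -> entries with that hash, as (id, weak handle) pairs.
    buckets: HashMap<u64, Vec<(u64, Weak<T>)>>,
    next_id: u64,
}

impl<T: Hash + Eq> WeakInterns<T> {
    pub fn new() -> Self {
        WeakInterns { buckets: HashMap::new(), next_id: 0 }
    }

    fn hash_of(value: &T) -> u64 {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        hasher.finish()
    }

    /// Return the id of an equal, still-live value, or assign a fresh id.
    pub fn intern(&mut self, value: &Arc<T>) -> u64 {
        let bucket = self.buckets.entry(Self::hash_of(&**value)).or_default();
        for (id, weak) in bucket.iter() {
            if let Some(live) = weak.upgrade() {
                if *live == **value {
                    return *id;
                }
            }
        }
        let id = self.next_id;
        self.next_id += 1;
        bucket.push((id, Arc::downgrade(value)));
        id
    }

    /// Drop entries whose value has no strong references left.
    /// Returns the number of entries removed.
    pub fn purge(&mut self) -> usize {
        let mut removed = 0;
        self.buckets.retain(|_, bucket| {
            let before = bucket.len();
            bucket.retain(|(_, weak)| weak.strong_count() > 0);
            removed += before - bucket.len();
            !bucket.is_empty()
        });
        removed
    }

    /// Number of entries currently tracked (live or not yet purged).
    pub fn len(&self) -> usize {
        self.buckets.values().map(Vec::len).sum()
    }
}
```

One consequence of this design: once a value has been dropped and purged, an equal value interned later gets a new id, so anything that stored the old id would need to be invalidated as well.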

stuhood · 2023-05-06T03:55:38Z

To accomplish garbage collection of Node outputs/values (but not of Nodes themselves, which effectively act as their own keys), we could probably:

1. Add a collection of roots with the time when they were requested to `InnerGraph`:
   ```
   roots_by_age: HashMap<N, std::time::Instant>,
   ```
   (confirm that your changes compile by running `./cargo check -p graph`)
2. Adjust `Graph::create` and `Graph::poll` to record when roots were requested.
3. Adjust `Entry::clear` to optionally discard the `previous_result`. Currently the method name is a bit of a misnomer: it forces a Node to be recomputed, but still keeps the previous value in order to try and compute a generation value. But in this case, we want to free the memory and not worry about its dependents needing to re-run.
4. Add a method to `InnerGraph` that will walk the graph from "relevant" roots, and will then call `Entry::clear` on nodes which weren't reachable during the walk.
   - How to define "relevant" is unclear: right now the graph crate isn't aware of Node sizes, so that will probably need to be a followup. But one approach might be to only consider `roots_by_age` which are newer than some window and/or account for some minimum number of roots that we want to keep.
5. Add a test that calls your collection method in `tests.rs`: probably have the method return a value that indicates how many nodes were cleared (for the purposes of the test). Use `./cargo test -p graph` to check that it passes.
6. Add "periodic" calls to your collection method in `cycle_check_task` (and rename it to something more generic, since it would now perform general maintenance).
   - "Periodic" needs defining. But depending on how efficiently the method runs when there is nothing to do, you could run it more or less frequently.
7. Confirm that the tests still pass by running `./cargo test -p graph`.
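The steps above can be sketched on a toy graph. This is an illustration only, not the graph crate's code: `ToyGraph`, `Entry`, `request_root`, `collect`, and `has_value` are hypothetical stand-ins, nodes are plain string names rather than a generic `N`, and "relevant" is assumed to mean "requested within a time window".

```rust
use std::collections::{HashMap, HashSet};
use std::time::{Duration, Instant};

/// Toy stand-in for a graph entry: dependency edges plus a cached output.
struct Entry {
    deps: Vec<&'static str>,
    value: Option<String>,
}

/// Hypothetical, simplified `InnerGraph`.
pub struct ToyGraph {
    entries: HashMap<&'static str, Entry>,
    // When each root was last requested.
    roots_by_age: HashMap<&'static str, Instant>,
}

impl ToyGraph {
    pub fn new() -> Self {
        ToyGraph { entries: HashMap::new(), roots_by_age: HashMap::new() }
    }

    /// Add a node with its dependencies and a placeholder computed value.
    pub fn add(&mut self, node: &'static str, deps: &[&'static str]) {
        let value = Some(format!("output of {}", node));
        self.entries.insert(node, Entry { deps: deps.to_vec(), value });
    }

    /// Record a root request; overwriting refreshes the timestamp.
    pub fn request_root(&mut self, node: &'static str, now: Instant) {
        self.roots_by_age.insert(node, now);
    }

    pub fn has_value(&self, node: &str) -> bool {
        self.entries.get(node).map_or(false, |e| e.value.is_some())
    }

    /// Walk from roots requested within `window` of `now`, then drop the
    /// cached values of unreachable entries (the entries themselves stay).
    /// Returns how many values were cleared.
    pub fn collect(&mut self, now: Instant, window: Duration) -> usize {
        let mut stack: Vec<&'static str> = self
            .roots_by_age
            .iter()
            .filter(|(_, requested)| now.duration_since(**requested) <= window)
            .map(|(node, _)| *node)
            .collect();
        let mut reachable = HashSet::new();
        while let Some(node) = stack.pop() {
            if reachable.insert(node) {
                if let Some(entry) = self.entries.get(node) {
                    stack.extend(entry.deps.iter().copied());
                }
            }
        }
        // Assumption: roots outside the window are forgotten as well.
        self.roots_by_age
            .retain(|_, requested| now.duration_since(*requested) <= window);
        let mut cleared = 0;
        for (node, entry) in self.entries.iter_mut() {
            if !reachable.contains(node) && entry.value.take().is_some() {
                cleared += 1;
            }
        }
        cleared
    }
}
```

Returning the cleared count from `collect` matches the suggestion in the test step: a test can build a small graph, request roots at chosen instants, and assert on the number returned.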

jriddy · 2023-05-06T14:09:14Z

> One place in which we could cheaply start to do more garbage collection would be to convert the interning Keys that the engine currently uses to a WeakRef map of some sort. Currently, Get inputs are held forever: see pants/src/rust/engine/src/interning.rs (lines 11 to 43 in 5aebe76).

Is it worth going down this route? From my naive point of view it looks like you could do this as long as you could create a weakref to the object and implement a remove key callback on the interns struct.

jriddy · 2023-05-07T02:36:35Z

> Add a collection of roots with the time when they were requested to InnerGraph:
> roots_by_age: HashMap<N, std::time::Instant>,
> (confirm that your changes compile by running ./pants check -p graph)
>
> Adjust Graph::create and Graph::poll to record when roots were requested.

Okay Rust newb question: it looks like I can't do `HashMap<N, Instant>` in there because that leads to two fields in the struct owning the same data. And I don't think structs can reference self-owned data? So IIUC that means we need to just expand the value of the original nodes map to `HashMap<N, (EntryId, Instant)>`.

Also, is there a particular reason you're suggesting age as the discriminant? Simplicity of implementation? Seems like access time might be a better discriminant long term, which could turn this into something resembling an LRU cache. But I guess that could be a follow-up.

stuhood · 2023-05-07T20:26:06Z

> Okay Rust newb question: it looks like I can't do `HashMap<N, Instant>` in there because that leads to two fields in the struct owning the same data. And I don't think structs can reference self-owned data? So IIUC that means we need to just expand the value of the original nodes map to `HashMap<N, (EntryId, Instant)>`.

When you run into cases like this early on, the answer will be to use Clone: i.e. let node2 = node.clone(). Types which are relatively cheaply copyable generally implement Clone (if they are very cheaply/simply copyable they implement Copy, which allows them to be copied automatically).
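A minimal sketch of the `Clone` approach, assuming a hypothetical `NodeKey` type and a hypothetical `TwoMaps` struct in place of the real `InnerGraph` fields: each map owns its own copy of the key, so no field has to borrow from another.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical node key; deriving `Clone` lets each map own its own copy.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct NodeKey(pub String);

/// Two fields keyed by the same node type.
pub struct TwoMaps {
    pub ids: HashMap<NodeKey, u32>,
    pub roots_by_age: HashMap<NodeKey, Instant>,
}

/// Insert the same logical key into both maps by cloning it.
pub fn record(maps: &mut TwoMaps, node: &NodeKey, id: u32, now: Instant) {
    maps.ids.insert(node.clone(), id);
    maps.roots_by_age.insert(node.clone(), now);
}
```

The caller keeps ownership of `node`, and the cost is one clone per map, which is why this works best for keys that are cheap to copy.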

> Also, is there a particular reason you're suggesting age as the discriminant? Simplicity of implementation? Seems like access time might be a better discriminant long term, which could turn this into something resembling an LRU cache. But I guess that could be a follow-up.

The idea behind using a HashMap was for it actually to be access time: when you overwrite an entry in the hashmap, it gets a newer access time.
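As a small illustration of that point: `HashMap::insert` replaces the stored value for an existing key and returns the old one, so re-requesting a root refreshes its timestamp. The helper names `touch` and `least_recently_requested` below are hypothetical, and string names stand in for nodes.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Record a request for `root`. If the root was already present, its
/// timestamp is overwritten and the previous one is returned.
pub fn touch(
    roots: &mut HashMap<String, Instant>,
    root: &str,
    now: Instant,
) -> Option<Instant> {
    roots.insert(root.to_string(), now)
}

/// The root whose most recent request is the oldest, i.e. the first
/// candidate to drop under an LRU-style policy.
pub fn least_recently_requested(roots: &HashMap<String, Instant>) -> Option<&String> {
    roots
        .iter()
        .min_by_key(|(_, requested)| **requested)
        .map(|(name, _)| name)
}
```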

Pushing this now to find out whether it is directionally or generally correct. This is an attempt at solving pantsbuild#7675. I'm aware there need to be tests on the graph, but I'm still trying to load the data structure semantics into my head to know how to construct a test. In the meantime, I'd like some feedback to see if this is going in the right direction or if I'm totally off base here. Also feel free to correct me or push me to better practices on my Rust.

stuhood · 2023-09-29T20:23:51Z

Commented on #14676 (comment): actually allowing Keys to be garbage collected would require that we fully delete Nodes in the graph crate: described there.

stuhood added the engine label May 8, 2019

stuhood mentioned this issue May 8, 2019

Support GC of object instances that use @memo* #6555

Closed

stuhood added the pantsd label May 22, 2020

This was referenced Jun 8, 2020

pantsd memory leak #9999

Closed

Add a configurable cap on total pantsd memory usage. #10003

Merged

stuhood mentioned this issue Jul 7, 2020

Lower the default max-memory usage of pantsd. #10287

Merged

benjyw removed engine labels Sep 9, 2021

stuhood mentioned this issue Mar 7, 2023

Give pantsd more RAM by default. #18389

Merged

stuhood assigned jriddy May 6, 2023

jriddy mentioned this issue May 20, 2023

graph: first pass attempt at garbage collection #19070

Closed

jriddy mentioned this issue Jul 23, 2023

Add rudimentary garbage collection to graph #19513

Closed

stuhood mentioned this issue Sep 29, 2023

Use WeakKeyDictionary for Key interning #14676

Open
