
[Relay][RFC] Garbage Collection #3423

Closed
MarisaKirisame opened this issue Jun 23, 2019 · 7 comments

@MarisaKirisame
Contributor

Right now the Relay AOT compiler, interpreter, and VM all use reference counting to collect memory. However, it is possible to create a cyclic data structure in Relay, as demonstrated below:

```
data Loop = MkLoop (Ref (Optional Loop))
x : Loop = MkLoop (Ref None)
match x with
| MkLoop r -> r := Some x
end
```

In short, we can have a data type holding a mutable (nullable) reference, initialize it to null, and then point it at itself.

This example is contrived, but the same situation is entirely possible in real, meaningful Relay code: imagine a doubly linked list, or two closures referencing each other. Both form cyclic data that will never be collected by reference counting.
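The same leak can be reproduced with any reference-counted pointer. Below is a minimal C++ sketch (using std::shared_ptr rather than Relay's actual runtime objects) showing why the count never reaches zero:

```cpp
// A minimal C++ analogue of the cycle above, not Relay code: with plain
// reference counting, an object whose mutable field points back at itself
// keeps its own count above zero forever.
#include <memory>

struct Loop {
  std::shared_ptr<Loop> next;  // plays the role of the mutable Optional Loop field
};

int main() {
  auto x = std::make_shared<Loop>();  // use_count == 1
  x->next = x;                        // use_count == 2: the object now owns itself
  // When x goes out of scope the count drops back to 1, never 0, so the
  // Loop's destructor (and, in a real runtime, any device memory it owns)
  // never runs.
  return 0;
}
```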

There are three problems we should discuss:

0: What algorithm should we use? Mark-and-sweep, generational, etc.?

1: Should we still use reference counting?

2: How can we implement the runtime only once, shared by the AOT compiler, interpreter, and VM, instead of each of them having to implement it separately?

Maybe we can look at https://github.com/hsutter/gcpp?

@jroesch @tqchen @icemelon9 @wweic @junrushao1994 @nhynes any suggestions?

@wweic
Contributor

wweic commented Jun 24, 2019

Thanks for bringing up this topic. We might need to consider the scenarios in which we want to use Relay and design under those constraints. For now the main use case is inference (maybe training in the future). I guess most of the objects allocated during inference will be TensorObject, plus a few DatatypeObject; we can run some benchmarks on major models to confirm. TensorObject shouldn't cause cyclic references, and if it is the majority of the objects, we can handle it with the existing reference counting. For the remaining DatatypeObject, a mark-sweep GC can be called infrequently. But benchmark first. :-)

Agreed on sharing the GC. For AOT, should we learn from how Rust gets rid of GC, as a further optimization?
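To make the hybrid scheme above concrete, here is a rough sketch (not TVM code; HeapObject and CycleHeap are made-up names): only potentially cyclic objects are registered with a collector that runs an occasional mark-and-sweep pass, while tensor objects would stay on the existing reference-counting path and are not shown.

```cpp
// A sketch of the split policy: tensors stay purely reference counted (not
// shown), while potentially cyclic objects are tracked by a collector that
// runs an infrequent mark-and-sweep pass.
#include <unordered_set>
#include <vector>

struct HeapObject {
  bool marked = false;
  std::vector<HeapObject*> fields;  // outgoing references (e.g. ADT fields)
  virtual ~HeapObject() = default;
};

class CycleHeap {
 public:
  // Only potentially cyclic objects (ADT values, closures) are registered here.
  HeapObject* Allocate() {
    HeapObject* obj = new HeapObject();
    objects_.insert(obj);
    return obj;
  }
  void AddRoot(HeapObject* obj) { roots_.insert(obj); }
  void RemoveRoot(HeapObject* obj) { roots_.erase(obj); }

  // Called infrequently, e.g. every N allocations or between inference runs.
  void Collect() {
    for (HeapObject* root : roots_) Mark(root);
    for (auto it = objects_.begin(); it != objects_.end();) {
      if (!(*it)->marked) {
        delete *it;
        it = objects_.erase(it);
      } else {
        (*it)->marked = false;  // reset for the next collection
        ++it;
      }
    }
  }

 private:
  void Mark(HeapObject* obj) {
    if (obj->marked) return;
    obj->marked = true;
    for (HeapObject* field : obj->fields) Mark(field);
  }

  std::unordered_set<HeapObject*> objects_;
  std::unordered_set<HeapObject*> roots_;
};

int main() {
  CycleHeap heap;
  HeapObject* a = heap.Allocate();
  HeapObject* b = heap.Allocate();
  a->fields.push_back(b);
  b->fields.push_back(a);  // a <-> b cycle that pure refcounting would leak
  heap.Collect();          // neither is rooted, so both are reclaimed here
  return 0;
}
```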

@tqchen
Member

tqchen commented Jun 24, 2019

Personally, I am not in favor of introducing a GC to the system. Reference counting is fine and is great for handling external resources (GPU memory, etc.). For exactly the same reason, languages like Java/Scala have a pretty bad time working with GPU memory (memory is not deallocated precisely at the point where the data structure goes out of scope).

Ref counting has its limitations, but we can be mindful to avoid cycles, and it works well for most of the cases we are looking at without introducing the additional problems of a GC.
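To illustrate the point about precise deallocation, here is a hedged sketch: with reference counting, a device buffer is released at the exact moment its last owner goes out of scope, with no collector pause in between. DeviceAlloc and DeviceFree are hypothetical stand-ins for real driver calls, not TVM APIs.

```cpp
// Deterministic release under reference counting: the custom deleter runs
// exactly when the last shared_ptr owner disappears.
#include <cstddef>
#include <cstdio>
#include <memory>

void* DeviceAlloc(std::size_t nbytes) {
  std::printf("allocated %zu bytes\n", nbytes);
  return ::operator new(nbytes);  // pretend this is device memory
}

void DeviceFree(void* ptr) {
  std::printf("freed\n");
  ::operator delete(ptr);
}

// A shared, reference-counted handle to the buffer with a custom deleter.
std::shared_ptr<void> MakeDeviceBuffer(std::size_t nbytes) {
  return std::shared_ptr<void>(DeviceAlloc(nbytes), DeviceFree);
}

int main() {
  {
    auto buf = MakeDeviceBuffer(1 << 20);
    // ... run a kernel on buf ...
  }  // "freed" prints here, exactly at end of scope, with no collector involved
  std::printf("after scope\n");
  return 0;
}
```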

@ajtulloch
Contributor

cc @hlu1 re: the leak from, e.g., recursive functions taking a reference to themselves in their environment.

FWIW, could we solve these by adding the concept of weak references to the node system? It seems like in these cases closures could use weak references to other closures, with the higher-level environment maintaining the strong references to the closure objects themselves.
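A rough sketch of that idea using std::weak_ptr (the node system's own weak-reference support would of course look different): the mutually referencing closures hold only weak edges to each other, while the environment keeps the single strong reference to each, so no strong cycle forms and plain reference counting reclaims everything.

```cpp
// Closure and Environment are illustrative types, not TVM's node system.
#include <memory>
#include <vector>

struct Closure {
  std::weak_ptr<Closure> other;  // weak edge to the mutually recursive closure

  void Call() {
    // The weak edge is upgraded only while it is actually needed.
    if (auto strong = other.lock()) {
      (void)strong;  // ... invoke the other closure here ...
    }
  }
};

struct Environment {
  std::vector<std::shared_ptr<Closure>> closures;  // the only strong owners
};

int main() {
  Environment env;
  auto even = std::make_shared<Closure>();
  auto odd = std::make_shared<Closure>();
  even->other = odd;
  odd->other = even;
  env.closures = {even, odd};
  // When env and the local handles go out of scope, both closures are freed:
  // the weak edges do not keep the cycle alive.
  return 0;
}
```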

@MarisaKirisame
Contributor Author

@ajtulloch Indeed. However, it will always be possible for people to carelessly create a strong reference loop that can never be collected.

@hlu1
Contributor

hlu1 commented Jun 27, 2019

@MarisaKirisame, #3448 was the leak @ajtulloch was referring to.

@jroesch
Member

jroesch commented Jul 1, 2019

I talked with Zach DeVito from the PyTorch team for a while about refcounting; there are quite a few benefits to using reference counting. If we are worried about cycles, we should probably just use weak refs, solutions from Counting Immutable Beans (a recent paper by my MSR collaborator, where they do much better than GC'd languages in performance), or cycle detectors. I think RC generally provides the best tradeoffs for ML right now.

tqchen closed this as completed Jul 6, 2019
@tqchen
Member

tqchen commented Jul 6, 2019

It seems we agreed that weak references are better than a GC. Closing this RFC thread for now. Thanks to everyone who participated in the discussion.
