-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep weak references to eager resources in session #229
Conversation
@rnett , this PR might conflict with your actual work with |
Seems fine, I'll check when I rebase. I don't think it will conflict too much from a quick look, I don't really touch the eager handles. |
There is absolutely no guarantee that GC will run even and especially when memory is low. Your application can and will crash. The GC will not help you. Please stop thinking of GC as something that you should be using. You should not be using it at all. |
@saudet I know there is no guarantee and we can review this whole paradigm later on but what really matters now is to restore the original behavior which, afaik, never caused any issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EagerSession.attach
and EagerSession.detach
methods now behave quite differently to how they used to, and differently from PointerScope.attach/detach
, so I think it's worth documenting their current behaviour before we merge this in. In general it might be worth documenting with comments that this is the current behaviour so we don't lose it in another refactor. Otherwise in a year we'll need to go through the git history to figure out why this happened.
} | ||
|
||
@Override | ||
public synchronized void close() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This close method doesn't prevent the object from being reused. Do we want it to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is mandatory but it does not make much sense reusing it also. I can reset the pointers
to null on close if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking a boolean closed
which is checked, but setting the pointers to null is probably fine. The latter will crash if closed multiple times though.
Ok @Craigacp , I've made the requested changes plus added some basic unit testing on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for the quick review @Craigacp ! |
Actually, this "feature" could cause problems when the user sets "org.bytedeco.javacpp.nopointergc". In that case, we must make sure that deallocation is performed manually or via something like PointerScope. If a reference is lost by a WeakReference, memory leaks will occur. I'm sorry to have proposed this, but this was actually a really bad idea... |
Still, I don't think we need to modify JavaCPP to get this working. Instead of keeping a WeakReference to the allocated Pointer, it should suffice to keep strong references to the Deallocator objects that are registered, for example, in the case of Tensor, here: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTF_Tensor.java |
Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us. What would be an alternative that allows |
Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue? |
@saudet can you explain better what you mean by losing a weak reference? Ok so I wanted to keep that conversation as a separate issue but just for the sake of stopping adding quotes around "fix" and "feature", let's clarify a few things :) I might have expressed myself incorrectly in the description of this PR but all eager resources are actually protected by a scope, independently from the GC, and this scope if hold by In DJL, it seems that their eager session live for relatively a long time, probably enough to accumulate many thousands of operations. Each operation created within a session remain alive until the session is closed since the library don't know upfront if and when a user will need to access them (to retrieve a result tensor, for example). So the GC here is our friend, giving us a hand to detect when it is safe to deallocate an operation while the session is still alive and does a pretty good job doing it, especially that native objects referenced by this operations are just small value objects and not large tensors. But if the user manages its eager sessions so that they are closed as soon as any of their operations has been consumed by the software, they won't need the help of the GC. And this is already supported by the actual code. Something we should probably do is to either get rid of the dangerous default eager session (since that one is never closed) or document it better to warn the users that working with long-live sessions could potentially ends up in OOM (even if chances are high that they will be prevented with the assistance of the GC, we cannot guarantee it). Also something I'm not too clear about is the usage of eager mode at all in production. Often I see in TF documentation that this mode is useful for debugging but for good performances, distribute training, etc. you should rely on graphs. |
Disabling the GC features of JavaCPP increases performance. You're the one that says that we should work with the community, so if DJL asks for this, think about what is going to be your reply. If you're basically going to tell them that TF Java doesn't care about performance, what do you think they are going to do? And they're already thinking about it. You're just confirming here their suspicions that you don't care about their needs. That said, it's possible that disabling all this stuff gives us nothing compared to the overhead of TF Core, sure, but be prepared to back up your claims.
Users that need it can use TensorScope wherever appropriate, but for users that don't do that, they don't need to do anything, and the GC will do its best to clean up the mess. It's not any different from how TF 1.x was working.
There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance: |
In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)? DJL's main issue seemed to be the memory leak, and this PR fixes that. Their use of constant operations is a worry, but they should revert that and move back to using
Ok, so if the pointergc is turned off then it will leak memory if things aren't always enclosed by a pointer scope which is closed? |
I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share? Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time. |
On the other hand, what @rnett proposes in #188 is to at least enforce the scoping of the tensors that are resulting of an eager operation (i.e. While it adds some additional complexity on the user, it helps him to stay on a safe track without relying too much on the GC, which now seems to be a good trade off. |
I'm still in favor of adding a scope for eager operands (well, really all operands to keep it environment agnostic, but it wouldn't do much for Graph), and we could eventually re-add a |
I can speak for the performance issues while enabling GC and using DJL on TF Java 0.0.2 - When using nopointergc=true - GC gets a lot of breathing room, no more continous churning in cleanup, but OOM happens a lot and a lot faster. I move somewhere from 1 gigabyte a second to 15 gigabytes a second of data through inference. I haven't tried the fix merged here as I've moved to use DJL and PyTorch which has no GC or memory leak issues so far. Once the fix is released and DJl updates its TF dependencies Ill give it a try again as I have lots of use cases for DJL and Tensorflow. |
The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread. It's just for convenience:
As @skirdey points out, I don't believe that's the case. We need to keep the conversation going with them instead of assuming that we know what's best for everyone!
That paradigm works fine in C++ and Python where "scopes" like that come built-in with the language. Do you have an example of something that sounds hard to do?
Yes, that's correct. We can also close pointers individually too.
No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.
Would you have a small piece of code to share with us that faithfully demonstrates the issue? It would help make sure that we come up with a solution that everyone likes and that actually meets your needs. |
I've described the issue and the way to reproduce it by setting
nopointergc=true while using stock DJL benchmark and gradle.
You can also run same benchmark using PyTorch engine instead
deepjavalibrary/djl#690
…On Tuesday, 2 March 2021, Samuel Audet ***@***.***> wrote:
In a choice between correctness and performance but random memory crashes,
I will pick correctness. Also how much does this affect performance, and is
this performance actually an issue on different JVMs (given that with the
introduction of Java 9 cleaners it's likely to have received some
optimisations)?
The JDK's implementation of the Cleaner doesn't do anything special. It
just uses PhantomReference with a ReferenceQueue, and spins that in a
thread. It's just for convenience:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/
share/classes/jdk/internal/ref/CleanerImpl.java
DJL's main issue seemed to be the memory leak, and this PR fixes that.
Their use of constant operations is a worry, but they should revert that
and move back to using Tensor.
As @skirdey <https://github.com/skirdey> points out, I don't believe
that's the case. We need to keep the conversation going with them instead
of assuming that we know what's best for everyone!
TensorScope wouldn't work for cleaning up EagerOperation. It's a
different handle. We could introduce an EagerOperationScope too, but how
would we recommend people use that? Tensors have well defined lifetimes wrt
a program, but defining the lifetime of an operation sounds much harder.
That paradigm works fine in C++ and Python where "scopes" like that come
built-in with the language. Do you have an example of something that sounds
hard to do?
There is no reference queue, it's not used. It's unnecessary baggage that
only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/
java/org/bytedeco/javacpp/Pointer.java#L460-L468
Ok, so if the pointergc is turned off then it will leak memory if things
aren't always enclosed by a pointer scope which is closed?
Yes, that's correct. We can also close pointers individually too.
I’m curious to know more about the performance drop when GC is enabled,
@saudet <https://github.com/saudet> do you have some metrics to share?
Also just wanted to put emphasis one more time that TF Java *works
totally fine with GC disabled*, it’s just a matter of closing the eager
sessions in time.
No, it won't work correctly the way things are right now. Please reread
what I wrote above. I've taken a look back at how it was done for TF 1.x as
well, and its implementation is also incorrect. With that one, even with
GC, references can (and will) leak.
Once the fix is released and DJl updates its TF dependencies Ill give it a
try again as I have lots of use cases for DJL and Tensorflow.
Would you have a small piece of code to share with us that faithfully
demonstrates the issue? It would help make sure that we come up with a
solution that everyone likes *and* that actually meets your needs.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#229 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHIJQAVDTRGPTT4D5BZ35LTBXSR5ANCNFSM4YL2G5XQ>
.
|
Thanks! Missed the bit about that being the standard benchmark from DJL. Would you have something that uses TF only though? |
Not at the moment, I haven't used TF Java outside of DJL
…On Wednesday, 3 March 2021, Samuel Audet ***@***.***> wrote:
I've described the issue and the way to reproduce it by setting
nopointergc=true while using stock DJL benchmark and gradle. You can also
run same benchmark using PyTorch engine instead deepjavalibrary/djl#690
<deepjavalibrary/djl#690>
Thanks! Missed the bit about that being the standard benchmark from DJL.
Would you have something that uses TF only though?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#229 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHIJQAGK262JEGC4M7EUETTBXWXZANCNFSM4YL2G5XQ>
.
|
The thread does not seem to be So when the synchronization happens in JavaCPP, when it frees the memory? Could that be prevented or the trick is to deallocate less often, in chunks (scopes), rather than multiple small pieces (which is what the GC would do)? |
Yes, but idioms inside the JDK tend to receive performance optimisations from the JVM which might not be visible at the class file level. For example that CleanerImpl class references
Well in Python TF they don't scope the eager operations, so we should figure out what the lifetime of them is and see if there is a natural mapping to Java. But more generally it's easy to say when a tensor is being referenced and thus what it's lifetime is. With an eager operation should we scope them at a method level (e.g. if I have a method that implements a single layer of a ResNet, should we scope the ops to that and have them be closed in a try-with-resources on method exit), or does that cause other performance issues when run inside a loop (e.g. to build up the rest of the ResNet). I don't know the answers, and we should try to figure that out before introducing scoping on something.
Where's the leak? I went through and couldn't see how a reference could escape the set, and with the pointer gc on it will catch and deallocate everything. I agree with it turned off then this leaks, but again we could just tell people that's not supported. |
Like I said at #208 (comment) and a bunch of other places, it will not block if you set Access to the linked list used by the Cleaner is synchronized, yes:
Like I keep telling you guys, thinking about GC is a dead end. Please stop thinking about anything related to GC!
Yeah, ok, but it still doesn't make GC work any better for native resources. Panama has been having problems with this for over 5 years now, and the JDK in general for over 20 years (!!), and they've just started to get a grip on the situation, whereas
Well, look, I keep telling you, I'm pretty sure DJL needs something like this, and for the kind of applications they are working on, as @skirdey found out and wrote at #208 (comment), something like
Right, |
This is a fix for #208
As discussed in this thread, when migrating to JavaCPP, a bug/change of behavior was introduced that was preventing the allocated eager resources to be garbage collected while the session was alive. On long-live sessions (especially when using the default session), the help of the GC is mandatory to prevent OOM.
This PR restores the original behavior where resources will be automatically freed once the session is closed, without preventing them to be garbage-collected upfront when they are unreachable and that memory runs low.