Keep weak references to eager resources in session #229

karllessard · 2021-03-01T03:57:14Z

This is a fix for #208

As discussed in this thread, when migrating to JavaCPP, a bug/change of behavior was introduced that prevented the allocated eager resources from being garbage collected while the session was alive. On long-lived sessions (especially when using the default session), the help of the GC is mandatory to prevent OOM.

This PR restores the original behavior where resources are automatically freed once the session is closed, without preventing them from being garbage-collected earlier when they become unreachable and memory runs low.
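A minimal sketch of the idea described above, not the actual WeakPointerScope added by this PR: the scope holds its resources through a weak set, so it can still release everything on close but never keeps an otherwise unreachable resource alive. `FakeResource` and `WeakScope` are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Hypothetical stand-in for a native eager resource; not a TF Java or JavaCPP class.
class FakeResource implements AutoCloseable {
  boolean closed = false;

  @Override
  public void close() {
    closed = true;
  }
}

// Illustrative only: resources are held weakly, so the scope alone never keeps them alive.
class WeakScope implements AutoCloseable {
  private final Set<FakeResource> resources =
      Collections.newSetFromMap(new WeakHashMap<>());

  synchronized void attach(FakeResource r) {
    resources.add(r);
  }

  synchronized void detach(FakeResource r) {
    resources.remove(r);
  }

  // Releases whatever is still reachable; entries already collected are simply absent.
  @Override
  public synchronized void close() {
    for (FakeResource r : resources) {
      r.close();
    }
    resources.clear();
  }
}
```

Whether and when an unreachable entry actually disappears from the set is up to the garbage collector, which is why the sketch still releases everything that remains on `close()`.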

karllessard · 2021-03-01T04:00:38Z

@rnett, this PR might conflict with your current work on TensorScope, but I suggest we merge it first since it fixes a critical issue present in our library. We can review together afterward how to integrate both concepts.

rnett · 2021-03-01T05:35:44Z

Seems fine, I'll check when I rebase. I don't think it will conflict too much from a quick look, I don't really touch the eager handles.

saudet · 2021-03-01T13:14:44Z

This PR restores the original behavior where resources are automatically freed once the session is closed, without preventing them from being garbage-collected earlier when they become unreachable and memory runs low.

There is absolutely no guarantee that GC will run even and especially when memory is low. Your application can and will crash. The GC will not help you. Please stop thinking of GC as something that you should be using. You should not be using it at all.

karllessard · 2021-03-01T14:05:13Z

@saudet I know there is no guarantee and we can review this whole paradigm later on but what really matters now is to restore the original behavior which, afaik, never caused any issues.

Craigacp

The EagerSession.attach and EagerSession.detach methods now behave quite differently to how they used to, and differently from PointerScope.attach/detach, so I think it's worth documenting their current behaviour before we merge this in. In general it might be worth documenting with comments that this is the current behaviour so we don't lose it in another refactor. Otherwise in a year we'll need to go through the git history to figure out why this happened.

Craigacp · 2021-03-01T22:34:50Z

tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/WeakPointerScope.java

+  }
+
+  @Override
+  public synchronized void close() {


This close method doesn't prevent the object from being reused. Do we want it to?

karllessard · 2021-03-02

I don't think it is mandatory, but reusing it does not make much sense either. I can reset the pointers to null on close if you prefer.

Craigacp · 2021-03-02

I was thinking of a boolean closed flag that is checked, but setting the pointers to null is probably fine. The latter will crash if closed multiple times though.
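To illustrate the boolean-flag option mentioned in this review thread, here is a hedged sketch. `GuardedScope` is a hypothetical class, not the code that was merged. The flag makes `close()` safe to call more than once and lets `attach` reject reuse after closing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scope showing a "closed" flag; not the merged WeakPointerScope.
class GuardedScope implements AutoCloseable {
  private final List<AutoCloseable> attached = new ArrayList<>();
  private boolean closed = false;
  int releases = 0; // number of times the attached resources were actually released

  synchronized void attach(AutoCloseable resource) {
    if (closed) {
      throw new IllegalStateException("scope is already closed");
    }
    attached.add(resource);
  }

  @Override
  public synchronized void close() {
    if (closed) {
      return; // a second close is a no-op instead of a crash
    }
    closed = true;
    for (AutoCloseable resource : attached) {
      try {
        resource.close();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    attached.clear();
    releases++;
  }
}
```

Nulling the internal fields instead would also block reuse, but a second `close()` would then fail with a `NullPointerException` unless it is guarded as well.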

karllessard · 2021-03-02T02:39:46Z

Ok @Craigacp , I've made the requested changes plus added some basic unit testing on WeakPointerScope itself.

Craigacp

LGTM

karllessard · 2021-03-02T04:15:49Z

Thanks for the quick review @Craigacp !

saudet · 2021-03-02T08:06:38Z

Actually, this "feature" could cause problems when the user sets "org.bytedeco.javacpp.nopointergc". In that case, we must make sure that deallocation is performed manually or via something like PointerScope. If a reference is lost by a WeakReference, memory leaks will occur. I'm sorry to have proposed this, but this was actually a really bad idea...

saudet · 2021-03-02T08:53:54Z

Still, I don't think we need to modify JavaCPP to get this working. Instead of keeping a WeakReference to the allocated Pointer, it should suffice to keep strong references to the Deallocator objects that are registered, for example, in the case of Tensor, here: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTF_Tensor.java
When memory has already been deallocated, those will be null, but that may (read "will") end up in a race condition with the deallocator thread... so you'll probably need to provide some additional locking mechanism here.

Craigacp · 2021-03-02T16:05:28Z

Actually, this "feature" could cause problems when the user sets "org.bytedeco.javacpp.nopointergc". In that case, we must make sure that deallocation is performed manually or via something like PointerScope. If a reference is lost by a WeakReference, memory leaks will occur. I'm sorry to have proposed this, but this was actually a really bad idea...

Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us.
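The startup check suggested above could look roughly like the following sketch. The property name comes from this thread; treating only the string "true" as "disabled" is an assumption and may not match how JavaCPP itself parses the value. `PointerGcCheck` is a hypothetical helper.

```java
// Hypothetical helper; the parsing rule is an assumption, not JavaCPP's own logic.
class PointerGcCheck {
  static final String NO_POINTER_GC = "org.bytedeco.javacpp.nopointergc";

  // Throws if the given property value asks for the pointer GC to be disabled.
  static void requirePointerGc(String value) {
    if (Boolean.parseBoolean(value)) {
      throw new IllegalStateException(
          NO_POINTER_GC + "=true is not supported: eager resources rely on the pointer GC");
    }
  }

  // Convenience overload that reads the current system property (null when unset).
  static void requirePointerGc() {
    requirePointerGc(System.getProperty(NO_POINTER_GC));
  }
}
```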

What would be an alternative that allows EagerOperations to be cleaned up when they go out of scope? The initial JavaCPP implementation doesn't work as it unnecessarily holds on to too much memory, and both this and the TF 1.x version look like they should work (assuming nopointergc is false in this version).

Craigacp · 2021-03-02T16:11:20Z

Still, I don't think we need to modify JavaCPP to get this working. Instead of keeping a WeakReference to the allocated Pointer, it should suffice to keep strong references to the Deallocator objects that are registered, for example, in the case of Tensor, here: https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTF_Tensor.java
When memory has already been deallocated, those will be null, but that may (read "will") end up in a race condition with the deallocator thread... so you'll probably need to provide some additional locking mechanism here.

The documentation for the Pointer constructor that this bottoms out in says it keeps a reference to the thing that's passed in (even though the code doesn't look like it does). Given that the deallocator keeping a strong reference to the object is what we don't want, what does it do: the documented behaviour, or what it looks like the Java code does? I ask because I'm not sure whether the generated JNI is playing tricks, and because IntelliJ isn't clever enough to figure out where pointers get attached to scopes, so it's very hard to trace through JavaCPP code.

Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue?

karllessard · 2021-03-02T17:13:45Z

If a reference is lost by a WeakReference

@saudet can you explain better what you mean by losing a weak reference? Ok so I wanted to keep that conversation as a separate issue but just for the sake of stopping adding quotes around "fix" and "feature", let's clarify a few things :)

I might have expressed myself incorrectly in the description of this PR, but all eager resources are actually protected by a scope, independently of the GC, and this scope is held by EagerSession itself. The problem is that there is no obvious way for the library to control the number of operations a user might run inside a single session; it really depends on the design of their software.

In DJL, it seems that their eager sessions live for a relatively long time, probably long enough to accumulate many thousands of operations. Each operation created within a session remains alive until the session is closed, since the library doesn't know upfront if and when a user will need to access it (to retrieve a result tensor, for example).

So the GC here is our friend, giving us a hand to detect when it is safe to deallocate an operation while the session is still alive, and it does a pretty good job of it, especially since the native objects referenced by these operations are just small value objects and not large tensors. But if users manage their eager sessions so that they are closed as soon as any of their operations has been consumed by the software, they won't need the help of the GC. And this is already supported by the current code.

Something we should probably do is either get rid of the dangerous default eager session (since that one is never closed) or document it better, to warn users that working with long-lived sessions could potentially end up in OOM (even if chances are high that it will be prevented with the assistance of the GC, we cannot guarantee it).

Also, something I'm not too clear about is the usage of eager mode in production at all. I often see in the TF documentation that this mode is useful for debugging, but that for good performance, distributed training, etc. you should rely on graphs.

saudet · 2021-03-03T00:43:21Z

Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us.

Disabling the GC features of JavaCPP increases performance. You're the one that says that we should work with the community, so if DJL asks for this, think about what is going to be your reply. If you're basically going to tell them that TF Java doesn't care about performance, what do you think they are going to do? And they're already thinking about it. You're just confirming here their suspicions that you don't care about their needs. That said, it's possible that disabling all this stuff gives us nothing compared to the overhead of TF Core, sure, but be prepared to back up your claims.

What would be an alternative that allows EagerOperations to be cleaned up when they go out of scope? The initial JavaCPP implementation doesn't work as it unnecessarily holds on to too much memory, and both this and the TF 1.x version look like they should work (assuming nopointergc is false in this version).

Users that need it can use TensorScope wherever appropriate, but for users that don't do that, they don't need to do anything, and the GC will do its best to clean up the mess. It's not any different from how TF 1.x was working.

Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue?

There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/java/org/bytedeco/javacpp/Pointer.java#L460-L468

Craigacp · 2021-03-03T01:38:18Z

Turning off the pointer gc in JavaCPP seems to be a bad idea, given we can't enforce people use try with resources to make them clean up after themselves. The TF 1.x code had a background thread as a safety net, which mirrors what JavaCPP's pointer GC does, so we can just tell people it's not supported to turn that off (or even check on startup and throw an exception if it's disabled). When we move to Java 11 we can migrate this into a cleaner and have the JVM manage that thread for us.

Disabling the GC features of JavaCPP increases performance. You're the one that says that we should work with the community, so if DJL asks for this, think about what is going to be your reply. If you're basically going to tell them that TF Java doesn't care about performance, what do you think they are going to do? And they're already thinking about it. You're just confirming here their suspicions that you don't care about their needs. That said, it's possible that disabling all this stuff gives us nothing compared to the overhead of TF Core, sure, but be prepared to back up your claims.

In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)?

DJL's main issue seemed to be the memory leak, and this PR fixes that. Their use of constant operations is a worry, but they should revert that and move back to using Tensor.

What would be an alternative that allows EagerOperations to be cleaned up when they go out of scope? The initial JavaCPP implementation doesn't work as it unnecessarily holds on to too much memory, and both this and the TF 1.x version look like they should work (assuming nopointergc is false in this version).

Users that need it can use TensorScope wherever appropriate, but for users that don't do that, they don't need to do anything, and the GC will do its best to clean up the mess. It's not any different from how TF 1.x was working.

TensorScope wouldn't work for cleaning up EagerOperation. It's a different handle. We could introduce an EagerOperationScope too, but how would we recommend people use that? Tensors have well defined lifetimes wrt a program, but defining the lifetime of an operation sounds much harder.

Nevermind, I misread the if statement. If the pointergc is turned off, what is cleaning the deallocator reference queue?

There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/java/org/bytedeco/javacpp/Pointer.java#L460-L468

Ok, so if the pointergc is turned off then it will leak memory if things aren't always enclosed by a pointer scope which is closed?

karllessard · 2021-03-03T01:46:26Z

I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share?

Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time.

karllessard · 2021-03-03T02:07:14Z

On the other hand, what @rnett proposes in #188 is to at least enforce the scoping of the tensors that result from an eager operation (i.e. operand.tensor(scope)).

While it adds some complexity for the user, it helps them stay on a safe track without relying too much on the GC, which now seems to be a good trade-off.

rnett · 2021-03-03T02:28:39Z

On the other hand, what @rnett proposes in #188 is to at least enforce the scoping of the tensors that result from an eager operation (i.e. operand.tensor(scope)).

While it adds some complexity for the user, it helps them stay on a safe track without relying too much on the GC, which now seems to be a good trade-off.

I'm still in favor of adding a scope for eager operands (well, really all operands to keep it environment agnostic, but it wouldn't do much for Graph), and we could eventually re-add an asTensor() method that uses the operand scope. As @Craigacp said though, it's going to be a bit more complicated than tensors (although I don't think it's too bad) and there will be more cases where it's not applicable. Plus for tensors detach() is a thing, so you can still get them scope free, it's just a bit harder (which imo is a feature).

skirdey · 2021-03-03T04:51:21Z

I can speak to the performance issues when enabling GC and using DJL on TF Java 0.0.2:
The Deallocator thread makes GC spend 20 percent of the time in cleanup, and it is also a blocking thread.
And after a while there is an OOM.

When using nopointergc=true, GC gets a lot of breathing room, with no more continuous churning in cleanup, but OOM happens a lot, and a lot faster.

I move somewhere from 1 gigabyte a second to 15 gigabytes a second of data through inference.

I haven't tried the fix merged here as I've moved to using DJL with PyTorch, which has had no GC or memory leak issues so far.

Once the fix is released and DJL updates its TF dependencies I'll give it a try again, as I have lots of use cases for DJL and TensorFlow.

saudet · 2021-03-03T07:47:27Z

In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)?

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread. It's just for convenience:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java
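For readers less familiar with the API under discussion, here is a minimal sketch of `java.lang.ref.Cleaner` (Java 9+) used as a safety net alongside explicit closing. `NativeHandle` and `Release` are hypothetical names, and the `AtomicBoolean` stands in for a native free.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper around a native resource.
class NativeHandle implements AutoCloseable {
  private static final Cleaner CLEANER = Cleaner.create();

  // The cleanup action must not reference the NativeHandle itself,
  // otherwise the handle could never become phantom reachable.
  private static final class Release implements Runnable {
    private final AtomicBoolean freed;

    Release(AtomicBoolean freed) {
      this.freed = freed;
    }

    @Override
    public void run() {
      freed.set(true); // stand-in for the native free
    }
  }

  final AtomicBoolean freed = new AtomicBoolean(false);
  private final Cleaner.Cleanable cleanable = CLEANER.register(this, new Release(freed));

  // Explicit release; the registered action runs at most once.
  @Override
  public void close() {
    cleanable.clean();
  }
}
```

If `close()` is never called, the cleaner thread may run the same action once the handle becomes unreachable, but as noted in this thread there is no guarantee about when, or whether, that happens.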

DJL's main issue seemed to be the memory leak, and this PR fixes that. Their use of constant operations is a worry, but they should revert that and move back to using Tensor.

As @skirdey points out, I don't believe that's the case. We need to keep the conversation going with them instead of assuming that we know what's best for everyone!

TensorScope wouldn't work for cleaning up EagerOperation. It's a different handle. We could introduce an EagerOperationScope too, but how would we recommend people use that? Tensors have well defined lifetimes wrt a program, but defining the lifetime of an operation sounds much harder.

That paradigm works fine in C++ and Python where "scopes" like that come built-in with the language. Do you have an example of something that sounds hard to do?

There is no reference queue, it's not used. It's unnecessary baggage that only reduces performance:
https://github.com/bytedeco/javacpp/blob/1.5.4/src/main/java/org/bytedeco/javacpp/Pointer.java#L460-L468

Ok, so if the pointergc is turned off then it will leak memory if things aren't always enclosed by a pointer scope which is closed?

Yes, that's correct. We can also close pointers individually too.

I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share?

Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time.

No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.

Once the fix is released and DJL updates its TF dependencies I'll give it a try again, as I have lots of use cases for DJL and TensorFlow.

Would you have a small piece of code to share with us that faithfully demonstrates the issue? It would help make sure that we come up with a solution that everyone likes and that actually meets your needs.

skirdey · 2021-03-03T08:09:18Z

I've described the issue and the way to reproduce it by setting nopointergc=true while using the stock DJL benchmark and Gradle. You can also run the same benchmark using the PyTorch engine instead: deepjavalibrary/djl#690


saudet · 2021-03-03T08:23:08Z

I've described the issue and the way to reproduce it by setting nopointergc=true while using the stock DJL benchmark and Gradle. You can also run the same benchmark using the PyTorch engine instead: awslabs/djl#690

Thanks! Missed the bit about that being the standard benchmark from DJL. Would you have something that uses TF only though?

skirdey · 2021-03-03T08:27:05Z

Not at the moment, I haven't used TF Java outside of DJL


karllessard · 2021-03-03T13:06:09Z

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread.

The thread does not seem to be ~~blocking~~ synchronized though.

So when does the synchronization happen in JavaCPP? When it frees the memory? Could that be prevented, or is the trick to deallocate less often, in chunks (scopes), rather than in multiple small pieces (which is what the GC would do)?

Craigacp · 2021-03-03T14:26:15Z

In a choice between correctness and performance but random memory crashes, I will pick correctness. Also how much does this affect performance, and is this performance actually an issue on different JVMs (given that with the introduction of Java 9 cleaners it's likely to have received some optimisations)?

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread. It's just for convenience:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java

Yes, but idioms inside the JDK tend to receive performance optimisations from the JVM which might not be visible at the class file level. For example, that CleanerImpl class references InnocuousThread, which is a JDK internal class that might be treated differently by the runtime. I don't know if this is the case, but I do know that the GC has received a lot of attention in versions after Java 8, so its behaviour may have improved and the associated overheads might be lower.

TensorScope wouldn't work for cleaning up EagerOperation. It's a different handle. We could introduce an EagerOperationScope too, but how would we recommend people use that? Tensors have well defined lifetimes wrt a program, but defining the lifetime of an operation sounds much harder.

That paradigm works fine in C++ and Python where "scopes" like that come built-in with the language. Do you have an example of something that sounds hard to do?

Well, in Python TF they don't scope the eager operations, so we should figure out what their lifetime is and see if there is a natural mapping to Java. But more generally, it's easy to say when a tensor is being referenced and thus what its lifetime is. With an eager operation, should we scope them at a method level (e.g. if I have a method that implements a single layer of a ResNet, should we scope the ops to that and have them be closed in a try-with-resources on method exit), or does that cause other performance issues when run inside a loop (e.g. to build up the rest of the ResNet)? I don't know the answers, and we should try to figure that out before introducing scoping on something.
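The method-level option raised above can be sketched as follows. `OpScope`, `FakeOp` and `LayerDemo` are hypothetical; TF Java had no EagerOperationScope at the time of this thread. The sketch only shows the shape of the idiom and does not answer the performance question.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical scope that closes everything it tracks, most recent first.
class OpScope implements AutoCloseable {
  private final Deque<AutoCloseable> ops = new ArrayDeque<>();

  <T extends AutoCloseable> T track(T op) {
    ops.push(op);
    return op;
  }

  @Override
  public void close() {
    while (!ops.isEmpty()) {
      try {
        ops.pop().close();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }
}

// Hypothetical stand-in for an eager operation; counts live instances.
class FakeOp implements AutoCloseable {
  static int live = 0;

  FakeOp() {
    live++;
  }

  @Override
  public void close() {
    live--;
  }
}

class LayerDemo {
  // One "layer": every intermediate op is released when the method exits.
  static int layer(int input) {
    try (OpScope scope = new OpScope()) {
      scope.track(new FakeOp());
      scope.track(new FakeOp());
      return input + 1;
    }
  }
}
```

Calling `layer` in a loop creates and closes one scope per iteration, which is exactly the cost that would need to be measured before adopting this idiom.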

I’m curious to know more about the performance drop when GC is enabled, @saudet do you have some metrics to share?
Also just wanted to put emphasis one more time that TF Java works totally fine with GC disabled, it’s just a matter of closing the eager sessions in time.

No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.

Where's the leak? I went through and couldn't see how a reference could escape the set, and with the pointer GC on it will catch and deallocate everything. I agree that with it turned off this leaks, but again we could just tell people that's not supported.

saudet · 2021-03-04T05:43:46Z

The JDK's implementation of the Cleaner doesn't do anything special. It just uses PhantomReference with a ReferenceQueue, and spins that in a thread.

The thread does not seem to be ~~blocking~~ synchronized though.

Like I said at #208 (comment) and a bunch of other places, it will not block if you set maxBytes and maxPhysicalBytes to 0.
Or said another way: If you don't like blocking, do like DJL and set them to 0 by default! Sounds good?
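A hedged sketch of how these limits could be set programmatically. The property names `org.bytedeco.javacpp.maxbytes` and `org.bytedeco.javacpp.maxphysicalbytes` are assumed here from the `maxBytes` and `maxPhysicalBytes` settings named above and should be checked against the JavaCPP version in use. `JavaCppLimits` is a hypothetical helper.

```java
// Hypothetical helper; the property names are assumptions to verify against JavaCPP.
class JavaCppLimits {
  static final String MAX_BYTES = "org.bytedeco.javacpp.maxbytes";
  static final String MAX_PHYSICAL_BYTES = "org.bytedeco.javacpp.maxphysicalbytes";

  // Should run before any JavaCPP class is loaded, since such settings
  // are typically read once during class initialization.
  static void disableLimits() {
    System.setProperty(MAX_BYTES, "0");
    System.setProperty(MAX_PHYSICAL_BYTES, "0");
  }
}
```

The same values can be passed as `-D` options on the JVM command line, which avoids any dependency on class-loading order.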

Access to the linked list used by the Cleaner is synchronized, yes:
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java
That can block too, but it's not usually a problem.

So when does the synchronization happen in JavaCPP? When it frees the memory? Could that be prevented, or is the trick to deallocate less often, in chunks (scopes), rather than in multiple small pieces (which is what the GC would do)?

Like I keep telling you guys, thinking about GC is a dead end. Please stop thinking about anything related to GC!

Yes, but idioms inside the JDK tend to receive performance optimisations from the JVM which might not be visible at the class file level. For example, that CleanerImpl class references InnocuousThread, which is a JDK internal class that might be treated differently by the runtime. I don't know if this is the case, but I do know that the GC has received a lot of attention in versions after Java 8, so its behaviour may have improved and the associated overheads might be lower.

Yeah, ok, but it still doesn't make GC work any better for native resources. Panama has been having problems with this for over 5 years now, and the JDK in general for over 20 years (!!), and they've just started to get a grip on the situation, whereas PointerScope from JavaCPP has been working fine since 2018: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/ It's by no means perfect, but it gets the job done. It may cause the JVM to crash and what not, but so can TF Core. We could rewrite TF Core in Java, but we're not going to do this, right? ... What are we arguing about exactly?

Well, in Python TF they don't scope the eager operations, so we should figure out what their lifetime is and see if there is a natural mapping to Java. But more generally, it's easy to say when a tensor is being referenced and thus what its lifetime is. With an eager operation, should we scope them at a method level (e.g. if I have a method that implements a single layer of a ResNet, should we scope the ops to that and have them be closed in a try-with-resources on method exit), or does that cause other performance issues when run inside a loop (e.g. to build up the rest of the ResNet)? I don't know the answers, and we should try to figure that out before introducing scoping on something.

Well, look, I keep telling you, I'm pretty sure DJL needs something like this, and for the kind of applications they are working on, as @skirdey found out and wrote at #208 (comment), something like PointerScope works just fine at least for some things. So, are you going to ignore this? If so, we can just stop discussing here, and I'll simply go help them get their stuff running with JavaCPP directly.

No, it won't work correctly the way things are right now. Please reread what I wrote above. I've taken a look back at how it was done for TF 1.x as well, and its implementation is also incorrect. With that one, even with GC, references can (and will) leak.

Where's the leak? I went through and couldn't see how a reference could escape the set, and with the pointer GC on it will catch and deallocate everything. I agree that with it turned off this leaks, but again we could just tell people that's not supported.

Right, delete() and deleteAll() appear synchronized enough that calling them concurrently should be fine. Looking more closely, I think the only issue is when close() doesn't get called, the context itself never gets deleted. Anyway, if we wanted to do something similar to that with JavaCPP, I'd synchronize the deallocate() method here:
https://github.com/tensorflow/java/blob/master/tensorflow-core/tensorflow-core-api/src/main/java/org/tensorflow/internal/c_api/AbstractTFE_Op.java#L30
That gets assigned to Pointer.deallocator, which we can retrieve with the protected deallocator() method from those subclasses, keep strong references to them, and call deallocate() manually on them when desired. That should work well regardless of whether noPointerGC is enabled or not.
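The approach outlined above might be sketched as follows, under the assumption that a deallocator can be made synchronized and idempotent. `OnceDeallocator` and `DeallocatorSession` are hypothetical stand-ins and do not use JavaCPP's own types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a registered deallocator; not JavaCPP's own type.
class OnceDeallocator {
  static final AtomicInteger NATIVE_FREES = new AtomicInteger();
  private boolean done = false;

  // Synchronized and idempotent, so a session and a cleanup thread can both call it.
  synchronized void deallocate() {
    if (done) {
      return;
    }
    done = true;
    NATIVE_FREES.incrementAndGet(); // stand-in for the native delete call
  }
}

// The session keeps strong references to the deallocators, not to the pointers.
class DeallocatorSession implements AutoCloseable {
  private final List<OnceDeallocator> deallocators = new ArrayList<>();

  synchronized OnceDeallocator register() {
    OnceDeallocator deallocator = new OnceDeallocator();
    deallocators.add(deallocator);
    return deallocator;
  }

  @Override
  public synchronized void close() {
    for (OnceDeallocator deallocator : deallocators) {
      deallocator.deallocate();
    }
    deallocators.clear();
  }
}
```

Because the session references only the deallocators, the owning objects can still become unreachable, and whichever side calls `deallocate()` first wins while the other call becomes a no-op.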

Keep weak references to eager resources in session

ea7e614

karllessard mentioned this pull request Mar 1, 2021

Interaction between Tensorflow Java and JavaCPP Pointer deallocation #208

Open

karllessard mentioned this pull request Mar 1, 2021

Possible memory-leak in multi-threading inference using Tensorflow and having org.bytedeco.javacpp.nopointergc=true (CPU) deepjavalibrary/djl#690

Closed

Craigacp requested changes Mar 1, 2021


Add documentation and unit test

76e24d0

Craigacp approved these changes Mar 2, 2021


karllessard merged commit f6024dd into tensorflow:master Mar 2, 2021

karllessard deleted the eager-gc-fix branch March 2, 2021 04:18

saudet mentioned this pull request Mar 5, 2021

Add TensorScope #188

Closed

karllessard restored the eager-gc-fix branch April 6, 2021 22:44
