Large latency on Tensor allocation #313
Comments
We should probably turn off the JavaCPP memory allocation size check and let the GC clear it up, if you're creating lots of Java side objects GC should run frequently enough to clear things out anyway. Are you hitting the thread.sleep inside JavaCPP's allocator? Note if you're making lots of sessions out of saved model bundles you might be hitting some memory leaks inside TF's C API (which the TF Java API uses to interact with the native library). I opened this issue (tensorflow/tensorflow#48802) upstream but we've not had a response yet. |
@nebulorum If you see some output when running your app with something like shown at #251 (comment), then it means you're forgetting to call |
The synchronized block in Pointer is going to cause contention on the thread, especially if you're allocating from different threads on different CPUs (or worse still sockets) as it'll cause the object to bounce around. I'm not sure that 200 requests per second is enough to trigger real issues there, but the fact that the Pointer is using a linked list (which is also not particularly multicore friendly) could also be causing trouble. What's the execution environment like (e.g. num CPUs, type of CPUs, num NUMA nodes, is the JVM restricted to a subset thereof)? |
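To make the contention argument concrete, here is a minimal JDK-only sketch. It is not JavaCPP's `Pointer` code; `ContentionSketch` and its methods are hypothetical names made up for illustration. It tracks a running byte count two ways: through a single shared monitor, and through a `LongAdder`, which spreads updates over several cells so that threads collide less often.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration; this is not JavaCPP's Pointer implementation.
final class ContentionSketch {
    private final Object lock = new Object();
    private long lockedTotal;                                   // guarded by lock
    private final LongAdder stripedTotal = new LongAdder();

    /** Every caller funnels through one monitor, like a shared synchronized block. */
    void recordLocked(long bytes) {
        synchronized (lock) {
            lockedTotal += bytes;
        }
    }

    /** LongAdder keeps per-cell partial sums, so concurrent updates rarely collide. */
    void recordStriped(long bytes) {
        stripedTotal.add(bytes);
    }

    long lockedTotal() {
        synchronized (lock) {
            return lockedTotal;
        }
    }

    long stripedTotal() {
        return stripedTotal.sum();
    }

    /** Starts `threads` workers that each call `body` `iterations` times; returns elapsed nanoseconds. */
    static long run(int threads, int iterations, Runnable body) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();                              // release all workers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < iterations; i++) {
                    body.run();
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        ContentionSketch s = new ContentionSketch();
        int threads = Runtime.getRuntime().availableProcessors();
        long locked = run(threads, 1_000_000, () -> s.recordLocked(8));
        long striped = run(threads, 1_000_000, () -> s.recordStriped(8));
        System.out.printf("synchronized: %d ms, LongAdder: %d ms%n",
                locked / 1_000_000, striped / 1_000_000);
    }
}
```

The timings depend heavily on core count, socket layout and JIT warm-up, so treat any numbers from this as a rough indication rather than a benchmark.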
We are running on AWS, CPU only, no co-processors. What has baffled us is that 1.3.1 has very stable latency, and that code is 3 years old. This new endpoint can be faster, but it degrades really quickly as load increases. I have considered having a dedicated Tensor-allocating thread so that fewer threads fight for the block, but this will obviously create a bottleneck. |
@saudet , if I understand correctly, this synchronization will happen every time a resource is deallocated, no matter if the allocated memory is getting high or not. It would be interesting to disable completely the feature where JavaCPP keeps track of the amount of memory allocated and avoid this thread locking (looks like it is not avoidable here, even if you set Since that's a protected method, I can override it and run a test to see if there is any gain. @nebulorum , do you have any code available to share to run this benchmark? |
Ok, how big is your AWS instance? If there are NUMA effects from multiple CPU sockets (or within a socket for some of the AMD EPYC variants) then this could slow things down, and having a single thread doing the input tensor creation would help (though given there are lots of other things that create JavaCPP Pointers during a session.run call it won't prevent everything from bouncing the locking object across sockets). As Karl says, we should look at this locking on our side to see if we can improve things, but in the meantime pinning the threads if you're on a large machine might help. |
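The "single thread doing the input creation" idea could look roughly like the sketch below. It is JDK-only and uses a plain `float[]` copy as a stand-in for tensor creation; `SingleAllocatorSketch` and `allocate` are hypothetical names, and nothing here calls TF Java or JavaCPP. All allocations are funnelled through one executor thread, so only that thread would ever touch the shared lock, at the cost of serializing allocation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a float[] copy stands in for creating an input tensor.
final class SingleAllocatorSketch implements AutoCloseable {
    // One daemon thread performs every allocation.
    private final ExecutorService allocator = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "input-allocator");
        t.setDaemon(true);
        return t;
    });

    /** Copies `values` on the allocator thread and blocks until the copy is ready. */
    float[] allocate(float[] values) {
        Callable<float[]> task = () -> values.clone();
        Future<float[]> result = allocator.submit(task);
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() {
        allocator.shutdown();
    }
}
```

Request threads would call `allocate(...)` and wait, which is exactly the bottleneck mentioned earlier in the thread, so this only pays off if the cost of cross-thread lock contention is higher than the cost of queueing.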
We checked in integration on if we call While we did have a face palm moment when you mentioned the Making sure people are aware they need to do proper resource management to ensure performance may be helpful. Especially since this is not the first issue around the topic. |
Sigh... sure, I can probably do that when the "org.bytedeco.javacpp.nopointergc" system property is set to "true" so that you can all stop blaming me for bugs in your code! :) No one has ever encountered any issues like that with JavaCPP in the past. This is a problem in TF Java. |
Lock contention is a different issue to the freeing of resources. If threads are blocked waiting for the lock on |
So we added calling We are doing this load testing on AWS One other thing I was considering is preallocating and reusing the Tensor. We could put an upper bound on tensor length and maybe preallocate them. Not sure this would work on the runtime though. And, thinking about it, this is a bit optimistic because the logic would be hard. Also, the current implementation has no mutator for the tensors. |
You can write values directly into the tensors when they are cast to their appropriate type (e.g. A |
If I go down this route, is it possible to also update the shape of the Tensor? My thinking is that, given a model, we know how many tensors there are and their maximum size. Allocate them for the maximum size, set the data, and update the dimensions. No allocation needed. |
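As a sketch of the preallocate-and-reuse idea only: the pool below holds direct `FloatBuffer`s of a fixed maximum size and uses the buffer's `limit` as the "logical length" of each request. `BufferPoolSketch` is a hypothetical name, and whether TF Java can wrap such a buffer without copying, or change a tensor's shape after creation, is not established anywhere in this thread, so treat that part as an open assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: pools off-heap buffers; it does not create or resize tensors.
final class BufferPoolSketch {
    private final BlockingQueue<FloatBuffer> free;
    private final int maxElements;

    BufferPoolSketch(int buffers, int maxElements) {
        this.maxElements = maxElements;
        this.free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            // Allocate once, up front, at the upper bound.
            free.add(ByteBuffer.allocateDirect(maxElements * Float.BYTES)
                    .order(ByteOrder.nativeOrder())
                    .asFloatBuffer());
        }
    }

    /** Borrows a buffer and fills it; returns null when the pool is exhausted. */
    FloatBuffer fill(float[] values) {
        if (values.length > maxElements) {
            throw new IllegalArgumentException("exceeds preallocated size");
        }
        FloatBuffer buf = free.poll();
        if (buf == null) {
            return null;
        }
        buf.clear();
        buf.put(values);
        buf.flip();                 // limit now equals values.length
        return buf;
    }

    /** Returns a borrowed buffer to the pool. */
    void release(FloatBuffer buf) {
        free.offer(buf);
    }

    int available() {
        return free.size();
    }
}
```

After `fill`, `remaining()` gives the number of valid elements while `capacity()` stays at the preallocated maximum, which is the "allocate for the maximum, update the dimension" split described above.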
Tried to work with capacity to find the breaking point.
The composite below shows JavaCPP and some of the 32 threads at different load levels (300, 200, 100 VU). The smaller the load, the less common a full column of sync activity is. At higher load the contention looks a lot worse. Note that the window ticks on the graph are 2 seconds apart and light blue is the monitor state. Even at low loads the sync appears. At 300 VU, K8S reported 6.27 CPUs of usage. We will try smaller thread pools. But there seems to be some interesting interplay on the JavaCPP thread. |
I've added a way to avoid any synchronized code at runtime in commit bytedeco/javacpp@d788390. Please give it a try with JavaCPP 1.5.6-SNAPSHOT and the "org.bytedeco.javacpp.nopointergc" system property set to "true". If you still see a "JavaCPP Deallocator" in your threads, that means the system property was not set early enough, so please try again, possibly with something like |
I tried these parameters and the artefact, and the "JavaCPP Deallocator" is nowhere to be found. But the behaviour did not improve. I also tested 8 threads, and then it looks like a synchronised swimming team :) On the positive side, the machine does not seem to be unstable, and at 8 threads I have the same latency as 32 threads at 1000 virtual users. The difference is the RPS: 8 -> 400 RPS, 32 -> 650 RPS. I'm considering testing just one layer at a time to confirm where the problem is, but this will take some time. Maybe we can construct an example that just loops on allocation and check whether the contention appears. |
I created a small example that tries to allocate and close Tensors with several thread counts. I also added a simple latency check: https://gist.github.com/nebulorum/f7978aa5519cab8bece65d4dac689d4f The main loop allocates 80 Tensors based on a single array, sleeps for 5 ms, then On my Mac Dual Core I5 (4 HT if I got it correctly) with 32 threads, we get >2500 allocations per second, and occasionally an allocation takes more than 200 ms.
With 4 worker threads you see allocation taking 50-100 ms:
On a single thread you can still see 50-100ms allocation but much less often:
If you remove the pause, a single thread goes to 2500 allocations per second, and you start seeing some slower allocations, but only as single occurrences. |
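For readers who cannot open the gist, the shape of such a stress loop can be sketched with the JDK alone. This is not the gist's code: `AllocationStressSketch` is a hypothetical name, and 80 `long[]` copies stand in for the 80 Tensors, so it measures JVM allocation rather than native allocation. Each worker times one batch, counts it, flags it as slow when it exceeds a threshold, then pauses.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: long[] copies stand in for tensors; no TF Java calls are made.
final class AllocationStressSketch {
    final AtomicLong batches = new AtomicLong();
    final AtomicLong slowBatches = new AtomicLong();

    /** One batch: 80 copies of the source array. */
    static long[][] allocateBatch(long[] source) {
        long[][] out = new long[80][];
        for (int i = 0; i < out.length; i++) {
            out[i] = source.clone();
        }
        return out;
    }

    /** Runs the workers to completion, recording total and slow batch counts. */
    void runWorkers(int threads, int batchesPerThread, long pauseMillis, long slowThresholdNanos) {
        long[] source = new long[200];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < batchesPerThread; i++) {
                    long begin = System.nanoTime();
                    long[][] batch = allocateBatch(source);
                    long elapsed = System.nanoTime() - begin;
                    if (batch.length != 80) {
                        throw new IllegalStateException("unexpected batch size");
                    }
                    batches.incrementAndGet();
                    if (elapsed > slowThresholdNanos) {
                        slowBatches.incrementAndGet();   // an outlier worth reporting
                    }
                    try {
                        Thread.sleep(pauseMillis);       // the pause between batches
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        AllocationStressSketch s = new AllocationStressSketch();
        s.runWorkers(32, 200, 5, 50_000_000L);           // 50 ms threshold
        System.out.println("batches=" + s.batches.get() + " slow=" + s.slowBatches.get());
    }
}
```

Swapping `allocateBatch` for real tensor creation and closing would be needed to reproduce the numbers discussed here.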
JavaCPP can allocate and deallocate 2 MB over |
I think there is contention in JavaCPP. Even with a single thread you will see latency spikes. But with a lot of threads, if you are unlucky, 3% of your allocations take more than 50 ms. There is even a drop in allocations per second. Under normal latency tolerances of 50 ms or greater this is OK, but in our use case we aim for less than 40 ms. On TensorFlow Java 1.3.1 our P999 is under 40 ms. With the new 0.3.1 we can't even reach this at the P50 under load. The problem is not the average allocation time but the outliers. The more threads you have, the worse it becomes. If you look at the first example, you can see 32 threads taking more than 100 ms to allocate, and this happened 2 seconds apart. This could be interplay with GC too. Here is a single thread with no sleep:
You can see on the second row a > 200 ms wait, and the allocations per second are 33% lower. As a reminder, each allocation is 80 tensors. |
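Since the point is outliers rather than averages, a small nearest-rank percentile helper shows how the two diverge. This is a generic sketch, not code from the gist; `PercentileSketch` is a hypothetical name and the sample values in `main` are made up to mirror the "3% above 50 ms" case.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentiles over latency samples.
final class PercentileSketch {
    /** Smallest sample such that at least p percent of all samples are at or below it. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samples) {
        long sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return (double) sum / samples.length;
    }

    public static void main(String[] args) {
        // Made-up data: 970 batches at 2 ms and 30 batches (3%) at 60 ms.
        long[] millis = new long[1000];
        Arrays.fill(millis, 0, 970, 2L);
        Arrays.fill(millis, 970, 1000, 60L);
        System.out.println("mean=" + mean(millis)
                + " p50=" + percentile(millis, 50)
                + " p99=" + percentile(millis, 99));
    }
}
```

With that made-up data the mean stays under 4 ms while the P99 is 60 ms, which is why an average allocation time says little about whether a 40 ms budget holds.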
Right, ok, but the contention cannot happen in JavaCPP since I've removed all of that in commit bytedeco/javacpp@d788390. Something else in TF Java is happening that is unrelated to JavaCPP. |
I think there is still sync on the I did a run with the 1.5.6-SNAPSHOT and the The results are interesting:
Single Thread:
With 4 Threads (which also had a strange behaviour of taking 49 seconds before I saw allocations):
With 16 threads:
|
Just as a baseline I did only JVM allocation. GC goes crazy and we do see some contention, but at 10X the allocations.
    private void allocLoop() {
        // Plain on-heap arrays only; no Tensor or JavaCPP allocation involved.
        long[] nd = new long[200];
        for (var i = 0; i < 80; i++) {
            long[] nd2 = new long[200];
        }
    }
Benchmarking is hard :) Either way, just layering on tensor allocation costs 10X. 1 thread:
16 threads:
|
We've been testing some more. We tried running our server without TensorFlow, but including data access to Cassandra, and the same config can handle 380 RPS with a 34 ms P99. So our time budget is tight, but if TF added 4-5 ms we would be OK. For full transparency, adding a 5 ms sleep makes the P99 go up to 70 ms, but this probably means we went over the RPS capacity. We will clean up the allocation logic to make it generate less garbage and take less time. |
No, add() never gets called, remove() returns before entering the synchronized {} block, and nothing else is synchronized. |
I've replaced And I get the following results. I think that looks fine. Do you see anything wrong with those numbers?
|
That is an interesting find. That is kind of what I was expecting: low variance. Are the data types equivalent? Or is this showing something else inside the tensor? |
A mutex does get used here when collecting stats: |
We did not do any configuration on TensorFlow:
We follow the implementation of 1.3.1 on this, so maybe we got the wrong things. We have no TensorFlow specific config or flags, and on Dockerfile we just give some params to the JVM:
PS: Maybe the title of this ticket no longer makes sense. |
@dennisTS I've just shown above that this doesn't happen with 1.5.6-SNAPSHOT. |
Where are you setting the "org.bytedeco.javacpp.nopointergc" system property to "true"? There's one more place where synchronization could potentially take place when classes are unloaded, for the WeakHashMap used inside the call to sizeof() in Pointer.DeallocatorReference, but I have a hard time imagining in what kind of application that would actually matter. In any case, I've "fixed" that in commit bytedeco/javacpp@0ce97fc . |
We collected from our production config. If we can run the SNAPSHOT version in production we will do it. We need to release a version of this for some testing. But this was more to this question:
No, I don't know how I would have turned on the mutex. We are not using any additional configuration. Maybe we should make sure the mutex is not on. As your example showed, if Tensors are not allocated, performance is pretty stable. So maybe the issue is not JavaCPP. |
Using TF Java 0.3.1 and JavaCPP 1.5.6-SNAPSHOT with "noPointerGC", the original AllocateStress also looks OK to me:
If you're satisfied with JavaCPP 1.5.6-SNAPSHOT, but need a release, I can do a 1.5.5-1 or something with the fix, but before doing that please make sure that there isn't something else that you may want updated as well. Thanks for testing! |
Hi @saudet , thank you! We'll check it and get back to you |
We ran this (or rather, @dennisTS did) in our test rig and things improved a lot: at 600 RPS against 8 CPUs we are getting p99.9 = 48 ms and p95 = 24 ms. So a real improvement using this version and option. As a follow-up, what do we lose by using |
The JavaCPP GC thread is mostly useful for collecting unclosed resources in eager mode, which is used more for debugging than for high performance in production. If you are running sessions from a saved model graph, you normally don't need an eager session and can live with |
@saudet, is there a way to completely turn off GC support in JavaCPP programmatically instead of passing a parameter on the command line? I guess we can define the value of this system property directly in TF Java before loading the TensorFlow runtime library, but how can we guarantee that this happens before the JavaCPP libraries are statically loaded? |
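For an application (as opposed to a library), one way to picture "early enough" is to set the property as the very first statement of the entry point. The sketch below is JDK-only; `NoPointerGcSketch` and `ensureNoPointerGc` are hypothetical names, and the only thing taken from this thread is the property name. It assumes no JavaCPP or TensorFlow class has been initialized before `main` runs, which the application has to guarantee itself, and it changes behaviour for every JavaCPP-based library in the process.

```java
// Hypothetical sketch: sets the property from this thread before anything loads JavaCPP.
final class NoPointerGcSketch {
    static final String PROPERTY = "org.bytedeco.javacpp.nopointergc";

    /** Sets the property unless a value was already supplied, e.g. on the command line. */
    static String ensureNoPointerGc() {
        if (System.getProperty(PROPERTY) == null) {
            System.setProperty(PROPERTY, "true");
        }
        return System.getProperty(PROPERTY);
    }

    public static void main(String[] args) {
        // Must come first: no code that touches JavaCPP or TensorFlow classes
        // (including their static initializers) may run before this call.
        ensureNoPointerGc();
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
        // Only after this point would the application load its model (not shown).
    }
}
```

If a "JavaCPP Deallocator" thread still shows up in a thread dump, the property was read before this code ran, and the command-line route is the safer option.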
Now that we've been running for some hours in test we would really like to release to production. Will check with security if we can use the snapshot. But how long would it take to release a version of JavaCPP? |
Not really possible, unless TF Java is the only library using JavaCPP in the app, and if the user has other libraries using JavaCPP, it's probably not a good idea to tamper with global settings like that anyway. In any case, this is the kind of information that is useful for optimization, and should be part of the documentation on a page, for example, like those here: That said, I could add a parameter to
A couple of days, but like I said, let's make sure there isn't anything else we want to put in there... |
We've been running the SNAPSHOT for some days and it seems OK. Did the final version get released? |
You mean JavaCPP? No, do you need one? BTW, latency may be potentially even lower with TF Lite, so if your models are compatible with it, please give it a try: Those are currently wrappers for the C++ API, but if you would like to use an idiomatic high-level API instead, please let us know! |
@saudet, sorry for getting back to you with such a delay - somehow forgot about this thread
Well, it would be nice - using SNAPSHOT in production just doesn't feel good :)
We are using higher level APIs now; but generally even with the SNAPSHOT version of JavaCPP latencies are good for us |
System information
Describe the documentation issue
Not sure this is the correct forum, but I would like some guidance on how to set up sessions and resource management.
After two weeks trying to understand why latencies in 0.3.1 were completely uncontrollable (as compared to official 1.3.1) I ran into #208. This matches my observations.
We are trying to run predictions on models with thousands of data points in different tensors per prediction. Memory allocation on the threads is around 20 MB/s, and there seems to be a sync between the JavaCPP allocation thread and our worker threads. In addition, allocating Size(1) tensors seems to be very slow (in the 7 ms range).
After reading #208, it seems we are doing everything wrong. But I don't really have a clear picture of how it should be done: Would EagerSession help? Could I use a Session per HTTP request? Should I allocate larger multi-dimensional tensors instead of a single one? How should I configure thread pools? I understand that the API is a work in progress, but the current documentation is very light on this kind of guidance.
I don't think this is a bug, but I can convert it into some other sort of issue.
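On the resource-management part of the question, the general Java pattern is to scope every native-backed object with try-with-resources so it is released deterministically instead of waiting for GC. The sketch below is JDK-only: `NativeHandle` is a hypothetical stand-in for an `AutoCloseable` such as a tensor, and `predictOnce` is a made-up name; it does not call TF Java.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: NativeHandle stands in for any AutoCloseable native resource.
final class ResourceScopeSketch {
    static final class NativeHandle implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();   // live handle count
        private boolean closed;

        NativeHandle() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            if (!closed) {                                        // idempotent release
                closed = true;
                OPEN.decrementAndGet();
            }
        }
    }

    /** Opens an input and an output handle for one request; both are closed on exit. */
    static int predictOnce() {
        try (NativeHandle input = new NativeHandle();
             NativeHandle output = new NativeHandle()) {
            return NativeHandle.OPEN.get();                       // handles live during the call
        }
    }

    public static void main(String[] args) {
        int during = predictOnce();
        System.out.println("open during=" + during + " after=" + NativeHandle.OPEN.get());
    }
}
```

Applied to a prediction endpoint, the same shape would wrap the input tensors and the results of each run in one try block per request, with the long-lived session created once outside it; that last part is an assumption about typical usage rather than something confirmed here.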