FFI client can spawn tasks faster than they can be executed leading to memory build-up #135

Closed
xlukem opened this issue May 8, 2024 · 19 comments

Comments

xlukem commented May 8, 2024

Hey guys!
We are actively using the C++ bindings with your library and are very happy with it.

When we recently expanded our software to communicate with many devices at once, we noticed a significant steady increase in memory usage until the software crashes.

I ran our binary with the valgrind tool and got this as part of the result:

==31== 219,121,920 bytes in 285,315 blocks are still reachable in loss record 1,022 of 1,022
==31==    at 0x488A324: memalign (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==31==    by 0x488A497: posix_memalign (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==31==    by 0x4AC2D93: aligned_malloc (alloc.rs:98)
==31==    by 0x4AC2D93: alloc (alloc.rs:22)
==31==    by 0x4AC2D93: __rdl_alloc (alloc.rs:381)
==31==    by 0x494182F: tokio::runtime::task::raw::RawTask::new (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x4963827: tokio::runtime::scheduler::multi_thread::handle::Handle::bind_new_task (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x495001B: tokio::runtime::handle::Handle::spawn_named (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x495ACBB: rodbus_client_channel_read_holding_registers (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x2BEA77: rodbus::fn::client_channel_read_holding_registers(rodbus::ClientChannel&, rodbus::RequestParam const&, rodbus::AddressRange const&, std::unique_ptr<rodbus::RegisterReadCallback, std::default_delete<rodbus::RegisterReadCallback> >) (rodbus.cpp:1450)
==31==    by 0x2C267B: rodbus::ClientChannel::read_holding_registers(rodbus::RequestParam const&, rodbus::AddressRange const&, std::unique_ptr<rodbus::RegisterReadCallback, std::default_delete<rodbus::RegisterReadCallback> >) (rodbus.cpp:2562)
==31==    by 0x211073: modbus::modbusMaster::executeJob(int, modbus::unprotectedMbJob, std::vector<unsigned short, std::allocator<unsigned short> >&, modbus::itfMbJobActions*) (modbusMaster.hpp:111)
==31==    by 0x211E27: gateway::gateway::modbusJobRun() (gatewayBase.hpp:162)
==31==    by 0x211777: gateway::gateway::run() (gatewayBase.hpp:81)
==31==
==31== LEAK SUMMARY:
==31==    definitely lost: 0 bytes in 0 blocks
==31==    indirectly lost: 0 bytes in 0 blocks
==31==      possibly lost: 19,816 bytes in 176 blocks
==31==    still reachable: 537,989,313 bytes in 1,994,176 blocks
==31==         suppressed: 0 bytes in 0 blocks

A significant amount of memory is still occupied by the rodbus library although it is not needed anymore.
To me it looks like the tokio runtime tasks never get deleted.

At the time of testing, we sent about 5000 Modbus requests per second.
We also pass some arguments into the ModbusCallbacks. As those are just pointers or ints, this shouldn't be the cause of the problem. Other than that, I don't see anything we are doing differently from the examples.
Our timeout is set to 1 second and max_queued_requests is set to 10.

Is this an issue on our end, or does the rodbus library need to be adjusted?

Thank you!
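
To make the reported mechanism concrete, here is a minimal, self-contained tokio sketch; this is an illustration, not rodbus code, and all names in it are invented. A single shared "connection" can only complete work serially while new tasks are spawned at roughly the reported 5000 requests per second, so live tasks accumulate in the same way as the "still reachable" allocations in the valgrind trace above.

```rust
// Illustration only (not rodbus): spawn tasks faster than a single serial
// resource can complete them and watch the backlog of live tasks grow.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    // Stand-in for the one TCP connection behind a client channel.
    let connection = Arc::new(Mutex::new(()));

    // ~5000 "requests" per second, mirroring the reported load.
    let mut interval = tokio::time::interval(Duration::from_micros(200));

    for i in 0..50_000u32 {
        interval.tick().await;
        let connection = connection.clone();
        // Each call allocates a new runtime task (RawTask::new in the trace).
        tokio::spawn(async move {
            // Requests are serialized on the connection, so they complete far
            // more slowly than they are submitted; the rest sit in memory.
            let _guard = connection.lock().await;
            tokio::time::sleep(Duration::from_millis(5)).await;
        });
        if i % 5_000 == 0 {
            println!("submitted {i} requests");
        }
    }
}
```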

xlukem commented May 8, 2024

We already found that when a Modbus command function (like read_input_registers()) is called on the client while it is still in the connecting state, the Modbus callback doesn't get deleted.

Also, in our load test we have seen that thousands of requests and callbacks are created without being deleted, even when we make sure the clients are in the connected state.
Do you have any more information about the max_queued_requests parameter? Why aren't our Modbus requests failing, as we should be exceeding this limit pretty quickly?

jadamcrain commented May 8, 2024

@xlukem Thanks for this report. I'll have a look and see what I can come up with.

jadamcrain commented:

Can you confirm that you're using the latest release (1.3.1) and also tell me what OS you're using?

xlukem commented May 8, 2024

Yes, I can confirm that we are using the 1.3.1 release in a Debian-based Docker container on the Torizon OS platform. We are using the aarch64-unknown-linux-gnu version of the library.

jadamcrain commented:

Good. Should be easy for me to replicate and figure out.

jadamcrain commented:

> We already found that when a Modbus command function (like read_input_registers()) is called on the client while it is still in the connecting state, the Modbus callback doesn't get deleted.

Inspecting the code, I do see that when connecting, all queue operations are deferred until after connect succeeds or fails.

Regarding that statement, do you mean "the callback doesn't get called right away", or do you literally mean that the unique_ptr is never deleted?

I am going to do a release that makes requests fail while the connect operation is in progress, to see how this changes things.

xlukem commented May 10, 2024

Yes, I have now seen that this part of the queue does work properly. The unique_ptr does get deleted once the connection fails or succeeds.

I think our problem might be limited to the scenario under load. I would assume that more Modbus commands get created than our device is able to send. It doesn't look like any changes to the max_queued_requests parameter make a difference. Is the on_failure callback supposed to be called when the queue exceeds this limit?

jadamcrain commented May 10, 2024

It's a bit complicated. There are actually two queues:

1. An async, thread-safe queue that marshals requests from the user thread to the thread pool.
2. A local queue inside the client task. The client task pops requests from queue 1 and moves them into this local queue, which is the one that respects max_queued_requests; if it is at capacity, the request is failed with on_failure.

That said, I think the issue is that we're spawning requests faster than the tasks are being failed. I believe this is because the client task currently does not service queue 1 during a connect operation.

May I ask what you have the connection retry delays set to?
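
A minimal sketch of the two-queue arrangement described above, assuming simplified Request/Outcome types; this illustrates the flow, it is not the rodbus source. An unbounded async channel carries requests from the caller to the client task, which keeps a local queue bounded by max_queued_requests and immediately fails anything beyond that capacity (the on_failure path).

```rust
// Sketch of the described two-queue flow; not the actual rodbus implementation.
use std::collections::VecDeque;
use tokio::sync::{mpsc, oneshot};

#[derive(Debug)]
enum Outcome {
    Queued,          // accepted into the local queue
    TooManyRequests, // rejected, surfaced to the user via on_failure
}

struct Request {
    id: u32,
    reply: oneshot::Sender<Outcome>,
}

// Queue 1 feeds this task; `local` is queue 2, bounded by max_queued_requests.
async fn client_task(mut rx: mpsc::UnboundedReceiver<Request>, max_queued_requests: usize) {
    let mut local: VecDeque<u32> = VecDeque::new();
    while let Some(req) = rx.recv().await {
        if local.len() >= max_queued_requests {
            // Local queue at capacity: fail the request right away.
            let _ = req.reply.send(Outcome::TooManyRequests);
        } else {
            // A real client would hold the request here until the Modbus
            // exchange completes; this sketch never drains the queue so the
            // capacity check is easy to observe.
            local.push_back(req.id);
            let _ = req.reply.send(Outcome::Queued);
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::unbounded_channel();
    tokio::spawn(client_task(rx, 10)); // max_queued_requests = 10

    // Submit more requests than max_queued_requests: the first 10 are queued,
    // the remaining 5 come back as TooManyRequests.
    for id in 0..15 {
        let (reply_tx, reply_rx) = oneshot::channel();
        let _ = tx.send(Request { id, reply: reply_tx });
        if let Ok(outcome) = reply_rx.await {
            println!("request {id}: {outcome:?}");
        }
    }
}
```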

xlukem commented May 10, 2024

> That said, I think the issue is that we're spawning requests faster than the tasks are being failed.

Yeah, that might be it. In the worst case, we will just have to make sure not to overload the queue.

I use the standard values for the RetryStrategy struct.

RetryStrategy.min_delay : 1000ms
RetryStrategy.max_delay : 10000ms

I once tried setting them both to 100ms with no notable difference.
In the high load scenario, no client enters the connect retry anyway.

jadamcrain commented:

Do you:

A) Schedule your polls on some timer in the main C++ thread

OR

B) Schedule the next poll in the callback from the previous poll?

I guess if the network is slow, the scheduling would pile up with the way the bindings currently work. I believe I'd like to change this so that the async queue also respects the max_queued_requests parameter and immediately fails a request if that queue is at capacity. That, combined with failing requests while in the connecting state, leads to much better behavior IMO and guarantees the library always runs within a bounded amount of memory.
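
What that change might look like, sketched with tokio's bounded mpsc channel: size the async queue at max_queued_requests and submit with the non-blocking try_send, so a caller that outruns the client task gets an immediate failure instead of piling requests up in memory. This is an illustration under simplified types, not the actual rodbus patch.

```rust
// Sketch: bound the async queue itself and fail overflow immediately.
use tokio::sync::mpsc::{self, error::TrySendError};

#[derive(Debug)]
struct Request(u32);

// Returns the request back to the caller on failure, where the bindings would
// invoke on_failure with a too_many_requests-style error.
fn submit(tx: &mpsc::Sender<Request>, req: Request) -> Result<(), Request> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(req)) => Err(req),   // queue at capacity
        Err(TrySendError::Closed(req)) => Err(req), // client task shut down
    }
}

#[tokio::main]
async fn main() {
    let max_queued_requests = 10;
    let (tx, mut rx) = mpsc::channel::<Request>(max_queued_requests);

    // A burst from the caller while the client task is busy: only the first
    // max_queued_requests are accepted, the rest fail right away.
    let mut failed = 0;
    for i in 0..100 {
        if submit(&tx, Request(i)).is_err() {
            failed += 1;
        }
    }
    println!("immediately failed: {failed}");

    // Drain what was accepted, as the client task eventually would.
    drop(tx);
    while let Some(req) = rx.recv().await {
        println!("executing {req:?}");
    }
}
```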

jadamcrain commented:

I think this also leads to an API that I'd like to add... one in which the library schedules the polls at a fixed frequency on your behalf whenever you are connected.

jadamcrain changed the title from "memory issues with the c++ bindings" to "FFI client can spawn tasks faster than they can be executed leading to memory build-up" on May 13, 2024

jadamcrain commented May 15, 2024

@xlukem Can you try out this milestone release?

https://github.com/stepfunc/rodbus/releases/tag/1.4.0-M1

All of the methods on the client will now fail instantly with the error too_many_requests if max_queued_requests requests are already queued up. This change ensures the library never uses more than a fixed upper bound of memory.

xlukem commented May 15, 2024

> A) Schedule your polls on some timer in the main C++ thread

Yes, that's what we did. That's why the queue piled up pretty quickly.

We have already tried out the newest release and it works like a charm. Thank you very much!
Will requests now also fail when the client is still in the connecting state?

jadamcrain commented May 15, 2024

Yes. Requests are now immediately failed when the client is connecting.

I will adjust the release notes to explain this behavior change as well.

jadamcrain commented:

Fixed in #136, which will be incorporated into the 1.4.0 release.

xlukem commented May 22, 2024

Great work! We really appreciate your support for the library.

We encountered one last thing we would like to discuss. With the current implementation, we never know how full the queue is, yet we want to make sure that certain messages are guaranteed to be executed at a certain point in time. Being able to look up the queue length/status would enable us to fill the queue dynamically and always keep room for new messages without provoking errors.

Is this a feature that could fit into the library?

jadamcrain commented:

I don't believe so. Another option would be to make the operations blocking:

https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.blocking_send

The problem with that is you could never call those methods from a callback, because you can't call them from an async context.

I still think that the best thing to do in a following release is to have an API where you let the library schedule the polls on your behalf.
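
For background on why the blocking option doesn't compose with callbacks: tokio's blocking_send is intended for synchronous code and panics when invoked from within an async execution context, which is where the FFI callbacks run; the non-blocking alternative is try_send. A small generic sketch, not tied to the rodbus API:

```rust
// Sketch of the constraint on blocking_send; names here are generic examples.
use tokio::sync::mpsc;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let (tx, mut rx) = mpsc::channel::<u32>(1);

    // Fine: a plain application thread may block until the queue has room.
    tx.blocking_send(1).unwrap();

    let tx2 = tx.clone();
    rt.block_on(async move {
        // Not fine: on a runtime thread (where callbacks are invoked),
        // blocking_send panics because it would block the executor.
        // tx2.blocking_send(2).unwrap();

        // The non-blocking alternative fails fast when the queue is full.
        let _ = tx2.try_send(2);

        println!("received: {:?}", rx.recv().await);
    });
}
```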

jadamcrain commented:

@xlukem FYI these changes are now in the 1.4.0 stable release.
