FFI client can spawn tasks faster than they can be executed leading to memory build-up #135

Closed
xlukem opened this issue May 8, 2024 · 19 comments

Comments

xlukem commented May 8, 2024

Hey guys!
We are actively using the C++ bindings with your library and are very happy with it.

When we recently expanded our software to communicate with many devices at once, we noticed a significant steady increase in memory usage until the software crashes.

I ran our binary with the valgrind tool and got this as part of the result:

==31== 219,121,920 bytes in 285,315 blocks are still reachable in loss record 1,022 of 1,022
==31==    at 0x488A324: memalign (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==31==    by 0x488A497: posix_memalign (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==31==    by 0x4AC2D93: aligned_malloc (alloc.rs:98)
==31==    by 0x4AC2D93: alloc (alloc.rs:22)
==31==    by 0x4AC2D93: __rdl_alloc (alloc.rs:381)
==31==    by 0x494182F: tokio::runtime::task::raw::RawTask::new (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x4963827: tokio::runtime::scheduler::multi_thread::handle::Handle::bind_new_task (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x495001B: tokio::runtime::handle::Handle::spawn_named (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x495ACBB: rodbus_client_channel_read_holding_registers (in /usr/lib/rodbus/aarch64-unknown-linux-gnu/librodbus_ffi.so)
==31==    by 0x2BEA77: rodbus::fn::client_channel_read_holding_registers(rodbus::ClientChannel&, rodbus::RequestParam const&, rodbus::AddressRange const&, std::unique_ptr<rodbus::RegisterReadCallback, std::default_delete<rodbus::RegisterReadCallback> >) (rodbus.cpp:1450)
==31==    by 0x2C267B: rodbus::ClientChannel::read_holding_registers(rodbus::RequestParam const&, rodbus::AddressRange const&, std::unique_ptr<rodbus::RegisterReadCallback, std::default_delete<rodbus::RegisterReadCallback> >) (rodbus.cpp:2562)
==31==    by 0x211073: modbus::modbusMaster::executeJob(int, modbus::unprotectedMbJob, std::vector<unsigned short, std::allocator<unsigned short> >&, modbus::itfMbJobActions*) (modbusMaster.hpp:111)
==31==    by 0x211E27: gateway::gateway::modbusJobRun() (gatewayBase.hpp:162)
==31==    by 0x211777: gateway::gateway::run() (gatewayBase.hpp:81)
==31==
==31== LEAK SUMMARY:
==31==    definitely lost: 0 bytes in 0 blocks
==31==    indirectly lost: 0 bytes in 0 blocks
==31==      possibly lost: 19,816 bytes in 176 blocks
==31==    still reachable: 537,989,313 bytes in 1,994,176 blocks
==31==         suppressed: 0 bytes in 0 blocks

A significant amount of memory is still occupied by the rodbus library although it is not needed anymore.
To me it looks like the tokio runtime tasks never get deleted.

At the time of testing, we sent about 5000 Modbus requests per second.
We also pass some arguments into the ModbusCallbacks. As those are just pointers or ints, this shouldn't be the cause of the problem. Other than that, I don't see anything we are doing differently from the examples.
Our timeout is set to 1 second and max_queued_requests is set to 10.

Is this an issue on our end, or does the rodbus library need to be adjusted?

Thank you!
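
To make the reported mechanism concrete, here is a minimal, self-contained tokio sketch; this is an illustration, not rodbus code, and all names in it are invented. A single shared "connection" can only complete work serially while new tasks are spawned at roughly the reported 5000 requests per second, so live tasks accumulate in the same way as the "still reachable" allocations in the valgrind trace above.

```rust
// Illustration only (not rodbus): spawn tasks faster than a single serial
// resource can complete them and watch the backlog of live tasks grow.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    // Stand-in for the one TCP connection behind a client channel.
    let connection = Arc::new(Mutex::new(()));

    // ~5000 "requests" per second, mirroring the reported load.
    let mut interval = tokio::time::interval(Duration::from_micros(200));

    for i in 0..50_000u32 {
        interval.tick().await;
        let connection = connection.clone();
        // Each call allocates a new runtime task (RawTask::new in the trace).
        tokio::spawn(async move {
            // Requests are serialized on the connection, so they complete far
            // more slowly than they are submitted; the rest sit in memory.
            let _guard = connection.lock().await;
            tokio::time::sleep(Duration::from_millis(5)).await;
        });
        if i % 5_000 == 0 {
            println!("submitted {i} requests");
        }
    }
}
```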

xlukem commented May 8, 2024

We already found that when a Modbus command function (like read_input_registers()) is called on the client while it is still in the connecting state, the Modbus callback doesn't get deleted.

Also, in our load test we have seen that thousands of requests and callbacks are created without being deleted, even when we make sure the clients are in the connected state.
Do you have any more information about the max_queued_requests parameter? Why aren't our Modbus requests failing, as we should be exceeding this limit pretty quickly?

jadamcrain commented May 8, 2024

@xlukem Thanks for this report. I'll have a look and see what I can come up with.

jadamcrain commented:

Can you confirm that you're using the latest release (1.3.1) and also tell me what OS you're using?

xlukem commented May 8, 2024

Yes, I can confirm that we are using the 1.3.1 release in a Debian-based Docker container on the Torizon OS platform. We are using the aarch64-unknown-linux-gnu version of the library.

jadamcrain commented:

Good. Should be easy for me to replicate and figure out.

jadamcrain commented:

> We already found that when a Modbus command function (like read_input_registers()) is called on the client while it is still in the connecting state, the Modbus callback doesn't get deleted.

Inspecting the code, I do see that when connecting, all queue operations are deferred until after connect succeeds or fails.

Regarding that statement, do you mean "the callback doesn't get called right away", or do you literally mean that the unique_ptr is never deleted?

I am going to do a release that makes requests fail while the connect operation is in progress, to see how this changes things.

xlukem commented May 10, 2024

Yes, I have now seen that this part of the queue does work properly. The unique_ptr does get deleted once the connection fails or succeeds.

I think our problem might be limited to the scenario under load. I would assume that more Modbus commands get created than our device is able to send. It doesn't look like any changes to the max_queued_requests parameter make a difference. Is the on_failure callback supposed to be called when the queue exceeds this limit?

jadamcrain commented May 10, 2024

It's a bit complicated. There are actually two queues:

1. An async, thread-safe queue that marshals requests from the user thread to the thread pool.
2. A local queue inside the client task. The client task pops requests from queue 1 and moves them into this local queue, which is the one that respects max_queued_requests; if it is at capacity, the request is failed with on_failure.

That said, I think the issue is that we're spawning requests faster than the tasks are being failed. I believe this is because the client task currently does not service queue 1 during a connect operation.

May I ask what you have the connection retry delays set to?
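
A minimal sketch of the two-queue arrangement described above, assuming simplified Request/Outcome types; this illustrates the flow, it is not the rodbus source. An unbounded async channel carries requests from the caller to the client task, which keeps a local queue bounded by max_queued_requests and immediately fails anything beyond that capacity (the on_failure path).

```rust
// Sketch of the described two-queue flow; not the actual rodbus implementation.
use std::collections::VecDeque;
use tokio::sync::{mpsc, oneshot};

#[derive(Debug)]
enum Outcome {
    Queued,          // accepted into the local queue
    TooManyRequests, // rejected, surfaced to the user via on_failure
}

struct Request {
    id: u32,
    reply: oneshot::Sender<Outcome>,
}

// Queue 1 feeds this task; `local` is queue 2, bounded by max_queued_requests.
async fn client_task(mut rx: mpsc::UnboundedReceiver<Request>, max_queued_requests: usize) {
    let mut local: VecDeque<u32> = VecDeque::new();
    while let Some(req) = rx.recv().await {
        if local.len() >= max_queued_requests {
            // Local queue at capacity: fail the request right away.
            let _ = req.reply.send(Outcome::TooManyRequests);
        } else {
            // A real client would hold the request here until the Modbus
            // exchange completes; this sketch never drains the queue so the
            // capacity check is easy to observe.
            local.push_back(req.id);
            let _ = req.reply.send(Outcome::Queued);
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::unbounded_channel();
    tokio::spawn(client_task(rx, 10)); // max_queued_requests = 10

    // Submit more requests than max_queued_requests: the first 10 are queued,
    // the remaining 5 come back as TooManyRequests.
    for id in 0..15 {
        let (reply_tx, reply_rx) = oneshot::channel();
        let _ = tx.send(Request { id, reply: reply_tx });
        if let Ok(outcome) = reply_rx.await {
            println!("request {id}: {outcome:?}");
        }
    }
}
```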

xlukem commented May 10, 2024

> That said, I think the issue is that we're spawning requests faster than the tasks are being failed.

Yeah, that might be it. In the worst case, we will just have to make sure not to overload the queue.

I use the standard values for the RetryStrategy struct.

RetryStrategy.min_delay : 1000ms
RetryStrategy.max_delay : 10000ms

I once tried setting them both to 100ms with no notable difference.
In the high load scenario, no client enters the connect retry anyway.

jadamcrain commented:

Do you:

A) Schedule your polls on some timer in the main C++ thread

OR

B) Schedule the next poll in the callback from the previous poll?

I guess if the network is slow, the scheduling would pile up with the way the bindings currently work. I believe I'd like to change this so that the async queue also respects the max_queued_requests parameter and immediately fails a request if that queue is at capacity. That, combined with failing requests while in the connecting state, leads to much better behavior IMO and guarantees the library always runs within a bounded amount of memory.
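
What that change might look like, sketched with tokio's bounded mpsc channel: size the async queue at max_queued_requests and submit with the non-blocking try_send, so a caller that outruns the client task gets an immediate failure instead of piling requests up in memory. This is an illustration under simplified types, not the actual rodbus patch.

```rust
// Sketch: bound the async queue itself and fail overflow immediately.
use tokio::sync::mpsc::{self, error::TrySendError};

#[derive(Debug)]
struct Request(u32);

// Returns the request back to the caller on failure, where the bindings would
// invoke on_failure with a too_many_requests-style error.
fn submit(tx: &mpsc::Sender<Request>, req: Request) -> Result<(), Request> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(req)) => Err(req),   // queue at capacity
        Err(TrySendError::Closed(req)) => Err(req), // client task shut down
    }
}

#[tokio::main]
async fn main() {
    let max_queued_requests = 10;
    let (tx, mut rx) = mpsc::channel::<Request>(max_queued_requests);

    // A burst from the caller while the client task is busy: only the first
    // max_queued_requests are accepted, the rest fail right away.
    let mut failed = 0;
    for i in 0..100 {
        if submit(&tx, Request(i)).is_err() {
            failed += 1;
        }
    }
    println!("immediately failed: {failed}");

    // Drain what was accepted, as the client task eventually would.
    drop(tx);
    while let Some(req) = rx.recv().await {
        println!("executing {req:?}");
    }
}
```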

jadamcrain commented:

I think this also leads to an API that I'd like to add... one in which the library schedules the polls at a fixed frequency on your behalf whenever you are connected.

jadamcrain changed the title from "memory issues with the c++ bindings" to "FFI client can spawn tasks faster than they can be executed leading to memory build-up" on May 13, 2024

jadamcrain commented May 15, 2024

@xlukem Can you try out this milestone release?

https://github.com/stepfunc/rodbus/releases/tag/1.4.0-M1

All of the methods on the client will now fail instantly with the error too_many_requests if max_queued_requests requests are already queued up. This change ensures the library never uses more than a fixed upper bound of memory.

xlukem commented May 15, 2024

> A) Schedule your polls on some timer in the main C++ thread

Yes, that's what we did. That's why the queue piled up pretty quickly.

We have already tried out the newest release and it works like a charm. Thank you very much!
Will requests now also fail when the client is still in the connecting state?

jadamcrain commented May 15, 2024

Yes. Requests are now immediately failed when the client is connecting.

I will adjust the release notes to explain this behavior change as well.

jadamcrain commented:

Fixed in #136, which will be incorporated into the 1.4.0 release.

xlukem commented May 22, 2024

Great work! We really appreciate your support for the library.

We encountered one last thing we would like to discuss. With the current implementation, we never know how full the queue is, yet we want to make sure that certain messages are guaranteed to be executed at a certain point in time. Being able to look up the queue length/status would enable us to fill the queue dynamically and always keep room for new messages without provoking errors.

Is this a feature that could fit into the library?

jadamcrain commented:

I don't believe so. Another option would be to make the operations blocking:

https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.blocking_send

The problem with that is you could never call those methods from a callback, because you can't call them from an async context.

I still think that the best thing to do in a following release is to have an API where you let the library schedule the polls on your behalf.
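
For background on why the blocking option doesn't compose with callbacks: tokio's blocking_send is intended for synchronous code and panics when invoked from within an async execution context, which is where the FFI callbacks run; the non-blocking alternative is try_send. A small generic sketch, not tied to the rodbus API:

```rust
// Sketch of the constraint on blocking_send; names here are generic examples.
use tokio::sync::mpsc;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let (tx, mut rx) = mpsc::channel::<u32>(1);

    // Fine: a plain application thread may block until the queue has room.
    tx.blocking_send(1).unwrap();

    let tx2 = tx.clone();
    rt.block_on(async move {
        // Not fine: on a runtime thread (where callbacks are invoked),
        // blocking_send panics because it would block the executor.
        // tx2.blocking_send(2).unwrap();

        // The non-blocking alternative fails fast when the queue is full.
        let _ = tx2.try_send(2);

        println!("received: {:?}", rx.recv().await);
    });
}
```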

jadamcrain commented:

@xlukem FYI these changes are now in the 1.4.0 stable release.
