-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Long write operations does not seem to work correctly #135
Comments
Hello @maoueh |
@DariuszAniszewski the delay is not used in "production", it's more when I have to debug stuff with the actual device. Here the core of the packet writer I usually use:
In production code, the This is being tested on a LG G5 Android 7.0 phone. The receiving peripheral is backed by a Nordic nRF51822. I just tried adding a big delay (250ms and also 1s) just before each long operation:
The operation goes farther, I'm able to write about between 6 and 14 lines but at some point, operation hangs and transfer stops hitting the timeout exception. I know the transfer is happening since there is a LED blinking on the peripheral when transfer is in progress. Just not sure why the operation hangs like this. |
It may be possible that that you are experiencing a race condition that happens inside the library. I have seen it though I could not find the cause. If you could enable As a temporary workaround you could delay starting the next long write for about 10 milliseconds after the previous completed. |
I activated I tried adding a delay (250ms and 1s), but this changed nothing. In fact, like mentioned, I make the whole process go a bit farther (between 4 an 14 lines) but it still hangs. Do you want me to add verbose statement somewhere in RxAndroidBLE code? |
Nevermind, just understood that this is probably exactly what you wanted to check! Here the connect part of the log:
|
Uh, excuse me. You have already had the log. My mistake. If you could add a log in the
to something like
|
I added the log and tried other kind of operation. The problem seems to happen way more frequently when a bunch of characteristic changed event arrives after the long write operation has started. Is it possible they are interfering with the way the library check if the characteristic was written correctly? To add more context, here an overview of how of interact with the peripheral. We write some data to the peripheral and it responds through some notification. Here a different operation than flash that request some data from the device:
Note here that the data is written with success (
|
I added a log statement in
When it's not working, even though the Here a log when not working:
And one when it works:
|
Yup. It seems that this is the exact problem with a race condition. The |
In fact, now that you talked about the observer not being notified, I think I might have hit a similar (same?) problem when I first did the "flashing" implementation. At first, what I tried to do was to "subscribe" multiple time to My idea was that this would fill the queue and the "consumer" thread (the one in At first I thought it was a problem with threading on my part and went with writing next chunk only previous one did answer. But now that we talked about operation not being notified, I think that what happened at that time was the exact same problem as now. The notification was not making it's way to the observer (at that time the Feel free to ping me as I can easily reproduce the problem. I can try solution or give you some detailed log when working and when not working. Thanks for looking into this. |
I think one of the problem (I say one, see below) is that
And it did help since I now see all the
I see the |
This issue with no callback from the Android BLE Stack I have heard of - that is why all operations which are expecting a callback have some sort of a timeout. For this you could add a delay between writes and check if that will help. |
I don't think it comes from the Android BLE stack. I think it's an event ordering problem. Let's check the last snippet of log more closely (lightly modified to make fit easier in the screen):
Here, we see the last acknowledgement. Then, before So, somehow, before the Clearly some threading issue, not quite sure what is causing this however. |
Could you confirm by printing out the number of the batch which the log is referring to? Neverthless could you add a simple counter to be sure about which log is for which batch of the transmission? |
Here one log with batch sequence number. The The counter was added to
In last batch (i.e I will try the |
Well in fact, there is no |
I have created one - straight forward implementation. The potential fixes are now in the process of code review. I hope to merge them into the master before the end of the day (shortly) so you could test it with a snapshot. |
…hGatt.writeCharacteristic()` in `RxBleRadioOperationCharacteristicLongWrite`. #135 Reviewers: michal.zielinski, pawel.urban Reviewed By: pawel.urban Differential Revision: https://phabricator.polidea.com/D2181
… a `SerializedSubject`. #135 Reviewers: michal.zielinski, pawel.urban Reviewed By: pawel.urban Differential Revision: https://phabricator.polidea.com/D2180
@DariuszAniszewski I checked the subject change and I think you pick the wrong one. The one I think being called by multiple various thread is more likely the bluetoothGattBehaviorSubject which is used in all Gatt callbacks. Overall, I think all subject in this class should be reviewed as to where they are called by various different thread or not. If it's the case, they would need to be |
Most of the subjects used in the A new snapshot should be available. Could you try it out? |
I agree, but as you know, they are not thread-safe. Not having rapid changes does not remove the chance that a race condition or dead clock happens in the subject, less likely if there are not much changes but quite possible. To my eye, the That's why I think this one ( I will test your branch and report. |
Nop, it did not solve the issue (I changed a bit log statement since now we pinned point the problem):
|
So it seems that we should somehow synchronise the subscription to the |
Yeah I know, I thought about using some replay or caching logic, so that the "notification" is held until it can be used, but they would probably prove problematic in some cases. To work, we would need to cache event happening between just after the subscribe until they are processed ensuring no caching occurs outside the boundaries to avoid false positive. I tried a bit yesterday night but was not able to do so. I would need to dive deeper to understand at which moment stuff are being done. |
I am not in the office anymore. What if you could synchronise this method on
and in the |
The |
That would probably work indeed, will test that later today and report. |
It seemed it help as through various invocations, I never hit the event being out of order and execution goes farther (batch 109 was my best). However, it still hang at some point, seems we are chasing multiple problems here :) Log:
For batch 83, even if the I will need to dive deeper into Rx scheduler to understand what happen exactly. If you have ideas or want to try stuff, will be happy to check them out. |
I think it may still be the same issue. Adding the sychronisation only made the window smaller than before. Ehhh... As a friend of mine said: |
An idea just popped inside my head when already in bed:
to something like (exchange lambdas for anonymous classes ofc and pardon pseudocode):
This could help in always subscribing to the notification before writing the batch of data. |
…ng called even before `BluetoothGatt.writeCharacteristic()` returns. Summary: #135 Reviewers: pawel.urban Reviewed By: pawel.urban Differential Revision: https://phabricator.polidea.com/D2184
@maoueh I have pushed a potential fix for the race condition. You could test it once you will have a moment. |
@dariuszseweryn Yup, that fixed the issue, good job! I also tested by removing our "old" fixes like the So, the code as in the repository right now seem to work correctly for my use case. Out of scope, but two questions while I have you. Would you guys accept a PR that would enable the library to execute radio operation out of the main thread? We have a fork where we execute off main thread for most phone (we kept main thread only when manufacturer is Samsung and Android version is 4.3)? Would you guys accept a PR that would enable to execute my own radio operation? The idea here would be to avoid going on & off of the RxRadio queue. The PR could look like:
Of course, this a JavaDoc would explain that usage of the method should be made upon great care. Anyway, all in all, thanks for the great support! |
Glad that one solution finally helped. I will revert all unneeded changes tomorrow. As for the main thread versus background - I am a bit afraid that there may be more devices that are more stable while using the main thread than we know of. In this situation we would need to keep a list of devices that need a specific solution - which could be a burden. About the custom operation - I see no particular reason to do it like this. I was thinking about just exposing the |
I have added some optimisations to the long write operations. You can check it out. Should use a lot less allocations. Best Regards |
@dariuszseweryn Ok good, I check the changes, don't see a reason for them not to work. But I will test the latest branch nonetheless. As for the two questions, my use case is simply to optimize the flash firmware operation. With all optimizations in (scheduler off main thread, single operation in the queue), I was able to go from 140s in the worst case (heavy-lifted screen, balanced priority connection) to only 39s in my last try (high priority connection, scheduler off main thread, one operation in the queue). I will open two issues to keep discussion at the right place for them. I will provide my answer there. |
Summary
I tested the new long write operation commit that has been pushed recently. Tried it in my application and it crashes after the second "long" write operation.
Let me explain a little more my use case. I'm currently sending a rather big file to a BLE device. The file is a firmware in Intel HEX forma. From the defined application exchange protocol, I'm iterating through each line, wrap the line in a application packet format (starts and ends with a predefined byte value), chunk that into multiple packets (20 bytes) and write them one by one to the device by using
writeCharacteristic
. The callback is then used to trigger the next packet once a line has been fully written, I start sending the next one.I wanted to test the new long operation as it greatly reduce amount of operation on the queue since a full line using long write operation would used a single operation on the radio queue.
I tried it with the following code:
Where the
onLineWritten
simply triggers the next line writing operation.Preconditions
Steps to reproduce actual result
Not clear, but here what I think.
Actual result
There is crash happening, here the log part, the characteristic being written to is the one starting with
b4520104
and theb4520107
is some kind of ACK channel that we don't really use.The second line operation is hanging until a timeout occurs.
Expected result
I think this code should work. Since I write the next line only on the
onComplete
callback, there is no reason the next line can not be written correctly and that a timeout occurs.Note that a similar code without using long write operation does work correctly:
Where
mPacketWriter
is a custom chunker that creates all the chunks to write and thenconcatMap
them using themBleConnection.writeCharacteristic
observable as the "synchronization" point.The text was updated successfully, but these errors were encountered: