BLE Throughput #15171

Closed
hwagner66 opened this issue Apr 3, 2019 · 17 comments

@hwagner66

After implementing a BLE throughput test on nRF52840 hardware (BL654-USB) with Zephyr 1.14-rc1, I noticed some curious behavior in the data we are collecting. When sending data from the central (client) to the peripherals (servers), throughput is relatively balanced per connection, although one peripheral gets noticeably less data and one gets more. In the other direction, from the peripherals to the central, it is much more unbalanced: one or two peripherals get all the throughput and the others get none.

The central-to-peripheral direction is 1:N, with the central as a single point source. The peripheral-to-central direction is N:1, with many sources. In my case N ranges from 1 to 5, but I would like it to be greater. The overall aggregate throughput is constant; it is just not evenly distributed among the BLE devices or communication directions. I am using DLE and the 2M PHY.

Any thoughts as to why this behavior occurs, and how to diagnose and resolve it so that throughput is consistent and balanced among the BLE devices? Do I have something misconfigured in my devices? I can share my configuration if that will help.

@hwagner66
Author

In this test, all of the devices are Laird BL654-USB devices built around the nRF52840 and an FTDI UART-USB chip. The devices are all plugged into a USB hub for testing.
One device operates as the BLE central and is a client (call it dev0). The remaining 5 devices are BLE peripherals and act as servers (dev1-5). I use Zephyr 1.14-rc1 with the shell and BLE enabled. The shell implements commands via UART0 and a USB virtual COM port to a PuTTY window or Python script. Communication operates at 115.2 kbaud.

Currently the central has an address list of peripherals, and connections are initiated for each peripheral via shell commands to the central. After connecting, a larger MTU (247) is negotiated. The peripheral devices also have shell commands implemented, so a predefined value can be written to a pair of characteristics to exchange data between devices. (The intent of this device is a serial bus replacement.)
The exchange runs in both directions simultaneously for 1:N devices.

I tested throughput with connection intervals of 50 ms, 100 ms and 250 ms, without significant changes in behavior. The peripheral connection parameters were set to min and max connection intervals of 40 (50 ms), 80 (100 ms) or 200 (250 ms), a slave latency of 0 and a supervision timeout of 40 (400 ms). All devices had the same values for a given test. In the central, characteristics are written using bt_gatt_write_without_response_cb. In the peripheral, bt_gatt_notify is used to write data to the characteristic for the central to consume.
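The raw values above are in Link Layer units: connection intervals count 1.25 ms steps and the supervision timeout counts 10 ms steps. The following is a minimal sketch of those conversions, not code from the test application; all helper names are hypothetical. It also includes the Bluetooth Core Specification rule that the supervision timeout must be larger than (1 + latency) * interval * 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Connection interval is expressed in units of 1.25 ms. */
static inline uint32_t conn_interval_units_to_us(uint16_t units)
{
    return (uint32_t)units * 1250U;
}

/* Supervision timeout is expressed in units of 10 ms. */
static inline uint32_t supervision_timeout_units_to_ms(uint16_t units)
{
    return (uint32_t)units * 10U;
}

/* Hypothetical helper: whole milliseconds to 1.25 ms units (ms * 4 / 5). */
static inline uint16_t conn_interval_ms_to_units(uint32_t ms)
{
    /* Only exact multiples of 1.25 ms can be represented. */
    assert(((ms * 4U) % 5U) == 0U);
    return (uint16_t)((ms * 4U) / 5U);
}

/*
 * Spec rule: timeout > (1 + latency) * interval * 2.
 * In raw units: timeout * 10 ms > (1 + latency) * interval * 2.5 ms,
 * which simplifies to timeout * 4 > (1 + latency) * interval.
 */
static inline bool conn_timeout_ok(uint16_t interval_max, uint16_t latency,
                                   uint16_t timeout)
{
    return ((uint32_t)timeout * 4U) >
           ((1U + (uint32_t)latency) * (uint32_t)interval_max);
}
```

By this rule a timeout of 40 (400 ms) is acceptable with the 50 ms and 100 ms intervals, but would not satisfy it with a 250 ms interval, which needs more than 500 ms. The post does not say whether a different timeout was used for that run.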

Attached is data from several different tests that were run. The red highlights mark differences from Test 1 (baseline).
Throughput data.pdf

@cvinayak
Contributor

cvinayak commented Apr 8, 2019

@hwagner66 The central's scheduler places the slave connections with similar connection intervals in a group next to each other, reserving for each only the minimum air time needed for one default-sized PDU exchange. With 251-byte PDUs and n peripheral connections, there is only room for the central to send a single PDU. The peripheral does not get enough air time before the next peripheral connection is scheduled. Only the last peripheral scheduled in the connection interval gets all the remaining air time, which is why the last connection shows higher throughput (as reflected in your Test 1).

This is an implementation detail, and there is currently no interface to tune the on-air quality of service when the PDU length and PHY are updated.
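To put rough numbers on the air time argument, here is a minimal sketch that computes the on-air duration of an uncoded LE data PDU from the packet format (preamble, 4-byte access address, 2-byte header, payload, optional 4-byte MIC, 3-byte CRC) and of one central/peripheral exchange separated by the 150 us inter-frame space. The function and enum names are hypothetical, and the exact slot size the controller reserves is not stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ble_phy { BLE_PHY_1M, BLE_PHY_2M };

#define BLE_T_IFS_US 150U /* inter-frame space between PDUs */

/* On-air time of one uncoded LE data PDU, in microseconds. */
static inline uint32_t ble_pdu_airtime_us(uint16_t payload_len,
                                          enum ble_phy phy, bool with_mic)
{
    assert(payload_len <= 251U);

    /* Preamble is 1 byte on the 1M PHY and 2 bytes on the 2M PHY. */
    uint32_t preamble = (phy == BLE_PHY_2M) ? 2U : 1U;
    uint32_t bytes = preamble + 4U /* access address */ + 2U /* header */ +
                     payload_len + (with_mic ? 4U : 0U) + 3U /* CRC */;

    /* 8 us per byte at 1 Mbit/s, 4 us per byte at 2 Mbit/s. */
    uint32_t us_per_byte = (phy == BLE_PHY_2M) ? 4U : 8U;

    return bytes * us_per_byte;
}

/* Central PDU, T_IFS, then the peripheral's reply PDU. */
static inline uint32_t ble_exchange_airtime_us(uint16_t central_len,
                                               uint16_t peripheral_len,
                                               enum ble_phy phy, bool with_mic)
{
    return ble_pdu_airtime_us(central_len, phy, with_mic) + BLE_T_IFS_US +
           ble_pdu_airtime_us(peripheral_len, phy, with_mic);
}
```

With these figures, a default 27-byte exchange on the 1M PHY takes 806 us, while a full 251-byte exchange in both directions on the 2M PHY takes 2278 us, so a slot sized for the former cannot hold the latter.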

The required implementation change is not difficult, but I would like to collaborate with you on a conference call so that I can implement HCI connection event values that suit your requirements.

@carlescufi
Member

From @Vudentz in the mailing list:

I wonder if this could be related to a lack of buffers on the host. I assume you are using bt_gatt_notify with a NULL conn, which takes care of notifying each connection, but that may end up consuming all the buffers and blocking. That could explain the imbalance, since the connection that appears first in the list would probably never block.
You could perhaps try increasing the buffer pool with:
CONFIG_BT_L2CAP_TX_BUF_COUNT=5
If that solves it, we may have to consider making the default the number of connections + 1 so that we can actually emit notifications to all connections at once, though that does not guarantee a buffer will always be available if there is more traffic going on.
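The starvation effect described here can be illustrated with a toy model. This is only a sketch under a simplifying assumption (no buffer is returned to the pool during one pass over the connection list); it does not reproduce the host stack, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of one "notify every connection" pass against a fixed pool.
 * Connections are visited in list order and each takes one buffer if any
 * is left. served[i] records whether connection i got a buffer.
 * Returns the number of connections served.
 */
static inline size_t notify_all_once(size_t n_conn, size_t pool_size,
                                     bool served[])
{
    size_t free_bufs = pool_size;
    size_t count = 0;

    for (size_t i = 0; i < n_conn; i++) {
        served[i] = (free_bufs > 0);
        if (served[i]) {
            free_bufs--; /* the real call would block here when empty */
            count++;
        }
    }

    return count;
}
```

In this model, 5 connections sharing 3 buffers always serve the first 3 entries in the list and never the last 2, while a pool of 6 (connections + 1) serves all of them.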

@hwagner66
Author

@cvinayak @carlescufi Can we set up a conference call this week to discuss a potential resolution of this issue? I am meeting with the customer on Friday and would like to have a plan in place before then.

@cvinayak
Contributor

cvinayak commented May 2, 2019

@hwagner66 any follow up on this issue?

@hwagner66
Author

hwagner66 commented May 3, 2019 via email

@hwagner66
Author

After making the slot timing change, more testing was performed with various input data sets. All testing has been done with unreliable messaging for maximum throughput and a better understanding of network performance. The input data rate is approximately 19.5 kbps (it averages to one 244-byte payload every 100 ms). I am using DLE and the 2M PHY on Zephyr 1.14. The connection interval is set to 100 ms.
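The 19.5 kbps figure follows directly from the payload size and period, and the 244-byte payload matches the negotiated MTU of 247 minus the 3-byte ATT header (opcode plus handle) carried by notifications and writes without response. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Offered load in bits per second: one payload every period_ms. */
static inline uint32_t offered_load_bps(uint32_t payload_bytes,
                                        uint32_t period_ms)
{
    assert(period_ms > 0U);
    return (payload_bytes * 8U * 1000U) / period_ms;
}

/* Largest notification or write-without-response payload for an ATT MTU. */
static inline uint16_t att_max_payload(uint16_t att_mtu)
{
    assert(att_mtu >= 3U);
    return (uint16_t)(att_mtu - 3U); /* 1-byte opcode + 2-byte handle */
}
```

For example, `offered_load_bps(244, 100)` gives 19520 bits per second and `att_max_payload(247)` gives 244.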
The central device is always configured to support up to 10 simultaneous connections. My question is: why do I see poorer throughput with fewer than 10 connections? Throughput is best at 1, 5 and 10 connections but worse when other numbers of devices are connected for the test. Is this an issue of scheduling the packets within each connection interval, not enough buffers in the central, or something else? I've attached a graph of performance vs. active connections and the prj.conf for the central device.
BLE throughput vs connections.pdf
prj.conf.txt

@cvinayak
Contributor

@hwagner66 Could you let me know the value of CONFIG_BT_CTLR_TX_BUFFERS? Are you able to use any sort of fairness in enqueuing the transfer across n connections?

My guess is that the connections don't get packets enqueued within the 100 ms connection interval.

Could you use, say, 13 tx buffers, and then every 100 ms fairly enqueue 1 packet to each of the n connections? (Do try to measure the latency of the bt_gatt_notify API call to detect any delays due to on-air retransmissions.)
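One way to read "fairly enqueue 1 packet each" is a round-robin pass that hands out at most one packet per connection per interval and rotates which connection gets first pick. The sketch below is a self-contained model of that idea, not the application's code; the struct, the function name and `MAX_CONN` are hypothetical, and the actual send call is only indicated in a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONN 10

struct fair_tx {
    uint32_t pending[MAX_CONN]; /* packets waiting per connection */
    uint32_t sent[MAX_CONN];    /* packets enqueued per connection */
    size_t n_conn;              /* active connections */
    size_t next;                /* connection with first pick next round */
};

/*
 * One round per connection interval: at most one packet per connection
 * and at most free_bufs packets in total. Returns packets enqueued.
 */
static inline size_t fair_tx_round(struct fair_tx *q, size_t free_bufs)
{
    size_t enqueued = 0;

    assert(q->n_conn > 0 && q->n_conn <= MAX_CONN);

    for (size_t k = 0; k < q->n_conn && enqueued < free_bufs; k++) {
        size_t i = (q->next + k) % q->n_conn;

        if (q->pending[i] > 0) {
            /* Real code would send here, e.g. via bt_gatt_notify(). */
            q->pending[i]--;
            q->sent[i]++;
            enqueued++;
        }
    }

    /* Rotate the starting point so no connection is always last. */
    q->next = (q->next + 1) % q->n_conn;

    return enqueued;
}
```

With 13 buffers and up to 10 connections every connection is served each round. When buffers are scarcer than connections, the rotation spreads the shortfall evenly instead of always starving the tail of the list.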

@hwagner66
Author

hwagner66 commented May 24, 2019 via email

@hwagner66
Author

I am retrying the test with 13 tx buffers and using bt_gatt_notify to throttle the enqueuing of messages. I will let you know the results.
Do any other thoughts come to mind?

@cvinayak
Contributor

@hwagner66 I am trying to fix my broken shell app before I start trying to reproduce the behaviors you are seeing. Please bear with the delay.

@hwagner66
Author

hwagner66 commented May 28, 2019 via email

@hwagner66
Author

@cvinayak I modified the central to use CONFIG_BT_CTLR_TX_BUFFERS=13 and a counting semaphore on tx buffer messages. Performance was nearly identical to the prior test, which used CONFIG_BT_CTLR_TX_BUFFERS=19 and no counting semaphore. Both tests used uniform random message sizes and a 100 ms connection interval. See the graph in the attached PDF (yellow is the latest test, blue the prior test).
BLE throughput vs connections 13Tx buf sem.pdf

@hwagner66
Author

@cvinayak Any thoughts or progress on this issue? If there is something more I should look at in my code or configuration, or to check behavior in the stack, please let me know. I need to give an update to my customer today and any progress would be helpful in that regard.

joerchan removed their assignment Jun 12, 2019
@cvinayak
Contributor

@hwagner66 I have made some changes (#17097) in the controller implementation related to connection event length. Is it possible for you to share the test procedure/script so that I can try to reproduce your observations? You could also help me by testing your scenarios at your end.

@hwagner66
Author

hwagner66 commented Jun 28, 2019 via email

cvinayak added a commit to cvinayak/zephyr that referenced this issue Jul 15, 2019
Fix the controller implementation to perform connection
event length reservation based on the completed Data Length
Update and/or PHY Update Procedure.

This fix will prevent states/roles from stepping on each
other's event length. Without it, a connection could hit a
supervision timeout or have stalled data transmissions due to
insufficient reserved air time.

Relates to zephyrproject-rtos#15171.

Signed-off-by: Vinayak Kariappa Chettimada <vich@nordicsemi.no>
carlescufi pushed a commit that referenced this issue Jul 16, 2019
cvinayak added a commit to cvinayak/zephyr that referenced this issue Jul 23, 2019
nashif pushed a commit that referenced this issue Aug 14, 2019
@aescolar
Member

It would seem this is now resolved, please reopen if you disagree.
