Skip to content

Conversation

@m3d
Copy link
Member

@m3d m3d commented Jul 28, 2020

This is a short report of test of zmqrouter on Eduro robot. It works (almost) with two isses:

  • start timeout 1s is insufficient (10s is OK)
  • slots are no longer supported but Eduro depends on it (250ms watchdog on CAN motors).

It moves and resets motors all the time.

RUNNING 1
SWITCH TO OPERATION 2
2020-07-28 18:03:57,396 osgar.zmqrouter ERROR    failed to stop within timeout of 0:00:02, exiting anyway
2020-07-28 18:03:57,396 osgar.zmqrouter ERROR    b'can' queue: 263
2020-07-28 18:03:57,397 osgar.zmqrouter ERROR    b'serial' queue: 38
2020-07-28 18:03:57,398 osgar.zmqrouter ERROR    b'estop_serial' queue: 1
2020-07-28 18:03:57,398 osgar.zmqrouter ERROR    b'eduro' queue: 462
2020-07-28 18:03:57,399 osgar.zmqrouter ERROR    b'lidar' queue: 1
2020-07-28 18:03:57,399 osgar.zmqrouter ERROR    b'camera' queue: 1
2020-07-28 18:03:57,400 osgar.zmqrouter ERROR    b'lora' queue: 2
2020-07-28 18:03:57,400 osgar.zmqrouter ERROR    b'app' queue: 55
2020-07-28 18:03:57,401 osgar.zmqrouter ERROR    b'detector' queue: 1
2020-07-28 18:03:57,401 osgar.zmqrouter ERROR    still running: {b'can'}
2020-07-28 18:03:57,402 osgar.zmqrouter ERROR    can max delay: 0:00:00.286025
2020-07-28 18:03:57,402 osgar.zmqrouter ERROR    serial max delay: 0:00:00.472043
2020-07-28 18:03:57,403 osgar.zmqrouter ERROR    slope_lidar max delay: 0:00:00.145557
2020-07-28 18:03:57,403 osgar.zmqrouter INFO     imu_serial max delay: 0:00:00
2020-07-28 18:03:57,404 osgar.zmqrouter ERROR    imu max delay: 0:00:00.119980
2020-07-28 18:03:57,404 osgar.zmqrouter ERROR    estop max delay: 0:00:01.017559
2020-07-28 18:03:57,405 osgar.zmqrouter ERROR    lora_serial max delay: 0:00:01.365879
2020-07-28 18:03:57,405 osgar.zmqrouter INFO     estop_serial max delay: 0:00:00.093046
2020-07-28 18:03:57,406 osgar.zmqrouter ERROR    eduro max delay: 0:00:00.104448
2020-07-28 18:03:57,406 osgar.zmqrouter ERROR    lidar max delay: 0:00:00.148295
2020-07-28 18:03:57,407 osgar.zmqrouter INFO     camera max delay: 0:00:00
2020-07-28 18:03:57,407 osgar.zmqrouter INFO     lora max delay: 0:00:00.022605
2020-07-28 18:03:57,407 osgar.zmqrouter ERROR    lidar_tcp max delay: 0:00:00.174903
2020-07-28 18:03:57,408 osgar.zmqrouter ERROR    slope_lidar_tcp max delay: 0:00:00.233206
2020-07-28 18:03:57,408 osgar.zmqrouter INFO     reporter max delay: 0:00:00.007081
2020-07-28 18:03:57,409 osgar.zmqrouter INFO     app max delay: 0:00:00.064763
2020-07-28 18:03:57,409 osgar.zmqrouter ERROR    detector max delay: 0:00:02.035327

m3d added 2 commits July 28, 2020 19:31
Otherwise it fails on Eduro:
root@voyage:/home/robot/git/osgar# python -m subt run config/eduro-subt-estop-lora.json --side right --speed 0.3 --walldist 0.3 --timeout 30 --note "test master with processes"
2020-07-14 16:28:47,499 osgar.zmqrouter INFO     recording...
2020-07-14 16:28:47,502 osgar.zmqrouter INFO     /home/robot/git/osgar/logs/eduro-subt-estop-lora-200714_162847.log
  File "/home/robot/git/osgar/osgar/zmqrouter.py", line 60, in record
    router.register_nodes(modules.keys(), timeout=datetime.timedelta(seconds=1))
  File "/home/robot/git/osgar/osgar/zmqrouter.py", line 121, in register_nodes
    for packet in self.receive(timeout=timeout):
  File "/home/robot/git/osgar/osgar/zmqrouter.py", line 174, in receive
    raise RuntimeError("timeout")
2020-07-14 16:28:49,013 osgar.zmqrouter ERROR    timeout
2020-07-14 16:28:49,015 osgar.zmqrouter INFO     exception requested stop
@m3d m3d requested a review from zbynekwinkler July 28, 2020 18:11
@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

p.s. note, that the problem is primarily the "low level CAN loop", which without slots does not have sufficient priority

2020-07-28 18:22:45,068 osgar.record ERROR             app: maximum delay of 0:00:00.228250 at 0:00:05.734070
2020-07-28 18:22:45,069 osgar.record ERROR        detector: maximum delay of 0:00:03.340493 at 0:00:03.703160
2020-07-28 18:22:45,069 osgar.record ERROR           eduro: maximum delay of 0:00:00.158365 at 0:00:05.731323
2020-07-28 18:22:45,070 osgar.record ERROR             can: maximum delay of 0:00:02.511137 at 0:00:16.569695
2020-07-28 18:22:45,070 osgar.record ERROR          serial: maximum delay of 0:00:00.341134 at 0:00:05.893867
2020-07-28 18:22:45,071 osgar.record ERROR           lidar: maximum delay of 0:00:00.246748 at 0:00:15.410367
2020-07-28 18:22:45,071 osgar.record ERROR     slope_lidar: maximum delay of 0:00:00.279216 at 0:00:10.834190
2020-07-28 18:22:45,072 osgar.record ERROR    slope_lidar_tcp: maximum delay of 0:00:00.163550 at 0:00:05.902296
2020-07-28 18:22:45,072 osgar.record ERROR             imu: maximum delay of 0:00:00.244797 at 0:00:09.211370
2020-07-28 18:22:45,072 osgar.record ERROR           estop: maximum delay of 0:00:01.106129 at 0:00:05.893163
2020-07-28 18:22:45,073 osgar.record ERROR    estop_serial: maximum delay of 0:00:00.135178 at 0:00:15.700061
2020-07-28 18:22:45,073 osgar.record ERROR     lora_serial: maximum delay of 0:00:01.627332 at 0:00:11.984959
2020-07-28 18:22:45,074 osgar.record INFO     committing suicide by SIGINT

this is result with old Threads

@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

p.s.2 and slot version from master using old Threads:

2020-07-28 18:34:36,715 osgar.record ERROR             app: maximum delay of 0:00:00.230087 at 0:01:19.078037
2020-07-28 18:34:36,716 osgar.record ERROR        detector: maximum delay of 0:00:03.826929 at 0:00:38.948225
2020-07-28 18:34:36,716 osgar.record ERROR           lidar: maximum delay of 0:00:00.197656 at 0:00:08.341907
2020-07-28 18:34:36,716 osgar.record ERROR       lidar_tcp: maximum delay of 0:00:00.100110 at 0:01:18.955104
2020-07-28 18:34:36,717 osgar.record ERROR     slope_lidar: maximum delay of 0:00:00.303620 at 0:00:13.927750
2020-07-28 18:34:36,717 osgar.record ERROR    slope_lidar_tcp: maximum delay of 0:00:00.215104 at 0:01:10.362243
2020-07-28 18:34:36,718 osgar.record ERROR             imu: maximum delay of 0:00:00.173727 at 0:00:00.687396
2020-07-28 18:34:36,718 osgar.record ERROR           estop: maximum delay of 0:00:01.100149 at 0:00:13.572992
2020-07-28 18:34:36,718 osgar.record ERROR    estop_serial: maximum delay of 0:00:00.148416 at 0:00:13.721408
2020-07-28 18:34:36,719 osgar.record ERROR     lora_serial: maximum delay of 0:00:02.014870 at 0:01:01.039011

(this works OK with one reset on return home)

@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

Well, what are the delays with zmqrouter?

they are in the original PR post (on the top) - the report table is probably different for zmqrouter and old record

@zbynekwinkler
Copy link
Member

We have an improvement

2020-07-28 18:03:57,402 osgar.zmqrouter ERROR    can max delay: 0:00:00.286025
2020-07-28 18:03:57,402 osgar.zmqrouter ERROR    serial max delay: 0:00:00.472043

vs

2020-07-28 18:22:45,070 osgar.record ERROR             can: maximum delay of 0:00:02.511137 at 0:00:16.569695
2020-07-28 18:22:45,070 osgar.record ERROR          serial: maximum delay of 0:00:00.341134 at 0:00:05.893867

However it seems it is not enough. If you feel like more testing, it might be worth to try increasing priority for serial and can processes (is anything else involved in this loop?). It might be also worth checking the overall CPU load when the program runs (if we are at 400% (I assume the computer has 4 cores), it might prove to be difficult to get a reliable improvement) - htop would be the tool I'd choose.

@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

in the loop is also eduro module sending the commands

@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

TOP:

root@voyage:~# top
top - 19:19:15 up 5 min,  2 users,  load average: 1.76, 0.45, 0.16
Tasks:  94 total,   3 running,  91 sleeping,   0 stopped,   0 zombie
%Cpu(s): 57.0 us, 17.1 sy,  0.0 ni, 23.0 id,  0.0 wa,  0.0 hi,  2.9 si,  0.0 st
KiB Mem:   3973312 total,   615892 used,  3357420 free,     3604 buffers
KiB Swap:        0 total,        0 used,        0 free.   163540 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1733 root      20   0  627292  74304  46756 R 115.8  1.9   0:18.77 python
 1737 root      20   0  908352  77696  48652 S  42.4  2.0   0:09.25 python
 1740 root      20   0  416504  42208  23360 S  27.8  1.1   0:04.43 python
 1739 root      20   0  416504  42220  23368 S  25.8  1.1   0:04.04 python
 1748 root      20   0  490764  41944  23020 S  13.9  1.1   0:03.37 python
 1736 root      20   0  908808  72344  47144 S  13.6  1.8   0:32.59 python
 1741 root      20   0  490764  42324  23400 S  13.2  1.1   0:03.14 python
 1747 root      20   0  416504  42632  23792 S  11.6  1.1   0:03.44 python
 1744 root      20   0  416760  42452  23460 S   7.9  1.1   0:02.59 python
 1750 root      20   0  490764  42292  23372 S   7.0  1.1   0:02.58 python
 1752 root      20   0  490640  43052  23932 S   6.6  1.1   0:02.49 python
 1745 root      20   0  493140  44208  23544 S   6.3  1.1   0:02.56 python
 1742 root      20   0  416504  42360  23488 S   5.0  1.1   0:02.21 python
 1743 root      20   0  493140  44124  23460 S   4.3  1.1   0:02.34 python
 1751 root      20   0  416504  42216  23376 S   2.3  1.1   0:01.66 python
 1656 root      20   0   93296   6188   5260 S   1.3  0.2   0:00.49 sshd

and run with camera image with 5s sleep

2020-07-28 19:19:27,410 osgar.zmqrouter INFO     eduro max delay: 0:00:00.093942
2020-07-28 19:19:27,411 osgar.zmqrouter ERROR    lidar max delay: 0:00:00.140249
2020-07-28 19:19:27,411 osgar.zmqrouter ERROR    serial max delay: 0:00:00.503201
2020-07-28 19:19:27,412 osgar.zmqrouter ERROR    can max delay: 0:00:00.222875
2020-07-28 19:19:27,413 osgar.zmqrouter ERROR    imu max delay: 0:00:00.119570
2020-07-28 19:19:27,413 osgar.zmqrouter ERROR    estop max delay: 0:00:01.013109
2020-07-28 19:19:27,413 osgar.zmqrouter INFO     lora max delay: 0:00:00.013467
2020-07-28 19:19:27,414 osgar.zmqrouter ERROR    lora_serial max delay: 0:00:01.617519
2020-07-28 19:19:27,414 osgar.zmqrouter ERROR    slope_lidar max delay: 0:00:00.142748
2020-07-28 19:19:27,415 osgar.zmqrouter INFO     imu_serial max delay: 0:00:00.013677
2020-07-28 19:19:27,415 osgar.zmqrouter INFO     estop_serial max delay: 0:00:00.069178
2020-07-28 19:19:27,416 osgar.zmqrouter ERROR    slope_lidar_tcp max delay: 0:00:00.198054
2020-07-28 19:19:27,416 osgar.zmqrouter INFO     camera max delay: 0:00:00
2020-07-28 19:19:27,417 osgar.zmqrouter ERROR    lidar_tcp max delay: 0:00:00.109864
2020-07-28 19:19:27,417 osgar.zmqrouter ERROR    detector max delay: 0:00:02.025116
2020-07-28 19:19:27,417 osgar.zmqrouter INFO     app max delay: 0:00:00.059534
2020-07-28 19:19:27,418 osgar.zmqrouter INFO     reporter max delay: 0:00:00

any suggestion how to easily define "high priority modules"? (probably next time)

@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

p.s. I am a little bit confused from another run (when robot almost hit the obstacle) with output

2020-07-28 19:26:03,203 osgar.zmqrouter  INFO     eduro received quit
2020-07-28 19:26:03,204 osgar.zmqrouter  INFO     eduro finished
2020-07-28 19:26:03,207 osgar.zmqrouter  DEBUG    eduro stop confirmed
queue delay 0:00:04.986087
queue delay 0:00:04.951402
RUNNING 1
queue delay 0:00:04.897472
queue delay 0:00:04.850950
queue delay 0:00:04.773845
RESET 2
queue delay 0:00:04.790585
RESET 1
2020-07-28 19:26:03,810 osgar.zmqrouter ERROR    failed to stop within timeout of 0:00:02, exiting anyway
2020-07-28 19:26:03,811 osgar.zmqrouter ERROR    b'can' queue: 216
2020-07-28 19:26:03,811 osgar.zmqrouter ERROR    b'serial' queue: 32
2020-07-28 19:26:03,812 osgar.zmqrouter ERROR    b'imu' queue: 1
2020-07-28 19:26:03,812 osgar.zmqrouter ERROR    b'eduro' queue: 458
2020-07-28 19:26:03,812 osgar.zmqrouter ERROR    b'lora' queue: 17
2020-07-28 19:26:03,813 osgar.zmqrouter ERROR    b'estop_serial' queue: 1
2020-07-28 19:26:03,813 osgar.zmqrouter ERROR    b'camera' queue: 1
2020-07-28 19:26:03,814 osgar.zmqrouter ERROR    b'lidar_tcp' queue: 1
2020-07-28 19:26:03,814 osgar.zmqrouter ERROR    b'app' queue: 230
2020-07-28 19:26:03,814 osgar.zmqrouter ERROR    still running: {b'camera', b'app', b'can'}
2020-07-28 19:26:03,815 osgar.zmqrouter ERROR    can max delay: 0:00:00.191245
2020-07-28 19:26:03,815 osgar.zmqrouter ERROR    serial max delay: 0:00:00.343067
2020-07-28 19:26:03,816 osgar.zmqrouter INFO     imu max delay: 0:00:00.099311
2020-07-28 19:26:03,816 osgar.zmqrouter ERROR    slope_lidar max delay: 0:00:00.165040
2020-07-28 19:26:03,817 osgar.zmqrouter INFO     eduro max delay: 0:00:00.082344
2020-07-28 19:26:03,817 osgar.zmqrouter INFO     lora max delay: 0:00:00.007880
2020-07-28 19:26:03,818 osgar.zmqrouter INFO     estop_serial max delay: 0:00:00.070973
2020-07-28 19:26:03,818 osgar.zmqrouter ERROR    lidar max delay: 0:00:00.137090
2020-07-28 19:26:03,819 osgar.zmqrouter INFO     imu_serial max delay: 0:00:00.063564
2020-07-28 19:26:03,819 osgar.zmqrouter ERROR    estop max delay: 0:00:01.020443
2020-07-28 19:26:03,819 osgar.zmqrouter INFO     camera max delay: 0:00:00
2020-07-28 19:26:03,820 osgar.zmqrouter ERROR    lidar_tcp max delay: 0:00:00.112282
2020-07-28 19:26:03,820 osgar.zmqrouter ERROR    lora_serial max delay: 0:00:01.372562
2020-07-28 19:26:03,821 osgar.zmqrouter ERROR    slope_lidar_tcp max delay: 0:00:00.192858
2020-07-28 19:26:03,821 osgar.zmqrouter ERROR    detector max delay: 0:00:01.653852
2020-07-28 19:26:03,822 osgar.zmqrouter INFO     app max delay: 0:00:00.059906
2020-07-28 19:26:03,822 osgar.zmqrouter INFO     reporter max delay: 0:00:00

where I do not see this extra large (more than 4s) in the final report, i.e. the queue size is probably important (if it is not small, then the time can be misleading?)

@m3d
Copy link
Member Author

m3d commented Jul 28, 2020

another sample from the middle of the run

queue delay 0:00:03.987551
2020-07-28 19:26:00,693 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.107104, 'channel': b'raw'}
2020-07-28 19:26:00,726 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.139967, 'channel': b'raw'}
2020-07-28 19:26:00,736 osgar.zmqrouter ERROR    {'name': b'lidar', 'delay': 0.118761, 'channel': b'raw'}
2020-07-28 19:26:00,756 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.170463, 'channel': b'raw'}
2020-07-28 19:26:00,778 osgar.zmqrouter ERROR    {'name': b'slope_lidar', 'delay': 0.112106, 'channel': b'raw'}
2020-07-28 19:26:00,783 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.197284, 'channel': b'raw'}
2020-07-28 19:26:00,811 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.225987, 'channel': b'raw'}
2020-07-28 19:26:00,837 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.251475, 'channel': b'raw'}
queue delay 0:00:03.990887
2020-07-28 19:26:00,887 osgar.zmqrouter ERROR    {'name': b'lidar', 'delay': 0.117554, 'channel': b'raw'}
2020-07-28 19:26:00,960 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.100285, 'channel': b'raw'}
2020-07-28 19:26:00,967 osgar.zmqrouter ERROR    {'name': b'can', 'delay': 0.10498, 'channel': b'can'}
2020-07-28 19:26:00,979 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.119216, 'channel': b'raw'}
2020-07-28 19:26:00,983 osgar.zmqrouter ERROR    {'name': b'slope_lidar', 'delay': 0.113482, 'channel': b'raw'}
2020-07-28 19:26:01,031 osgar.zmqrouter ERROR    {'name': b'lidar', 'delay': 0.115519, 'channel': b'raw'}
2020-07-28 19:26:01,037 osgar.zmqrouter ERROR    {'name': b'serial', 'delay': 0.176752, 'channel': b'raw'}
queue delay 0:00:04.027473

@zbynekwinkler
Copy link
Member

TOP

Unfortunately, I am not sure how to interpret that - it requires so fu I am not in possession of (that is why I suggested htop since it has human readable output). But to even attempt it, the number of CPU cores is critical variable.

any suggestion how to easily define "high priority modules"? (probably next time)

I would just go and edit the __init__ of the respective modules for the test if it makes any difference.

@zbynekwinkler
Copy link
Member

queue delay 0:00:03.987551

I am not sure how this delay is defined. Quick scan through the code suggests it measures the frequency of lidar scans (?).

Can you upload some log?

Anyway, it seems the latency is still not good enough. I'll try to think about it some more.

@zbynekwinkler
Copy link
Member

2020-07-28 19:26:03,810 osgar.zmqrouter ERROR    failed to stop within timeout of 0:00:02, exiting anyway
2020-07-28 19:26:03,811 osgar.zmqrouter ERROR    b'can' queue: 216
2020-07-28 19:26:03,811 osgar.zmqrouter ERROR    b'serial' queue: 32
2020-07-28 19:26:03,812 osgar.zmqrouter ERROR    b'imu' queue: 1
2020-07-28 19:26:03,812 osgar.zmqrouter ERROR    b'eduro' queue: 458
2020-07-28 19:26:03,812 osgar.zmqrouter ERROR    b'lora' queue: 17
2020-07-28 19:26:03,813 osgar.zmqrouter ERROR    b'estop_serial' queue: 1
2020-07-28 19:26:03,813 osgar.zmqrouter ERROR    b'camera' queue: 1
2020-07-28 19:26:03,814 osgar.zmqrouter ERROR    b'lidar_tcp' queue: 1
2020-07-28 19:26:03,814 osgar.zmqrouter ERROR    b'app' queue: 230
2020-07-28 19:26:03,814 osgar.zmqrouter ERROR    still running: {b'camera', b'app', b'can'}

There seems to be something really really broken here. The nodes app, eduro, can and serial have a way many messages in the queue. Three of them even fail to notice that they should quit within 2s timeout. So the delay really must be longer. It would be interesting to know, what were the maximum delays reported at the end. For that, it would be useful to be able to check the log since the delays are logged there while these error messages with queue lengths are reported only on the console during the original run.

@zbynekwinkler
Copy link
Member

TOP

>>> 115.8 + 42.4 + 27.8 + 25.8 + 13.9 + 13.6 + 13.2 + 11.6 + 7.9 + 7 + 6.6 + 6.3 + 5 + 4.3 + 2.3
303.50000000000006

So... if the system has 4 CPUs we should have some spare resources.

@zbynekwinkler
Copy link
Member

Combined with load 1.76, I think the hint is that the processes do not run often enough so there is too much waiting (otherwise the load would be around 4). I am thinking about trying to increase the priority of the parent process.

@zbynekwinkler
Copy link
Member

Here is a sample output from htop:
image
It has everything we could need - the top shows per cpu utilization, lower is the load and bottom shows the process tree along with cpu and memory usage.

@zbynekwinkler
Copy link
Member

There are parts within publish that could be moved into yet another process to increase parallelization and decrease latency (compression and log writing):

osgar/osgar/zmqrouter.py

Lines 204 to 222 in 8165f57

def publish(self, sender, channel, data):
dt = self.now()
dt_uint32 = osgar.logger.format_timedelta(dt)
self.socket.send_multipart([sender, b"", b"publish", dt_uint32])
link_from = sender + b"." + channel
for node_name, input_channel in self.subscriptions.get(link_from, []):
self.send(node_name, b'listen', dt, input_channel, data)
stream_id = self.stream_id[link_from]
if stream_id in self.no_output:
data = osgar.lib.serialize.serialize(None)
elif stream_id in self.compressed_output:
data = osgar.lib.serialize.compress(data)
self.logger.write(stream_id, data, dt)
delay = dt - self.last_listen.get(sender, dt)
if self.delays.get(sender, datetime.timedelta()) < delay:
self.delays[sender] = delay
if delay > ASSERT_QUEUE_DELAY:
self.report_error(name=sender, delay=delay.total_seconds(), channel=channel)

I have not done it yet, because it would involve yet another queue (inside zmq) and I am not sure yet how to make sure it does not grow without bounds. However, solution like that would keep all the good parts of the current solution and since shutdown is very much under control, I am not afraid anymore that the log would not be complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants