Optimizations for achieving higher cycletimes #101
I also have the instability issue. When I set the update frequency to 1 kHz, one of my motors switches between enabled and disabled states frequently. At 100 Hz there is no such issue. Could you please provide more information on how to perform your three optimizations? I don't know how to do that in the code.
@wjx1991 The underlying problem will not be fixed by these optimizations, since that would require changing the EtherCAT driver source code, but they improve stability a lot.
- NIC delay turn-off: check ethtool in Linux, specifically the coalescing settings. These are throughput optimizations that hold messages back, which we definitely don't want for EtherCAT communication.
- Isolated cores: done by changing the GRUB settings that get applied to the kernel on startup. More info can be found in a Canonical blog post on RT-behaviour optimization.
- Assigning IRQs to the same core: I did this with a startup script that, when starting the EtherCAT master, looks up the NIC's IRQs and sets their affinity to the isolated core running the EtherCAT master and driver.
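The three steps above can be sketched as shell commands. This is a minimal sketch, not the exact script from the comment: the interface name `eth0` and the core numbers are placeholders for your own setup, and not every NIC supports every coalescing parameter.

```shell
# 1. Turn off interrupt coalescing on the NIC (a throughput optimization
#    that delays frame delivery, which is undesirable for EtherCAT).
sudo ethtool -C eth0 rx-usecs 0 tx-usecs 0

# 2. Isolate cores for the RT loop: add the following to GRUB_CMDLINE_LINUX
#    in /etc/default/grub, then run update-grub and reboot.
#      isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3

# 3. Pin the NIC's IRQs to the isolated core running the EtherCAT
#    master and driver (core 3 in this example).
for irq in $(awk -F: '/eth0/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
    echo 3 | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null
done
```

All three commands need root and real hardware, so treat them as a template to adapt rather than a script to run as-is.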
Sorry for the delay. Thank you very much.
Hi Pieterjan, To be able to help you, we would like more information about the computer configuration on which your application is running.
Hi Yguel, I'm indeed using an RT-patched Ubuntu minimal installation. This is running on an i5 IPC with 6 cores. I also optimised the OS for real-time behaviour (BIOS settings / core isolation / ...). The cyclictest tool shows latencies consistently under 100 µs even under 100% system load, so I think the real-time behaviour of the system is OK. I also mapped the ROS 2 control real-time loop onto an isolated core, as well as the EtherCAT kernel module and the interrupt threads of the bound NIC. All in all this shows good real-time performance, and 1 kHz does not seem to be a problem to achieve. The only thing I saw was that, because the Read and Write functions control the frame release of the EtherCAT communication, any load inside a controller or driver has a direct impact on this and thus increases the jitter on the frame release. This would sometimes lead to slave state changes. It is a suggestion to improve performance and maybe communication stability in general. I also have other suggestions for improvement and can / want to help with the implementation, but CONTRIBUTING.md said to open an issue for features / bugs / ..., so I wanted to wait for your response before continuing.
Hi Pieterjan, Here is a list of questions that may help us find the source of the problem: 1 - What Ethernet driver do you use? In particular, if you use the generic Ethernet driver when configuring the IgH EtherCAT master, we have observed that it may not be optimized for RT. If you can switch to one of the drivers supported by EtherLab, it will be closer to what we know works well. You used the correct way to raise your problem; if I have not answered earlier, it is because of an error on my part.
Hi yguel, Below is the information. 1: I indeed use the generic driver; I will look into whether I can use another one to test. I also captured a tcpdump to analyse the timings of the frames sent out by the EtherCAT driver. It was here that I noticed some jitter in the times at which the frames get released. Jitter I define here as the deviation from the optimal frame release time (previous cycle start + cycle time). The deviation was an average of around
3: Currently running kernel 5.15.119-rt65, but I have more or less the same results with version 6.1.77-rt24, Ubuntu 22.04.3 LTS, ROS 2 Humble. The test setup is running 4 motors in csp8 mode at a 1 kHz update interval. As the sending of the frames is coupled to the triggering of the Read and Write functions of the ROS 2 RT loop, I was wondering if this could be an issue and whether it would be better to put the frame releases on a different RT thread and let the Read and Write functions just exchange data between the domains and the ROS 2 control interfaces. This was the origin of my question.
Thank you for the details.
Those messages say that the working counter (WC: a counter that increments during the EtherCAT cycle each time a slave gets the EtherCAT message frame) was 0, meaning no slaves got the datagram (no registered process data were exchanged). The correct state is 2 (all registered process data were exchanged), so I assume that normally you have 12 slaves on your EtherCAT bus, am I correct? Other possibilities to feed your research:
Side note: I would like to write some documentation to help users investigate the kind of problems you had. It looks like you have done good and extensive tests. It would greatly help people to have guidelines for debugging such annoying problems. If you are interested in such a document, I would appreciate your feedback.
Hi,
Additional documentation on how to debug common issues sounds nice; I can provide feedback for sure, or help write it if you want (maybe DM me / start a discussion thread for this?). What I have done so far:
-> The conclusion of these tests seems to be that there is some jitter (max ±400 µs) on the frame release timings, but with an acceptable average of around ±85 µs, which is still well within the cycle time. Further analysis and different test cases under different loads are still to be performed before I can be sure of my findings. Side question: for exposing diagnostics and statistics such as the master / slave / domain state and timings from the driver, how do you think this is best done? At the moment I made an extra ROS 2 node that uses a queue from the real-time thread to publish data on a diagnostics topic. It just feels a bit like working around the ROS 2 control framework doing it this way, so I was wondering if you know a common / better way of doing this. Output of dmesg:
[19216.293852] EtherCAT: Requesting master 0...
[19216.293861] EtherCAT: Successfully requested master 0.
[19216.486231] EtherCAT 0: Domain0: Logical address 0x00000000, 116 byte, expected working counter 12.
[19216.486238] EtherCAT 0: Datagram domain0-0-main: Logical offset 0x00000000, 116 byte, type LRW.
[19216.486261] EtherCAT 0: Master thread exited.
[19216.486264] EtherCAT 0: Starting EtherCAT-OP thread.
[19216.486616] EtherCAT 0: binding thread to cpu 3
[19216.486629] EtherCAT WARNING 0: 1 datagram UNMATCHED!
[19217.609445] EtherCAT 0: Domain 0: Working counter changed to 2/12.
[19217.970384] EtherCAT 0: Slave states on main device: OP.
[19218.610434] EtherCAT 0: Domain 0: 7 working counter changes - now 12/12.
[19227.421439] EtherCAT 0: Domain 0: Working counter changed to 0/12.
[19227.486376] EtherCAT WARNING 0: 7 datagrams TIMED OUT!
[19227.486381] EtherCAT WARNING 0: 7 datagrams UNMATCHED!
[19228.422433] EtherCAT 0: Domain 0: Working counter changed to 12/12.
[19236.993434] EtherCAT 0: Domain 0: Working counter changed to 0/12.
[19237.486378] EtherCAT WARNING 0: 4 datagrams TIMED OUT!
[19237.486385] EtherCAT WARNING 0: 4 datagrams UNMATCHED!
[19237.994438] EtherCAT 0: Domain 0: Working counter changed to 12/12.
[19244.931437] EtherCAT 0: Domain 0: Working counter changed to 0/12.
[19245.486374] EtherCAT WARNING 0: 4 datagrams TIMED OUT!
[19245.486380] EtherCAT WARNING 0: 4 datagrams UNMATCHED!
[19245.932432] EtherCAT 0: Domain 0: Working counter changed to 12/12.
[19247.454438] EtherCAT 0: Domain 0: Working counter changed to 0/12.
[19247.486379] EtherCAT WARNING 0: 3 datagrams TIMED OUT!
[19247.486385] EtherCAT WARNING 0: 3 datagrams UNMATCHED!
[19248.455431] EtherCAT 0: Domain 0: Working counter changed to 12/12.
[19249.456443] EtherCAT 0: Domain 0: 2 working counter changes - now 12/12.
[19249.486374] EtherCAT WARNING 0: 7 datagrams TIMED OUT!
[19249.486380] EtherCAT WARNING 0: 7 datagrams UNMATCHED!
-> This would lead me to believe that frames were skipped, and that I should thus indeed try the dedicated driver and check the NIC settings?
Hi,
It looks interesting, though, that the number of datagrams always matches between TIMED OUT and UNMATCHED. The delay is around 6 µs ((min, max) = (5 µs, 7 µs)), which is way below 1 ms. You could get more detailed kernel messages if you enable debug messages in the IgH EtherCAT master using the following command:
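For reference (an assumption, since the command itself is not quoted above): the IgH `ethercat` command-line tool can raise the master's debug level, after which the extra messages appear in the kernel log (dmesg).

```shell
# Presumably the command meant: raise the IgH master's debug level.
sudo ethercat debug 1
# "ethercat debug 0" turns the extra messages off again.
```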
What kind of Ethernet controller do you have? For statistics, that is a tough ros2_control question!
Hi, Thanks for the feedback. I see it's an Intel I210-based chip running the igb driver at the moment, if I'm not mistaken. I'm trying to build a new kernel module with the igb driver enabled but am having some issues at the moment, so I'll continue this tomorrow / next week. I'll also check the kernel messages then. I can investigate a lot more already with your guidelines and I will give an update when I have more findings. Thanks for the help! For the logging bit, I'm using a single-producer single-consumer queue at the moment to tunnel the data out of the RT thread. I started with the RT publishers but found the queue a bit easier. Once out of the RT context, I then use a normal node to do some processing and publish diagnostics. It felt a bit hacky, so it would be interesting to see how other people tackled this.
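The single-producer single-consumer queue described above can be sketched as a lock-free ring buffer in C. The field names in `diag_sample_t` are illustrative, not taken from the driver; the real payload would be whatever diagnostics the RT thread wants to export.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define QSIZE 256  /* must be a power of two; capacity is QSIZE - 1 */

typedef struct {
    uint32_t wc;        /* e.g. last observed working counter (illustrative) */
    int64_t  cycle_ns;  /* e.g. measured cycle time in ns (illustrative)    */
} diag_sample_t;

typedef struct {
    diag_sample_t buf[QSIZE];
    atomic_size_t head;  /* written only by the RT producer   */
    atomic_size_t tail;  /* written only by the node consumer */
} diag_queue_t;

/* RT side: never blocks or allocates; drops the sample if the queue is full. */
static int diag_push(diag_queue_t *q, diag_sample_t s)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t next = (head + 1) & (QSIZE - 1);
    if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
        return 0;  /* full: consumer is lagging */
    q->buf[head] = s;
    atomic_store_explicit(&q->head, next, memory_order_release);
    return 1;
}

/* Non-RT side (the publishing node): drains samples and publishes them. */
static int diag_pop(diag_queue_t *q, diag_sample_t *out)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return 0;  /* empty */
    *out = q->buf[tail];
    atomic_store_explicit(&q->tail, (tail + 1) & (QSIZE - 1),
                          memory_order_release);
    return 1;
}
```

The acquire/release pairing is what makes this safe with exactly one producer and one consumer thread; with more threads on either side a different queue would be needed.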
Hi, an update from my side. I tried the igb driver but then could no longer get a connection to the slaves. I see in the dmesg output that it found the interface and uses it, but something else is probably not set up right, or I need another driver. I will take a further look later. I also did some further testing, including a long-running test with timings of the triggering of the Read() and Write() functions; jitter is defined as the deviation from the ideal cycle time (previous + 1 ms).
These seem to be OK for our purpose, so I don't think I have a timing issue.
I had 1 frame 'overrun' during this collection period (delay > cycle time). I still have the datagram warnings, but these do not seem to coincide with the perceived jitter (the jitter on the motor produces modem-like noises). Further investigation currently points me to the control model (are all updates from the trajectory generator neatly passed to the ethercat_driver, ...) and the DC-sync implementation:

```c
clock_gettime(CLOCK_REALTIME, &t);
ecrt_master_application_time(master_, EC_NEWTIMEVAL2NANO(t));
ecrt_master_sync_reference_clock(master_);
ecrt_master_sync_slave_clocks(master_);
```

Here CLOCK_REALTIME is used instead of CLOCK_MONOTONIC. So I'm investigating further, and if I find something I'll share it again, but at the moment it does not seem correlated to the ethercat_driver implementation.
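Since CLOCK_REALTIME can jump when NTP steps the system clock, a sketch of the CLOCK_MONOTONIC variant being investigated above, assuming the IgH ecrt API (`master_` is the requested master handle; whether the swap actually helps depends on the rest of the DC setup):

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert a timespec to the plain nanosecond count expected by
 * ecrt_master_application_time(). */
static inline uint64_t timespec_to_ns(const struct timespec *t)
{
    return (uint64_t)t->tv_sec * NSEC_PER_SEC + (uint64_t)t->tv_nsec;
}

/* In the cyclic task the DC sync would then read (sketch):
 *
 *   struct timespec t;
 *   clock_gettime(CLOCK_MONOTONIC, &t);
 *   ecrt_master_application_time(master_, timespec_to_ns(&t));
 *   ecrt_master_sync_reference_clock(master_);
 *   ecrt_master_sync_slave_clocks(master_);
 */
```

The key property is that the application time fed to the distributed clocks advances smoothly, since CLOCK_MONOTONIC is never stepped.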
I'm trying to control several motors in CSP8 mode and have a trajectory generator integrated into a controller.
To have good behaviour I should update the motors at about 1 kHz.
With some optimizations to the NIC settings I am able to run this at 1 kHz without many problems, but now and then I do see some state changes occur on the slaves. When looking into this I noticed that these are likely caused by delays of up to multiple milliseconds propagating from the ROS 2 control RT loop into the EtherCAT driver. These delays cause noise on the motors and seem to reduce system stability.
Are there other people running at higher control rates who have / had the same issues and know further optimizations to achieve more stable system behaviour?
Current optimizations performed:
Apart from these optimizations, I think the core issue of delays in the ROS 2 control loop propagating to the interface handler can only be avoided by putting the function responsible for reading from / writing to the slaves on a different RT thread.
To achieve this, I would put the Run function into an RT thread that just keeps sending / receiving the current domain to / from the slaves. The Read and Write functions would then only update this domain, and the newly written information gets sent to the slaves on the next cycle of the RT communication thread. I believe this is also how it is done in e.g. the SOEM EtherCAT library, and I think also in the SimpleECAT library this is based on.
Is there anyone who can confirm this / provide more background information about the design, or say whether there has simply been no need for it so far because of longer cycle times?