Ethernet starts breaking after no response to inbound packets on the port #137

Closed
HarryMakes opened this issue Sep 10, 2020 · 7 comments

@HarryMakes
Contributor

HarryMakes commented Sep 10, 2020

Before this Ethernet issue appeared, I could flash various tagged versions of the firmware to the board without problems. In those cases, pinging was reliably successful and the board returned proper responses. For example, since stabilizer.py hasn't been updated since v0.3.0, this frontend would get an ok on v0.3.0, or a parse error on v0.4.0 / 0.4.1.

However, after reflashing the board numerous times, Ethernet began to break.

  1. Sometimes, if the RJ45 cable has been plugged in before the board is powered (with 12V), pinging fails.
  2. Once pinging is confirmed to work, when I use stabilizer.py to send something to the TCP port (1235), the script hangs while polling for the response.
  3. Afterwards, if I try pinging again, the board stops replying to my computer. Edit: pinging fails often, but not always, after using stabilizer.py; the failure can sometimes be reproduced by killing stabilizer.py and re-running it.

Below is the tshark dump captured while these issues happen (I hardcoded the board IP as 192.168.1.79; the lines starting with ### are my annotations):

$ tshark -f "host 192.168.1.79"
Capturing on 'eno1'
### Start Pinging the board that has been powered, I get replies.
    1 0.000000000 Micro-St_26:b8:26 → Broadcast    ARP 42 Who has 192.168.1.79? Tell 192.168.1.116
    2 0.000176686 Microchi_d2:89:a1 → Micro-St_26:b8:26 ARP 60 192.168.1.79 is at 04:91:62:d2:89:a1
    3 0.000180615 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0024, seq=1/256, ttl=64
    4 0.000420384 192.168.1.79 → 192.168.1.116 ICMP 98 Echo (ping) reply    id=0x0024, seq=1/256, ttl=64 (request in 3)
    5 1.047154568 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0024, seq=2/512, ttl=64
    6 1.047364219 192.168.1.79 → 192.168.1.116 ICMP 98 Echo (ping) reply    id=0x0024, seq=2/512, ttl=64 (request in 5)
    7 2.071126156 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0024, seq=3/768, ttl=64
    8 2.071347588 192.168.1.79 → 192.168.1.116 ICMP 98 Echo (ping) reply    id=0x0024, seq=3/768, ttl=64 (request in 7)
    9 3.095137617 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0024, seq=4/1024, ttl=64
   10 3.095361057 192.168.1.79 → 192.168.1.116 ICMP 98 Echo (ping) reply    id=0x0024, seq=4/1024, ttl=64 (request in 9)
### Start using `stabilizer.py`, e.g. `python -m stabilizer -c 0 -p 1.0`.
   11 6.562644070 192.168.1.116 → 192.168.1.79 TCP 74 59600 → 1235 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=335328363 TSecr=0 WS=128
### There seem to be intermittent outgoing packets from the board...
### However, `stabilizer.py` keeps polling for responses while there has been none.
   12 6.562989099 192.168.1.79 → 192.168.1.116 TCP 66 1235 → 59600 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 WS=1 SACK_PERM=1
   13 6.563003995 192.168.1.116 → 192.168.1.79 TCP 54 59600 → 1235 [ACK] Seq=1 Ack=1 Win=64256 Len=0
   14 6.563287760 192.168.1.116 → 192.168.1.79 TCP 151 59600 → 1235 [PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
### TCP retransmissions from my computer happen.
   15 6.767079080 192.168.1.116 → 192.168.1.79 TCP 151 [TCP Retransmission] 59600 → 1235 [PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
   16 6.975075893 192.168.1.116 → 192.168.1.79 TCP 151 [TCP Retransmission] 59600 → 1235 [PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
   17 7.383108497 192.168.1.116 → 192.168.1.79 TCP 151 [TCP Retransmission] 59600 → 1235 [PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
   18 8.215082051 192.168.1.116 → 192.168.1.79 TCP 151 [TCP Retransmission] 59600 → 1235 [PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
   19 9.879078601 192.168.1.116 → 192.168.1.79 TCP 151 [TCP Retransmission] 59600 → 1235 [PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
   20 12.066799297 192.168.1.116 → 192.168.1.79 TCP 54 59600 → 1235 [FIN, ACK] Seq=98 Ack=1 Win=64256 Len=0
   21 13.010963441 192.168.1.116 → 192.168.1.79 TCP 74 59602 → 1235 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=335334812 TSecr=0 WS=128
   22 13.011233708 192.168.1.79 → 192.168.1.116 TCP 60 1235 → 59602 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
   23 13.143117939 192.168.1.116 → 192.168.1.79 TCP 151 [TCP Retransmission] 59600 → 1235 [FIN, PSH, ACK] Seq=1 Ack=1 Win=64256 Len=97
   24 13.143390122 192.168.1.79 → 192.168.1.116 TCP 60 1235 → 59600 [ACK] Seq=1 Ack=99 Win=5840 Len=0
   25 13.143444053 192.168.1.79 → 192.168.1.116 TCP 60 1235 → 59600 [FIN, ACK] Seq=1 Ack=99 Win=5840 Len=0
   26 13.143468326 192.168.1.116 → 192.168.1.79 TCP 54 59600 → 1235 [ACK] Seq=99 Ack=2 Win=64256 Len=0
### Halting `stabilizer.py`. Now try to Ping again, but I get no more replies.
   27 15.512267842 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=1/256, ttl=64
   28 16.536139917 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=2/512, ttl=64
   29 17.559131304 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=3/768, ttl=64
   30 18.583151300 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=4/1024, ttl=64
   31 19.607132085 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=5/1280, ttl=64
   32 20.631157921 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=6/1536, ttl=64
   33 21.655130647 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=7/1792, ttl=64
   34 22.679144554 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=8/2048, ttl=64
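
A minimal sketch for reading the retransmission timing in the capture above, using only the timestamps of packets 14 to 19 as printed there. The gaps come out at roughly 0.2 s, 0.2 s, 0.4 s, 0.8 s and 1.7 s, which looks like ordinary exponential backoff on the host side and is consistent with the board not acknowledging the 97-byte payload until packet 24. The helper name `gaps` is made up for this illustration.

```python
# Timestamps (seconds) copied from the capture above: the original
# data segment (packet 14) followed by its retransmissions (packets 15-19).
timestamps = [
    6.563287760,  # packet 14, original [PSH, ACK]
    6.767079080,  # packet 15
    6.975075893,  # packet 16
    7.383108497,  # packet 17
    8.215082051,  # packet 18
    9.879078601,  # packet 19
]


def gaps(times):
    """Return the interval between each consecutive pair of timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


intervals = gaps(timestamps)
for number, interval in zip(range(15, 20), intervals):
    print(f"packet {number}: {interval:.3f} s after the previous attempt")
```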
@jordens
Member

jordens commented Sep 10, 2020

This may or may not be a hardware issue. I've also seen Ethernet break (see the issues on the hardware repository), and in one case the PHY RST and TX lines were apparently shorted.

@ryan-summers
Member

> 1. Sometimes, if RJ45 has been plugged in before powering the board (with 12V), Pinging would fail.

Are you using a PoE (Power-over-Ethernet) switch? The RJ45 connector can also supply power to the board, so you may be powering it before you are aware. We have also observed Stabilizer not functioning properly with PoE switches. See sinara-hw/Stabilizer#76

> 2. After Pinging has been guaranteed to work properly, when I use stabilizer.py to send something on the TCP port (1235), the script would hang at trying to poll for the response.

stabilizer.py is deprecated with recent firmware updates. See #102 - I would not expect any communication to be successful with recent firmware changes.

> 3. Afterwards, if I try Pinging again, the board would stop replying to my computer. Edit: Pinging would fail often but not always after using stabilizer.py; but this could sometimes be reproduced by killing stabilizer.py and re-running it.

This sounds a lot like sinara-hw/Stabilizer#76

The entire TCP layer got reworked in recent firmware. stabilizer.py should not be able to make a reliable connection - it's been a TODO. There's an example of the updated API in pounder_test.py, but there may be upcoming changes to ethernet (e.g. conversion to MQTT) in the near future so I believe the ethernet interface has been on hold.

@HarryMakes
Contributor Author

HarryMakes commented Sep 10, 2020

Thank you for the replies.

@ryan-summers I'd like to give the following response:

> Are you using a PoE (Power-over-ethernet) switch? The RJ45 connector can also supply power to the board, so you may be powering it before you are aware.

This sounds a bit weird. With only the Cat-6 cable plugged into the RJ45 port and no other power supply, measurements at the various test points show no voltage fed to the board, and no LEDs (including the indicators on the RJ45 port) light up.

> stabilizer.py is deprecated with recent firmware updates. See #102

Thanks for the link; I came across it before posting and I understand the situation. However, as I wrote in my first paragraph, before the Ethernet breakage started, even using the front-end with firmware v0.4.x would at least get a parse error response. Now nothing seems to come out of the TCP socket at all.

> The entire TCP layer got reworked in recent firmware.

I understand the rework is ongoing, so perhaps this particular Ethernet breakage is a result of it? By the way, when I run pounder_test.py, it still seems to get stuck connecting to the port:

$ python -m pounder_test
^CTraceback (most recent call last):
  File "/nix/store/r94aa2gj4drkhfvkm2p4ab6cblb6kxlq-python3-3.7.6/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/nix/store/r94aa2gj4drkhfvkm2p4ab6cblb6kxlq-python3-3.7.6/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/harry/quartiq-stabilizer/pounder_test.py", line 111, in <module>
    main()
  File "/home/harry/quartiq-stabilizer/pounder_test.py", line 90, in main
    s.connect((HOST, PORT))
KeyboardInterrupt

I get that the request packet format has been changed between v0.3 and v0.4, but such a difference should only result in a parse error type of message at the API level, not a complete lack of response at the transport layer or lower in the stack.
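
To tell these cases apart without a script that blocks until Ctrl-C, a probe with explicit timeouts can help. The following is only a sketch built on Python's standard `socket` module; the function name `probe`, its return labels, and the default timeout are made up for this illustration and are not part of stabilizer.py or pounder_test.py.

```python
import socket


def probe(host, port, payload=b"", timeout=2.0):
    """Classify how a TCP endpoint reacts, without ever blocking forever.

    Returns one of: "connect-timeout", "refused", "no-reply", "closed", "reply".
    Other network errors (for example an unreachable host) propagate as OSError.
    """
    try:
        # Unlike a bare s.connect(), this gives up after `timeout` seconds.
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect-timeout"
    except ConnectionRefusedError:
        return "refused"
    with sock:
        sock.settimeout(timeout)
        try:
            if payload:
                sock.sendall(payload)
            data = sock.recv(1024)
        except socket.timeout:
            # Connected, but nothing came back within the timeout.
            return "no-reply"
        except ConnectionResetError:
            return "closed"
        # An empty read means the peer closed the connection cleanly.
        return "reply" if data else "closed"
```

Called as `probe("192.168.1.79", 1235, b"...")` with whatever request bytes the firmware version expects, a "reply" result would correspond to the parse-error case described above, while "no-reply" or "connect-timeout" would correspond to the silence seen now.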

@HarryMakes
Contributor Author

@jordens I've checked my RST and TX/RX pins; they don't seem to be shorted.

@HarryMakes
Contributor Author

HarryMakes commented Sep 10, 2020

To be clear, pinging alone always works as long as I don't use stabilizer.py. As long as I don't try to connect to the TCP port, the board keeps replying to my pings.
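
One way to quantify "the board stops replying" from a saved tshark text dump is to list the echo requests that no reply refers back to. This is only a sketch: it assumes the one-line summary format seen in the capture earlier in this thread, where each reply ends with "(request in N)", and the function name `unanswered_pings` is hypothetical.

```python
import re

# Matches the frame number at the start of an echo-request summary line.
REQUEST = re.compile(r"^\s*(\d+)\s.*Echo \(ping\) request")
# Matches the "(request in N)" back-reference on an echo-reply line.
REPLY = re.compile(r"Echo \(ping\) reply.*\(request in (\d+)\)")


def unanswered_pings(lines):
    """Return frame numbers of echo requests that no reply refers back to."""
    requests, answered = [], set()
    for line in lines:
        reply = REPLY.search(line)
        if reply:
            answered.add(int(reply.group(1)))
            continue
        request = REQUEST.match(line)
        if request:
            requests.append(int(request.group(1)))
    return [frame for frame in requests if frame not in answered]


# Three lines copied from the capture: one answered ping, one unanswered.
sample = [
    "    3 0.000180615 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0024, seq=1/256, ttl=64",
    "    4 0.000420384 192.168.1.79 → 192.168.1.116 ICMP 98 Echo (ping) reply    id=0x0024, seq=1/256, ttl=64 (request in 3)",
    "   27 15.512267842 192.168.1.116 → 192.168.1.79 ICMP 98 Echo (ping) request  id=0x0025, seq=1/256, ttl=64",
]
print(unanswered_pings(sample))  # prints [27]
```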

@HarryMakes
Contributor Author

After extensive testing, this issue might come from the CPU's failure to send a reply after receiving any packets on the port. I commented out all the calls to the json_reply() function in server.rs (see these lines), and since then the Ethernet has not broken.

I do notice that an error is raised on this line of the json_reply() function. Can anyone else reproduce this?

HarryMakes changed the title from "Ethernet starts breaking after no response to inbound packets on the port" to "server::json_reply() raises error after receiving packets on the port" on Sep 14, 2020
@HarryMakes
Contributor Author

I'm closing this issue and opening two new issues, one for each of two different Ethernet symptoms:

  1. Ethernet might break upon power-cycling.
  2. When Pinging works after power-cycling, sending any packets to the port might make the CPU panic.

HarryMakes changed the title from "server::json_reply() raises error after receiving packets on the port" to "Ethernet starts breaking after no response to inbound packets on the port" on Sep 14, 2020