
net: QEMU Ethernet drivers are flaky (seemingly after "net_buf" refactor) #13943

Closed
pfalcon opened this issue Feb 28, 2019 · 21 comments · Fixed by #15241
Assignees
Labels
area: Ethernet area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@pfalcon
Contributor

pfalcon commented Feb 28, 2019

Testing the 3 QEMU Ethernet drivers we added in the 1.14 timeframe, with the idea of using them to test the IP stack better.

master: e731bdc
sample: samples/net/sockets/dumb_http_server
test command on host: ab -n1000 http://192.0.2.1:8080/

  1. BOARD=qemu_x86 OVERLAY=overlay-e1000.conf
    Not a single request succeeds:
$ ab -n1000 http://192.0.2.1:8080/
...
Benchmarking 192.0.2.1 (be patient)
apr_socket_recv: Connection reset by peer (104)

On Zephyr console (after ~5 ab attempts):

Booting from ROM..***** Booting Zephyr OS zephyr-v1.13.0-5221-g8ecd348022f9 *****
[00:00:00.019,243] <inf> net_config: Initializing network
[00:00:00.021,990] <inf> net_config: IPv4 address: 192.0.2.1
Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:59.445,932] <err> eth_e1000: Out of memory for received frame
  2. BOARD=mps2_an385 OVERLAY=overlay-smsc911x.conf
    Zephyr:
Connection #137 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #138 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #139 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #140 from 192.0.2.2
...(lockup, pings don't work)...
  3. BOARD=qemu_cortex_m3 OVERLAY=overlay-qemu_cortex_m3_eth.conf
    Build error:
[ 54%] Building C object zephyr/CMakeFiles/zephyr.dir/drivers/ethernet/eth_stellaris.c.obj
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c: In function 'eth_stellaris_stats':
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:291:24: error: 'struct eth_stellaris_runtime' has no member named 'stats'
  return &(DEV_DATA(dev)->stats);
                        ^~
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c: At top level:
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:349:2: error: unknown field 'get_stats' specified in initializer
  .get_stats = eth_stellaris_stats,
  ^
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:349:15: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
  .get_stats = eth_stellaris_stats,
               ^~~~~~~~~~~~~~~~~~~
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:349:15: note: (near initialization for 'eth_stellaris_apis.start')
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c: In function 'eth_stellaris_stats':
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:292:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
@pfalcon pfalcon added bug The issue is a bug, or the PR is fixing a bug area: Networking labels Feb 28, 2019
@pfalcon
Contributor Author

pfalcon commented Feb 28, 2019

@rlubos, @rveerama1, @aurel32 FYI

@pfalcon
Contributor Author

pfalcon commented Feb 28, 2019

qemu_cortex_m3 OVERLAY=overlay-qemu_cortex_m3_eth.conf
Build error:

Patch submitted: #13945

With it:

Single-threaded dumb HTTP server waits for a connection on port 8080...
Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #1 from 192.0.2.2
Connection from 192.0.2.2 closed
[lockup]

@pfalcon pfalcon added this to the v1.14.0 milestone Feb 28, 2019
@pfalcon pfalcon added priority: medium Medium impact/importance bug area: Ethernet labels Feb 28, 2019
@jukkar
Member

jukkar commented Feb 28, 2019

Tried with echo-server, UDP works just fine but TCP refuses to connect.

@pfalcon
Contributor Author

pfalcon commented Feb 28, 2019

@jukkar

Tried with echo-server, UDP works just fine but TCP refuses to connect.

Thanks for testing, what BOARD it was?

@jukkar
Member

jukkar commented Feb 28, 2019

Thanks for testing, what BOARD it was?

Just qemu_x86

jukkar added a commit to jukkar/zephyr that referenced this issue Feb 28, 2019
Multiple flag bits were set so the ACK flag was not checked
properly, which meant that connection establishment was not
successful.

Fixes zephyrproject-rtos#13943

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
nashif pushed a commit that referenced this issue Mar 1, 2019
@pfalcon
Contributor Author

pfalcon commented Mar 1, 2019

#13957 improved, but did not fix, the situation. Reopening to track it over time.

@pfalcon pfalcon reopened this Mar 1, 2019
@pfalcon
Contributor Author

pfalcon commented Mar 1, 2019

Master 5f2099f

  1. BOARD=qemu_x86 OVERLAY=overlay-e1000.conf
$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
apr_socket_recv: Connection reset by peer (104)
Total of 64 requests completed

$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
apr_socket_recv: Connection reset by peer (104)
Total of 262 requests completed

I.e., both of the 2 test runs aborted due to ECONNRESET, which is worse than my test yesterday for #13957 (no implications, it's just visibly flaky and non-deterministic). But at least Zephyr didn't lock up.

  2. BOARD=mps2_an385 OVERLAY=overlay-smsc911x.conf
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 434 requests completed

After that, Zephyr is locked up.

@pfalcon
Contributor Author

pfalcon commented Mar 27, 2019

I finally did a new test sweep against revision 9722214. The results can be found in https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 as usual. Summary:

  • eth_stellaris (qemu_cortex_m3) is completely broken. It sends packets with 14 extra bytes, which points to some issues with Ethernet header handling.
  • eth_e1000 mostly works, though I had a case of it resetting the connection with dumb_http_server, and there is regular "Out of memory for received frame" error logging.
  • eth_smsc911x works without issues, yay.

@laperie
Collaborator

laperie commented Apr 5, 2019

Assigning to @pfalcon -- for stellaris it should be assigned to @bravegnu (see @pfalcon's comment above), but somehow I cannot do that in the GitHub interface.

@laperie laperie added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Apr 5, 2019
@laperie
Collaborator

laperie commented Apr 5, 2019

Also lowering the priority, since there's only one driver that is not working.

@tbursztyka tbursztyka changed the title net: QEMU Ethernet drivers are flaky after "net_buf" refactor (actually, whole-stack shakeover) net: QEMU Ethernet drivers are flaky Apr 5, 2019
@tbursztyka
Collaborator

Contacting eth_stellaris maintainer

@bravegnu
Contributor

bravegnu commented Apr 5, 2019

Going through the change history, this change bff65b6#diff-36181bf0c30748729a35e42e076ac604L72 seems to have introduced a regression. The 2-byte data_len written first should not include the header size.

bravegnu added a commit to zilogic-systems/zephyr that referenced this issue Apr 7, 2019
As part of the ll_reserve refactoring effort, the packet length now
includes the header size as well. Before the refactor, when the packet
length was written to the device, it did not include the header size,
which is the required value as per the LM3S6965 datasheet. After the
refactor the packet length includes the header size as well. The
header size has to be subtracted from the packet length before writing
to the device. Fixes zephyrproject-rtos#13943.

Signed-off-by: Vijay Kumar B <vijaykumar@zilogic.com>
nashif pushed a commit that referenced this issue Apr 11, 2019
@pfalcon
Contributor Author

pfalcon commented Apr 22, 2019

Unfortunately, as of master 8e307a3, this issue isn't really fixed: running ab -n1000 http://192.0.2.1:8080/ against dumb_http_server still times out and locks up after 3 requests.

@pfalcon pfalcon reopened this Apr 22, 2019
@pfalcon pfalcon changed the title net: QEMU Ethernet drivers are flaky net: QEMU Ethernet drivers are flaky after "net_buf" refactor (actually, whole-stack shakeover) Apr 22, 2019
@bravegnu
Contributor

@pfalcon Is this witnessed only with the stellaris driver or with other drivers as well?

@pfalcon
Contributor Author

pfalcon commented Apr 25, 2019

@bravegnu: The current picture I have in my head re: 3 eth qemu drivers is:

  • smsc911x works pretty well
  • e1000 throws weird pkt alloc warnings and sometimes hangs
  • stellaris hangs on 3rd apache bench request

But that's a somewhat dated picture; I'd encourage trying all 3 drivers with the current codebase and seeing.

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

Retesting with 34b95fe (2.1.0-wip).

All tests below with dumb_http_server sample. (Results posted in individual comment for easier review/reference).

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

qemu_x86 + overlay-e1000.conf

$ ab -n1000 http://192.0.2.1:8080/
[...]
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
apr_socket_recv: Connection reset by peer (104)
Total of 738 requests completed

Pings don't work after this.

On Zephyr console side:

$ grep "Out of memory" e1000.log | wc
    296    2664   20424

I.e., during this run there were 296 errors like:

[00:06:18.040,000] <err> eth_e1000: Out of memory for received frame

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

mps2_an385 and overlay-smsc911x.conf

$ ab -n1000 http://192.0.2.1:8080/
[...]
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        2122 bytes

Concurrency Level:      1
Time taken for tests:   212.802 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      2181000 bytes
HTML transferred:       2122000 bytes
Requests per second:    4.70 [#/sec] (mean)
Time per request:       212.802 [ms] (mean)
Time per request:       212.802 [ms] (mean, across all concurrent requests)
Transfer rate:          10.01 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.4      1       8
Processing:   208  212   4.6    209     244
Waiting:        5    6   1.6      5      30
Total:        209  213   4.8    210     249

Percentage of the requests served within a certain time (ms)
  50%    210
  66%    210
  75%    220
  80%    220
  90%    220
  95%    220
  98%    220
  99%    221
 100%    249 (longest request)

Pings work afterwards.

Nothing special on Zephyr console side.

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

qemu_cortex_m3 and overlay-qemu_cortex_m3_eth.conf

$ ab -n1000 http://192.0.2.1:8080/
[...]
Benchmarking 192.0.2.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 3 requests completed

On Zephyr console:

[QEMU] CPU: cortex-m3
***** Booting Zephyr OS build zephyr-v1.14.0-4643-g34b95feb54aa *****
[00:00:00.000,000] <inf> net_config: Initializing network
[00:00:00.000,000] <inf> net_config: IPv4 address: 192.0.2.1
Single-threaded dumb HTTP server waits for a connection on port 8080...
Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #1 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #2 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #3 from 192.0.2.2
Connection from 192.0.2.2 closed

Surprisingly, after this run pings work (usually it's completely deadlocked).

$ ping 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=3077 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=5 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=6 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=7 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=8 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=9 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=10 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=11 ttl=64 time=3072 ms

Look how the time is a perfect multiple of a power of 2!

Let's restart the sample and start with pings (output abridged):

$ ping 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=5.23 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.536 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=0.666 ms
[...]
64 bytes from 192.0.2.1: icmp_seq=33 ttl=64 time=0.708 ms
64 bytes from 192.0.2.1: icmp_seq=34 ttl=64 time=0.555 ms
64 bytes from 192.0.2.1: icmp_seq=35 ttl=64 time=0.408 ms
64 bytes from 192.0.2.1: icmp_seq=36 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=37 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=38 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=39 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=40 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=41 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=42 ttl=64 time=1028 ms
64 bytes from 192.0.2.1: icmp_seq=43 ttl=64 time=392 ms
64 bytes from 192.0.2.1: icmp_seq=44 ttl=64 time=2035 ms
64 bytes from 192.0.2.1: icmp_seq=45 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=46 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=47 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=48 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=49 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=50 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=51 ttl=64 time=2052 ms
64 bytes from 192.0.2.1: icmp_seq=52 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=53 ttl=64 time=2044 ms
64 bytes from 192.0.2.1: icmp_seq=54 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=55 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=56 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=57 ttl=64 time=2048 ms

I.e., it starts ok, then a 1024ms delay kicks in. Then 2048ms, then, as we saw, 3072ms, etc.

None of this is new; we saw similar behavior with frdm_k64f, perhaps imx1050, etc. It's unclear (at least to me) where this power-of-2 delay may come from. For explicit delays, we usually use power-of-10 values, like 500ms or 1000ms.
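One hedged guess at where such delays could come from is a timer-driven path whose base period is itself a power of 2: if a reply only gets serviced after N expirations of a ~1024 ms tick (e.g. a retransmit or polling timer; the 1024 ms base here is purely an assumption for illustration, not something confirmed in the stack), the observed round-trip times land exactly on the 1024/2048/3072 ms plateaus seen in the ping logs. A toy model:

```python
BASE_TICK_MS = 1024  # hypothetical power-of-2 base period (assumption)

def observed_delay_ms(expired_ticks: int) -> int:
    """Round-trip time if a reply is only serviced after
    `expired_ticks` expirations of the base timer."""
    return expired_ticks * BASE_TICK_MS

# 1, 2 and 3 expired ticks reproduce the plateaus in the logs above.
print([observed_delay_ms(n) for n in (1, 2, 3)])  # [1024, 2048, 3072]
```

If the delays were explicit sleeps in the code, one would instead expect the power-of-10 values (500 ms, 1000 ms) mentioned above.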

@pfalcon pfalcon changed the title net: QEMU Ethernet drivers are flaky after "net_buf" refactor (actually, whole-stack shakeover) net: QEMU Ethernet drivers are flaky (seemingly after "net_buf" refactor) Oct 22, 2019
@pfalcon
Contributor Author

pfalcon commented Oct 22, 2019

I tried to bisect the tree to check the behavior of eth_stellaris before the net_buf refactor. But it turns out the driver was effectively merged in the midst of that refactor and was initially not fully operational, as it needed updates for already-merged refactoring patches. In other words, there's no in-tree bisect point where the driver worked better than it does now, and in the early days it was not buildable at all. Trying to fix the build issues, I got runtime crashes instead.

I also tried to compare driver source with eth_smsc911x, which works the best so far (for me), and also look into Stellaris datasheet, but I couldn't see anything obviously wrong.

So, I put this aside for now again. (My idea was to get network tests to work on the old QEMU shipped in distro packages, but upgrading to Zephyr's QEMU seems like a better option given all the issues above.)

@jukkar
Member

jukkar commented Jan 28, 2020

Going through old issues. This is a good candidate for closing. Can be re-opened if really needed.

@jukkar jukkar closed this as completed Jan 28, 2020