
net: QEMU Ethernet drivers are flaky (seemingly after "net_buf" refactor) #13943

Closed
pfalcon opened this issue Feb 28, 2019 · 21 comments · Fixed by #15241
Assignees
Labels
area: Ethernet area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@pfalcon
Contributor

pfalcon commented Feb 28, 2019

Testing the 3 QEMU Ethernet drivers we added in the 1.14 timeframe, with the idea of using them to test the IP stack better.

master: e731bdc
sample: samples/net/sockets/dumb_http_server
test command on host: ab -n1000 http://192.0.2.1:8080/

  1. BOARD=qemu_x86 OVERLAY=overlay-e1000.conf
    Not a single request succeeds:
$ ab -n1000 http://192.0.2.1:8080/
...
Benchmarking 192.0.2.1 (be patient)
apr_socket_recv: Connection reset by peer (104)

On Zephyr console (after ~5 ab attempts):

Booting from ROM..***** Booting Zephyr OS zephyr-v1.13.0-5221-g8ecd348022f9 *****
[00:00:00.019,243] <inf> net_config: Initializing network
[00:00:00.021,990] <inf> net_config: IPv4 address: 192.0.2.1
Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:59.445,932] <err> eth_e1000: Out of memory for received frame
  2. BOARD=mps2_an385 OVERLAY=overlay-smsc911x.conf
    Zephyr:
Connection #137 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #138 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #139 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #140 from 192.0.2.2
...(lockup, pings don't work)...
  3. BOARD=qemu_cortex_m3 OVERLAY=overlay-qemu_cortex_m3_eth.conf
    Build error:
[ 54%] Building C object zephyr/CMakeFiles/zephyr.dir/drivers/ethernet/eth_stellaris.c.obj
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c: In function 'eth_stellaris_stats':
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:291:24: error: 'struct eth_stellaris_runtime' has no member named 'stats'
  return &(DEV_DATA(dev)->stats);
                        ^~
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c: At top level:
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:349:2: error: unknown field 'get_stats' specified in initializer
  .get_stats = eth_stellaris_stats,
  ^
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:349:15: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
  .get_stats = eth_stellaris_stats,
               ^~~~~~~~~~~~~~~~~~~
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:349:15: note: (near initialization for 'eth_stellaris_apis.start')
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c: In function 'eth_stellaris_stats':
/home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/drivers/ethernet/eth_stellaris.c:292:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
@pfalcon pfalcon added bug The issue is a bug, or the PR is fixing a bug area: Networking labels Feb 28, 2019
@pfalcon
Contributor Author

pfalcon commented Feb 28, 2019

@rlubos, @rveerama1, @aurel32 FYI

@pfalcon
Contributor Author

pfalcon commented Feb 28, 2019

qemu_cortex_m3 OVERLAY=overlay-qemu_cortex_m3_eth.conf
Build error:

Patch submitted: #13945

With it:

Single-threaded dumb HTTP server waits for a connection on port 8080...
Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #1 from 192.0.2.2
Connection from 192.0.2.2 closed
[lockup]

@pfalcon pfalcon added this to the v1.14.0 milestone Feb 28, 2019
@pfalcon pfalcon added priority: medium Medium impact/importance bug area: Ethernet labels Feb 28, 2019
@jukkar
Member

jukkar commented Feb 28, 2019

Tried with echo-server, UDP works just fine but TCP refuses to connect.

@pfalcon
Contributor Author

pfalcon commented Feb 28, 2019

@jukkar

Tried with echo-server, UDP works just fine but TCP refuses to connect.

Thanks for testing, what BOARD it was?

@jukkar
Member

jukkar commented Feb 28, 2019

Thanks for testing, what BOARD it was?

Just qemu_x86

jukkar added a commit to jukkar/zephyr that referenced this issue Feb 28, 2019
Multiple flag bits were set so the ACK flag was not checked
properly, which meant that connection establishment was not
successful.

Fixes zephyrproject-rtos#13943

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
nashif pushed a commit that referenced this issue Mar 1, 2019
@pfalcon
Contributor Author

pfalcon commented Mar 1, 2019

#13957 improved, but did not fix, the situation. Reopening to track it over time.

@pfalcon pfalcon reopened this Mar 1, 2019
@pfalcon
Contributor Author

pfalcon commented Mar 1, 2019

Master 5f2099f

  1. BOARD=qemu_x86 OVERLAY=overlay-e1000.conf
$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
apr_socket_recv: Connection reset by peer (104)
Total of 64 requests completed

$ ab -n1000 http://192.0.2.1:8080/
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
apr_socket_recv: Connection reset by peer (104)
Total of 262 requests completed

I.e., both of the 2 test runs aborted due to ECONNRESET, which is worse than my test yesterday for #13957 (no implications, it's just visibly flaky and non-deterministic). But at least Zephyr didn't lock up.

  2. BOARD=mps2_an385 OVERLAY=overlay-smsc911x.conf
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 434 requests completed

After that, Zephyr is locked up.

@pfalcon
Contributor Author

pfalcon commented Mar 27, 2019

I finally did a new test sweep against revision 9722214. The results can be found in https://docs.google.com/spreadsheets/d/1_8CsACPEXqrMIbxBKxPAds091tNAwnwdWkMKr3994QY/edit#gid=0 as usual. Summary:

  • eth_stellaris (qemu_cortex_m3) is completely broken. It sends packets with 14 extra bytes, which points to some issues with Ethernet header handling.
  • eth_e1000 mostly works, though I had a case of it resetting the connection with dumb_http_server, and there is regular "Out of memory for received frame" error logging.
  • eth_smsc911x works without issues, yay.

@laperie
Collaborator

laperie commented Apr 5, 2019

Assigning to @pfalcon -- for stellaris it should be assigned to @bravegnu (see @pfalcon's comment above), but somehow I cannot do that in the GitHub interface.

@laperie laperie added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Apr 5, 2019
@laperie
Collaborator

laperie commented Apr 5, 2019

Also lowering the priority, since there's only one driver that is not working.

@tbursztyka tbursztyka changed the title net: QEMU Ethernet drivers are flaky after "net_buf" refactor (actually, whole-stack shakeover) net: QEMU Ethernet drivers are flaky Apr 5, 2019
@tbursztyka
Collaborator

Contacting eth_stellaris maintainer

@bravegnu
Contributor

bravegnu commented Apr 5, 2019

Going through the change history, this change bff65b6#diff-36181bf0c30748729a35e42e076ac604L72 seems to have introduced a regression. The 2-byte data_len written first should not include the header size.

bravegnu added a commit to zilogic-systems/zephyr that referenced this issue Apr 7, 2019
As part of the ll_reserve refactoring effort, the packet length now
includes the header size as well. Before the refactor, when the packet
length was written to the device, it did not include the header size,
which is the required value as per the LM3S6965 datasheet. After the
refactor the packet length includes the header size as well. The
header size has to be subtracted from the packet length before writing
to the device. Fixes zephyrproject-rtos#13943.

Signed-off-by: Vijay Kumar B <vijaykumar@zilogic.com>
nashif pushed a commit that referenced this issue Apr 11, 2019
@pfalcon
Contributor Author

pfalcon commented Apr 22, 2019

Unfortunately, as of master 8e307a3, this issue isn't really fixed: running ab -n1000 http://192.0.2.1:8080/ against dumb_http_server still times out and locks up after 3 requests.

@pfalcon pfalcon reopened this Apr 22, 2019
@pfalcon pfalcon changed the title net: QEMU Ethernet drivers are flaky net: QEMU Ethernet drivers are flaky after "net_buf" refactor (actually, whole-stack shakeover) Apr 22, 2019
@bravegnu
Contributor

@pfalcon Is this witnessed only with the stellaris driver or with other drivers as well?

@pfalcon
Contributor Author

pfalcon commented Apr 25, 2019

@bravegnu: The current picture I have in my head re: 3 eth qemu drivers is:

  • smsc911x works pretty well
  • e1000 throws weird pkt alloc warnings and sometimes hangs
  • stellaris hangs on 3rd apache bench request

But that's a somewhat dated picture; I'd encourage trying all 3 drivers with the current codebase and seeing.

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

Retesting with 34b95fe (2.1.0-wip).

All tests below with dumb_http_server sample. (Results posted in individual comment for easier review/reference).

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

qemu_x86 + overlay-e1000.conf

$ ab -n1000 http://192.0.2.1:8080/
[...]
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
apr_socket_recv: Connection reset by peer (104)
Total of 738 requests completed

Pings don't work after this.

On Zephyr console side:

$ grep "Out of memory" e1000.log | wc
    296    2664   20424

I.e., during this run there were 296 errors like:

[00:06:18.040,000] <err> eth_e1000: Out of memory for received frame

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

mps2_an385 and overlay-smsc911x.conf

$ ab -n1000 http://192.0.2.1:8080/
[...]
Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        2122 bytes

Concurrency Level:      1
Time taken for tests:   212.802 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      2181000 bytes
HTML transferred:       2122000 bytes
Requests per second:    4.70 [#/sec] (mean)
Time per request:       212.802 [ms] (mean)
Time per request:       212.802 [ms] (mean, across all concurrent requests)
Transfer rate:          10.01 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.4      1       8
Processing:   208  212   4.6    209     244
Waiting:        5    6   1.6      5      30
Total:        209  213   4.8    210     249

Percentage of the requests served within a certain time (ms)
  50%    210
  66%    210
  75%    220
  80%    220
  90%    220
  95%    220
  98%    220
  99%    221
 100%    249 (longest request)

Pings work afterwards.

Nothing special on Zephyr console side.

@pfalcon
Contributor Author

pfalcon commented Oct 11, 2019

qemu_cortex_m3 and overlay-qemu_cortex_m3_eth.conf

$ ab -n1000 http://192.0.2.1:8080/
[...]
Benchmarking 192.0.2.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 3 requests completed

On Zephyr console:

[QEMU] CPU: cortex-m3
***** Booting Zephyr OS build zephyr-v1.14.0-4643-g34b95feb54aa *****
[00:00:00.000,000] <inf> net_config: Initializing network
[00:00:00.000,000] <inf> net_config: IPv4 address: 192.0.2.1
Single-threaded dumb HTTP server waits for a connection on port 8080...
Connection #0 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #1 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #2 from 192.0.2.2
Connection from 192.0.2.2 closed
Connection #3 from 192.0.2.2
Connection from 192.0.2.2 closed

Surprisingly, after this run pings work (usually it's completely deadlocked).

$ ping 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=3077 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=5 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=6 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=7 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=8 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=9 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=10 ttl=64 time=3072 ms
64 bytes from 192.0.2.1: icmp_seq=11 ttl=64 time=3072 ms

Look how the time is a perfect multiple of a power of 2!

Let's restart the sample and start with pings (output abridged):

$ ping 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=5.23 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.536 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=0.666 ms
[...]
64 bytes from 192.0.2.1: icmp_seq=33 ttl=64 time=0.708 ms
64 bytes from 192.0.2.1: icmp_seq=34 ttl=64 time=0.555 ms
64 bytes from 192.0.2.1: icmp_seq=35 ttl=64 time=0.408 ms
64 bytes from 192.0.2.1: icmp_seq=36 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=37 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=38 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=39 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=40 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=41 ttl=64 time=1024 ms
64 bytes from 192.0.2.1: icmp_seq=42 ttl=64 time=1028 ms
64 bytes from 192.0.2.1: icmp_seq=43 ttl=64 time=392 ms
64 bytes from 192.0.2.1: icmp_seq=44 ttl=64 time=2035 ms
64 bytes from 192.0.2.1: icmp_seq=45 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=46 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=47 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=48 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=49 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=50 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=51 ttl=64 time=2052 ms
64 bytes from 192.0.2.1: icmp_seq=52 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=53 ttl=64 time=2044 ms
64 bytes from 192.0.2.1: icmp_seq=54 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=55 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=56 ttl=64 time=2048 ms
64 bytes from 192.0.2.1: icmp_seq=57 ttl=64 time=2048 ms

I.e., it starts ok, then a 1024ms delay kicks in. Then 2048ms, then, as we saw, 3072ms, etc.

None of this is new; we saw similar behavior with frdm_k64f, perhaps imx1050, etc. It's unclear (at least to me) where this power-of-2 delay may come from. For explicit delays, we usually use power-of-10 values, like 500ms or 1000ms.
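One hedged guess at where such delays could come from is a timer-driven path whose base period is itself a power of 2: if a reply only gets serviced after N expirations of a ~1024 ms tick (e.g. a retransmit or polling timer; the 1024 ms base here is purely an assumption for illustration, not something confirmed in the stack), the observed round-trip times land exactly on the 1024/2048/3072 ms plateaus seen in the ping logs. A toy model:

```python
BASE_TICK_MS = 1024  # hypothetical power-of-2 base period (assumption)

def observed_delay_ms(expired_ticks: int) -> int:
    """Round-trip time if a reply is only serviced after
    `expired_ticks` expirations of the base timer."""
    return expired_ticks * BASE_TICK_MS

# 1, 2 and 3 expired ticks reproduce the plateaus in the logs above.
print([observed_delay_ms(n) for n in (1, 2, 3)])  # [1024, 2048, 3072]
```

If the delays were explicit sleeps in the code, one would instead expect the power-of-10 values (500 ms, 1000 ms) mentioned above.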

@pfalcon pfalcon changed the title net: QEMU Ethernet drivers are flaky after "net_buf" refactor (actually, whole-stack shakeover) net: QEMU Ethernet drivers are flaky (seemingly after "net_buf" refactor) Oct 22, 2019
@pfalcon
Contributor Author

pfalcon commented Oct 22, 2019

I tried to bisect the tree to check the behavior of eth_stellaris before the net_buf refactor. But it turns out the driver was effectively merged in the midst of that refactor and was initially not fully operational, as it needed updates for already-merged refactoring patches. In other words, there's no in-tree bisect point where the driver worked better than it does now, and in the early days it was not buildable at all. Trying to fix the build issues, I got runtime crashes instead.

I also tried to compare driver source with eth_smsc911x, which works the best so far (for me), and also look into Stellaris datasheet, but I couldn't see anything obviously wrong.

So, I put this aside for now again. (My idea was to get network tests to work on the old QEMU shipped in distro packages, but upgrading to Zephyr's QEMU seems like a better option given all the issues above.)

@jukkar
Member

jukkar commented Jan 28, 2020

Going through old issues. This is a good candidate for closing. Can be re-opened if really needed.

@jukkar jukkar closed this as completed Jan 28, 2020