[Support]: sudden disappearance of user-space TCP streams #286
Comments
Hi xguerin and thanks for choosing ec2!
Best regards,
Thanks @shaibran for your response! Here are the pieces of information requested:
Some timestamps where we lost activity (time in EST):
I'll edit that comment with some other occurrences as they happen.
Could reprogramming the RETA dynamically explain this behavior? When the connections are live I don't see any spurious loss of connectivity. But then, every once in a while, I recycle a subset of these connections, and then all hell breaks loose. Edit: no material difference.
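For reference, a dynamic RETA update in DPDK goes through `rte_eth_dev_rss_reta_update()`. The sketch below is illustrative only (not the reporter's code) and shows redirecting a single bucket to another queue, assuming DPDK 22.11 naming:

```c
/* Illustrative sketch: redirect one RETA bucket to a different RX queue
 * at runtime. Assumes DPDK 22.11 (RTE_ETH_RETA_GROUP_SIZE naming). */
#include <string.h>
#include <rte_ethdev.h>

static int reta_redirect(uint16_t port_id, uint16_t bucket, uint16_t new_queue)
{
    struct rte_eth_dev_info info;
    int ret = rte_eth_dev_info_get(port_id, &info);
    if (ret != 0 || info.reta_size == 0)
        return -1;

    /* Each rte_eth_rss_reta_entry64 covers 64 consecutive RETA buckets. */
    struct rte_eth_rss_reta_entry64 reta[info.reta_size / RTE_ETH_RETA_GROUP_SIZE];
    memset(reta, 0, sizeof(reta));

    uint16_t group  = bucket / RTE_ETH_RETA_GROUP_SIZE;
    uint16_t offset = bucket % RTE_ETH_RETA_GROUP_SIZE;
    reta[group].mask = 1ULL << offset;    /* only this bucket is updated */
    reta[group].reta[offset] = new_queue; /* its new destination queue   */

    return rte_eth_dev_rss_reta_update(port_id, reta, info.reta_size);
}
```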
I set up mirroring to analyze the traffic from the VPC standpoint. For some reason, I could not use my c6i, so I configured a c5n instead. The problem has virtually disappeared, as I am seeing an order of magnitude fewer disconnections. (No problem on c5 or c6in either.)
Hi, some updates on the check on the EC2 side:
Thanks @shaibran for the update. I will keep an eye on those queues and update to 23.11 so I can track the overruns. However, even if the reader ended up dropping packets, that would not explain why the flow stopped abruptly, nor why retransmissions and keep-alive requests would not get a response from the peer. More interestingly, I can say with a high degree of confidence that this issue is happening only on
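As a side note on tracking those overruns: the per-queue counters exposed through the DPDK xstats API can be polled periodically. A minimal sketch follows; the `rx_q` name filter is an assumption, and the exact ENA counter names may differ.

```c
/* Sketch: dump per-RX-queue extended stats to watch for drops/overruns.
 * The "rx_q" substring filter is a guess at the ENA counter naming. */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <rte_ethdev.h>

static void dump_rx_queue_xstats(uint16_t port_id)
{
    int n = rte_eth_xstats_get(port_id, NULL, 0); /* query the number of xstats */
    if (n <= 0)
        return;

    struct rte_eth_xstat xstats[n];
    struct rte_eth_xstat_name names[n];

    if (rte_eth_xstats_get(port_id, xstats, n) != n ||
        rte_eth_xstats_get_names(port_id, names, n) != n)
        return;

    for (int i = 0; i < n; i++)
        if (strstr(names[i].name, "rx_q") != NULL)
            printf("%s = %" PRIu64 "\n", names[i].name, xstats[i].value);
}
```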
Thanks for the important input, we will look into this issue.
Hello, following up on this thanks to the unexpected link created with issue #9: we have been facing a similar issue with netmap for quite a while now (on FreeBSD):
This randomly affects our customers' virtual appliances deployed on EC2. Surprisingly, using netmap in emulated mode does not help, which leads us to think this issue lies with the ENA driver itself. Kind regards
The DPDK and FreeBSD drivers are completely different, so it is unlikely that this is driver related. I will have the FreeBSD developers look into this, but the log you shared might indicate that the threshold for refilling the RX ring should be modified. This might be more visible when the application does not release processed ingress packets fast enough, so the device does not have free buffers in the RX ring.
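To illustrate the refill-threshold point in general terms (this is not the actual ENA or FreeBSD code): many drivers only replenish the RX ring once the number of empty descriptors crosses a threshold, so a consumer that holds on to buffers too long can leave the device with nothing to receive into.

```c
/* Generic illustration of a refill-threshold policy, not driver code. */
#include <stdbool.h>
#include <stdint.h>

struct rx_ring {
    uint16_t size;          /* total descriptors in the ring          */
    uint16_t free_count;    /* descriptors currently without a buffer */
    uint16_t refill_thresh; /* refill only once this many are empty   */
};

static bool rx_ring_needs_refill(const struct rx_ring *ring)
{
    /* If the application releases buffers slowly, free_count can reach
     * ring->size (no buffers posted at all) before a refill triggers. */
    return ring->free_count >= ring->refill_thresh;
}
```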
Hi @alexissavin,
Also, the print "Bad buff in slot" comes from the function that allocates buffers for the RX ring in netmap mode, and that allocation is failing. So the driver is failing to refill the RX ring and thus retries over and over again. Could you please share reproduction steps with akiyano@amazon.com so that we can investigate further?
Many thanks for your feedback. I'm unfortunately unable to share an STR at the moment without providing you with our own software. I'm trying to reproduce the issue using pkt-gen and will let you know the results by email or in a separate issue, to avoid polluting this thread, which seems unrelated. I only want to share that the latest version of the ENA driver (Elastic Network Adapter (ENA) ena v2.6.3) no longer seems to crash in emulated mode (admode=2). However, it still crashes rapidly in native mode (admode=1) at about 20,000 QPS. So I do believe the issue is within the driver, most likely in a section related to netmap support. Kind regards
Resolving this issue; feel free to reopen (against the specific ENA driver) if needed.
Preliminary Actions
Driver Type
DPDK PMD for Elastic Network Adapter (ENA)
Driver Tag/Commit
librte-net-ena23/mantic,now 22.11.3-1
Custom Code
No
OS Platform and Distribution
Linux 6.5.0-1009-aws #9-Ubuntu
Support request
This is probably a long shot, so I apologize in advance for the seemingly broad scope of my question.
I use the ENA PMD in conjunction with a user-space TCP/IP stack. Under heavy load (200+ connections), the device simply stops receiving packets for certain connections. By that I mean literally no packets: no retransmissions, no RST, nothing. Other active connections work just fine. Of course, a similar setup running on kernel sockets does not show this behavior. All the connections are long-lived and rarely reset.
I analyzed packet dumps collected off the device DPDK interface (tx_burst and rx_burst) and the affected streams look kosher: ACKs happen on time (either quick or delayed) and the window is properly advertised.
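For context, one way such per-burst dumps can be hooked in DPDK (a sketch of the general technique, not necessarily the reporter's setup) is an RX callback registered on each queue:

```c
/* Sketch: inspect every received mbuf before it reaches the application
 * by hooking rte_eth_add_rx_callback() on each RX queue. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static uint16_t rx_dump_cb(uint16_t port_id, uint16_t queue_id,
                           struct rte_mbuf *pkts[], uint16_t nb_pkts,
                           uint16_t max_pkts, void *user_param)
{
    (void)port_id; (void)queue_id; (void)max_pkts; (void)user_param;
    for (uint16_t i = 0; i < nb_pkts; i++) {
        /* e.g. copy rte_pktmbuf_mtod(pkts[i], void *) and
         * rte_pktmbuf_pkt_len(pkts[i]) into a pcap writer here */
    }
    return nb_pkts; /* pass the burst through unchanged */
}

/* Registered once per RX queue after device start, e.g.:
 *   rte_eth_add_rx_callback(port_id, queue_id, rx_dump_cb, NULL);
 */
```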
The queue is configured with its maximum RX and TX buffers (I'm running on a c6i, so 1024 TX and 2048 RX) and the mempools are on average underutilized (so no buffer overrun). Driver stats don't report any errors either. The `igb_uio` driver is loaded with `wc_activate=1`.

Packets for each connection are routed using their Toeplitz hash in the device's RETA. The read logic looks like this:
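(The snippet below is a minimal reconstruction, not the original code: one `rte_eth_rx_burst()` poll per RSS queue, with `stack_input()` as a hypothetical stand-in for the user-space stack's ingress path.)

```c
/* Minimal sketch: each worker lcore polls the RX queue that the RETA maps
 * its connections to, and hands every received mbuf to the TCP/IP stack. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Hypothetical ingress entry point of the user-space stack. */
static void stack_input(struct rte_mbuf *m)
{
    (void)m; /* the real stack would parse and consume the packet here */
}

static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *burst[BURST_SIZE];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, burst, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            stack_input(burst[i]);
            rte_pktmbuf_free(burst[i]); /* the stub above does not retain the mbuf */
        }
    }
}
```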
All IRQs are bound to CPU0 and the application is running on CPUs 1-3, using `isolcpus`. The instance is configured with 1 HW thread. There is virtually no starvation.

My question basically is: is there a chance, even a minute one, that such packet "disappearance" could have a low-level root cause, either from the HW or the PMD? Or a misconfiguration/misuse of the PMD?
Thanks,