Mounting several volumes at once fails very frequently #125
Hey @tsmetana, are you trying to mount multiple different file systems or the same file system? Can you try killing the hung stunnel process to see whether the mount recovers? Also, if you have the stunnel debug log, feel free to paste it here. Thanks.
Hi. Thanks for the quick response. It was EC2. I don't have the debug logs, since raising the debug level makes the problem hard to reproduce (I may try...). And it didn't occur to me to try killing the stunnel process; IIRC re-running mount.nfs manually helped. But none of that would be a viable workaround... As mentioned, this is originally coming from the EFS CSI driver, so just creating several EFS volumes (PVs) and deleting them has a high chance of failing (since deleting them means mounting them on the Kubernetes master). I have straces of the waiting stunnel and RPC logs, but there is nothing obvious in those.
Thanks. kvifern@ provided a workaround that launches a unit to monitor the file system in kubernetes-sigs/aws-efs-csi-driver#616 (comment). Killing the stunnel process works because the watchdog will relaunch a new stunnel, which will reconnect to the server. You can try that and verify whether it works for you. Meanwhile, we are actively looking into this kind of issue right now.
Also, I wonder if you can upgrade your stunnel version to 5.63. I left a comment in kubernetes-sigs/aws-efs-csi-driver#616 (comment); based on our findings so far, the issue could be related to stunnel itself. We are trying to reproduce it on our side as well with various versions. Let me know, thanks.
I forgot to mention: I have also tried stunnel-5.62 (rebuilt from the latest Fedora package) on CentOS 8 and it didn't help: the mounts still hung. By the way, the problem in kubernetes-sigs/aws-efs-csi-driver#616 is slightly different from what I'm observing: I don't even get to mount the filesystem. Though I also suspect they might be related.
Next time you encounter the problem, can you paste the stunnel log here if you have it? Thanks.
Hi. Sorry for the delayed responses; I finally got to retesting this. I used only an EC2 VM with CentOS Stream 8 (stunnel-5.56-5.el8_3.x86_64, openssl-libs-1.1.1k-6.el8.x86_64) and the latest efs-utils from git HEAD (amazon-efs-utils-1.32.1-1.el8.noarch built with make rpm). If I try to mount 30 subdirectories of an EFS volume at once, some of them fail to mount almost every time. The error from the failed mounts:
So the culprit seems to be the TLS port selection in mount.efs. It goes in sequence through suitable ports and tries to bind each one; if the bind succeeds, it closes the socket and records that port in the stunnel configuration file, which then gets written. If several mount.efs processes are running in parallel, there is quite a lot of time between the port being released and the configuration being written, enough for another mount.efs to pick the same port because it is momentarily free (I found another mount from the same "batch" using the selected port). And it never seems to recover from the failed stunnel startup.
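To make the race easier to see, here is a minimal Python sketch of the probe-then-release pattern described above. This is a simplified illustration, not the actual efs-utils code; the port range matches the 400-port default mentioned later in the thread.

```python
import socket

def pick_free_port(lower=20049, upper=20449):
    """Probe ports in sequence and return the first one that can be bound."""
    for port in range(lower, upper):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("localhost", port))
            return port  # the port looked free a moment ago...
        except OSError:
            continue     # already in use, try the next one
        finally:
            sock.close() # ...but it is released here, before stunnel ever starts

# Between pick_free_port() returning and stunnel binding the port, a concurrent
# mount.efs can probe the same port, find it free, and choose the same number.
```

The real code is more involved than this, but the essential check-then-use gap is the same: nothing holds the port between the probe and the moment stunnel binds it.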
Thanks @tsmetana for the logs! What you described is valid. I think the port can already be taken, since there is a gap between launching stunnel and verifying that the tlsport is in use; thus it is possible for two mount commands to use the same tlsport and for one of them to time out because its TLS tunnel cannot be launched. efs-utils is an ad-hoc command line tool, so internally it does not handle this kind of concurrent mounting well. Are you trying to mount them all at once via /etc/fstab, or is it an application launching 30 threads and mounting those directories? Previously we have suggested that customers mount EFS file systems sequentially in /etc/fstab: https://docs.aws.amazon.com/efs/latest/ug/troubleshooting-efs-mounting.html#automount-fix-multiple-fs Another thought is to increase the port range in the config file; currently there are 400 ports, and you can bump it to 4000 ports to decrease the chance of a tlsport conflict. We will engage someone to take a look at this kind of issue.
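For reference, a widened port range would look roughly like the snippet below in /etc/amazon/efs/efs-utils.conf. The option names and the 20049–20449 default are quoted from memory of the efs-utils config, so treat them as an assumption and check them against your installed file before editing.

```ini
[mount]
# Default range is 400 ports (20049-20449); widening it to ~4000 ports
# lowers the chance that two concurrent mounts pick the same TLS port.
port_range_lower_bound = 20049
port_range_upper_bound = 24049
```

Note that this only reduces the collision probability; it does not remove the race itself.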
Hi. The test case is basically this:
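A rough sketch of such a parallel-mount test, assuming 30 subdirectories of a single EFS file system; the file system ID fs-12345678 and the /mnt/efs-test paths are placeholders, and the real test may differ.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def mount_one(i):
    target = f"/mnt/efs-test/{i}"
    os.makedirs(target, exist_ok=True)
    # Equivalent to: mount -t efs -o tls fs-12345678:/dir<i> /mnt/efs-test/<i>
    return subprocess.run(
        ["mount", "-t", "efs", "-o", "tls", f"fs-12345678:/dir{i}", target],
        capture_output=True, text=True,
    )

with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(mount_one, range(30)))

for i, result in enumerate(results):
    if result.returncode != 0:
        print(f"mount {i} failed: {result.stderr.strip()}")
```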
Please note that the parallel mount is what Kubernetes does (by default) in the CSI driver / provisioner, so even though there is a documented workaround for people using EFS mounts in fstab, we still have quite an unreliable (or slow to mount) EFS in Kubernetes. As for the fix, I had several ideas, but I think the most reliable one would be to account for this error, choose a different port, and restart stunnel. It may also be possible to close the socket only after the config file is written, just before stunnel starts. Since the port is included in the name of the stunnel config file, we could also check for the existence of such a config when selecting the port number and not attempt to use a port for which another stunnel has already been configured. Not sure how many changes it would take, though... :) I may try it myself eventually.
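A minimal sketch of two of the mitigations suggested above: hold the probe socket open until the config is written, and skip ports that already appear in another stunnel config file name. The state directory and file-name pattern are assumptions for illustration, not the actual efs-utils layout.

```python
import glob
import os
import socket

STATE_DIR = "/var/run/efs"  # assumed location of per-mount stunnel config files

def port_already_configured(port):
    # Another mount.efs that picked this port should have written a config
    # file with the port number in its name (naming pattern assumed here).
    return bool(glob.glob(os.path.join(STATE_DIR, f"*.{port}")))

def reserve_port(lower=20049, upper=20449):
    for port in range(lower, upper):
        if port_already_configured(port):
            continue
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("localhost", port))
        except OSError:
            sock.close()
            continue
        # Return the still-bound socket so the caller can hold the port
        # until the stunnel config referencing it has been written.
        return port, sock
    raise RuntimeError("no free TLS port available")

port, sock = reserve_port()
# ... write the stunnel config that references `port` ...
sock.close()
# ... launch stunnel; if it fails to bind, pick another port and retry ...
```

Even with the socket held, there is still a brief window between close() and stunnel binding the port, so retrying on a failed stunnel start is the part that actually makes the mount robust.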
Still broken in v1.33.1.
I've noticed this still hits the race condition with v1.33.1 (note the duplicate “binding 20417” below). What's interesting is that while an error was logged, the mount did not fail. If it had, our automated retries would have picked it up. What happened instead was a very confusing situation where the filesystem actually mounted was the other one, so the mount appeared to be successful but all of the contents were wrong. Trapping the stunnel error and either retrying or failing the mount attempt would have made this a problem lasting only a few seconds.
In the logs you are posting, I can see that the one mounting to
This is partially correct: stunnel fails to start, but that error doesn't appear to be handled beyond logging it; the mount operation continues and ultimately reports success.
In the example above,
Closing as we've resolved this issue with the v1.34.4 release.
We're encountering problems when trying to mount multiple EFS volumes at once. The mount process gets stuck; when trying to debug the RPC layer, there are occasional
nfs: server 127.0.0.1 not responding, timed out
errors in the log (not sure if those are related; mount.efs should retry, AFAIK). The stunnel processes serving the mount RPC connections seem to just be waiting for a connection, but nothing happens. This problem has been observed only with CentOS 8 (or CentOS Stream 8) running stunnel-5.56-5.el8_3 and openssl-libs-1.1.1k-5.el8_5. When trying Amazon Linux 2 with stunnel-4.56-6.amzn2.0.3 and openssl-libs-1.0.2k-19.amzn2.2.0.10, everything works OK. I suspected this was a race in stunnel, so I recompiled stunnel-5.56-5 and installed it on Amazon Linux 2, but the issue is still not reproducible there, so it's not stunnel (or not stunnel by itself).
The issue also seems to be quite timing-sensitive: increasing the log level or changing stunnel options affects how likely the problem is to show up. I've tried removing PID file creation (since issue #112 looked quite similar), but it doesn't seem to help; I can still see the pending mounts. I also suspected issue #105, but even after fixing that (I hope; PR #119) the mounts still get stuck.
I wonder if the problem in issue #114 is related: we mostly encounter this problem through the efs-csi-driver on Kubernetes clusters when creating and removing multiple EFS volumes in the cluster in one shot.
I'm curious whether somebody has more insight or has encountered this problem: it looks like a combination of multiple factors causes it, and I have failed to find any interesting debugging clues.