Feature request: always_softlink for conda create --clone
#7867
Comments
Soft links will not work because code will call realpath on them and mess them up.
@mingwandroid thank you. Is there a way to enforce keeping softlinks?
Do you think it would work if I were to interpose that realpath call?
The strength of your feelings on this subject is not relevant because it is technically unsupportable. Even when we use
http://man7.org/linux/man-pages/man3/realpath.3.html I understand how it works. I am not asking to change that function :-) What I was asking is whether there is a way to keep the softlinks.
Thanks.
Please read what I wrote carefully. It is not conda calling realpath that is the problem. It is the thousands of packages we provide that can and will call realpath, and when they do that, they will break.
Well, we used to allow symlinks for executables, but we had to stop doing that because realpath breaks them. To explain that more, any of the hundreds of packages we build may call realpath. Please stop insisting there's a way this can be made to work; there is not.
Thanks @mingwandroid - I understand now. You hadn't referenced "thousands of packages" in your previous response, so I was under the impression we were talking just about conda itself. I am surprised some packages would call realpath.
Hmm, I'm not even talking about just Python packages, and with respect to Python packages I am not talking about realpath before importing a package; I am talking about any use of the glibc realpath function.
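
To make that failure mode concrete, here is a minimal, self-contained sketch. All paths are created in a temporary directory and are purely illustrative (they are not conda's real layout): a tool that derives its install prefix from the resolved path of its own executable ends up back in the source environment instead of the symlinked clone.

```python
import os
import tempfile

root = tempfile.mkdtemp()
source_bin = os.path.join(root, "base", "bin")
clone_bin = os.path.join(root, "clone", "bin")
os.makedirs(source_bin)
os.makedirs(clone_bin)

# The "real" interpreter lives in the source env; the clone only symlinks it.
open(os.path.join(source_bin, "python"), "w").close()
os.symlink(os.path.join(source_bin, "python"),
           os.path.join(clone_bin, "python"))

invoked = os.path.join(clone_bin, "python")   # the path the user runs
resolved = os.path.realpath(invoked)          # glibc-style resolution -> .../base/bin/python

# Any package that computes its prefix from the resolved path now points at
# the source env, silently bypassing everything installed in the clone.
print(os.path.dirname(os.path.dirname(invoked)))   # .../clone
print(os.path.dirname(os.path.dirname(resolved)))  # .../base
```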
... your real problem is that it is slow to create envs when copying is used and the source is NFS. If you want to file a bug along those lines then we can think about ways to improve that. One idea would be to copy the tarball and extract it locally, though decompressing .tar.bz2 is very slow, so that may need to wait until we change to a compression format with much faster decompression speeds. Still, such an 'improvement' would be somewhat site-specific, i.e. slow end-user CPUs with very fast networks vs. fast end-user CPUs with very slow networks.
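
For what it's worth, a rough sketch of that copy-the-tarball idea, under assumed hypothetical paths (install_from_nfs is not a real conda function): copy each compressed package off NFS in one sequential read, then pay the decompression cost locally.

```python
import os
import shutil
import tarfile
import tempfile

def install_from_nfs(pkg_tarball_on_nfs, env_prefix):
    """Copy one .tar.bz2 package off NFS, then extract it locally."""
    local_copy = os.path.join(tempfile.mkdtemp(),
                              os.path.basename(pkg_tarball_on_nfs))
    shutil.copyfile(pkg_tarball_on_nfs, local_copy)  # one big sequential network read
    with tarfile.open(local_copy, "r:bz2") as tar:   # bz2 decompression: slow, as noted above
        tar.extractall(env_prefix)                   # CPU-bound, but all local I/O
    os.remove(local_copy)

# Hypothetical usage:
# install_from_nfs("/nfs/pkgs/somepkg-1.0-py37_0.tar.bz2", "/opt/envs/myenv")
```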
Thank you a lot for the detailed explanation there. Some of those details are beyond my knowledge. Thank you for providing the gcc example - I just tried it myself.
That's very interesting; I am not sure I understand why gcc (or some other packages) would need to call realpath, as they can keep working with the symlinks, and glibc would dereference the symlink each time, transparently to the application (or to any of those thousands of packages).
I will think about whether I have other ideas. On decompressing: that's an option too, although it would be very slow, as you mentioned - a big overhead each time an application starts. Good points on CPU vs. network, I agree. I also found https://github.com/conda/conda-build yesterday, which seems like it might be somewhat helpful for some of our use cases, for example with Spark on YARN - https://conda.github.io/conda-pack/spark.html - as YARN can distribute packages as part of application submission. Although, again, I will think about whether better, more generic options are possible here. Thanks again.
Is that not a big concern when symlinking to your NFS server? Each file access will go across the network. I really believe that copying is the best option for this kind of setup at present. Yes, it takes more space, but you should gain a good speed increase vs. going over the network.
Nope. To start an application you need Python, some shared libraries, and some basic Python packages. Those are not huge, vs. the anaconda example I gave earlier, which is almost 3 GB of on-disk space. It would be a huge overhead to unpack those 3 GB, for example. Most applications would use just a handful of packages from Anaconda, and after the first NFS access, the OS will most likely cache those remote files. So I expect it would be much faster with the NFS approach I was suggesting earlier. Thank you.
The Anaconda installers aren't geared towards such uses. They are more for students or people new to Python who want a big collection of packages pre-installed without having to spend time figuring out how to use conda (though they really should!). It is a meta-package that consists of 200 actual packages. In my copy-the-package proposal, it is just the needed packages from those 200 that would be sent over the network, not all 200 of them. I think you are probably better off using Miniconda than Anaconda - I mean, who's ever going to run all 200 of them?
That's a totally valid point!
Hi there, thank you for your contribution to Conda! This issue has been automatically locked since it has not had recent activity after it was closed. Please open a new issue if needed.
We have a use case where we'd love to have an option for conda create --clone to force using softlinks across filesystems instead of copying files, which is what conda create does currently. The option could be called something like force_softlinks or always_softlinks, as mentioned in #3373.

PS. More detail about the use case: it is around distributed compute environments, where the same conda virtual environment has to be available on all worker nodes. We already have shared NFS storage that is mounted on all worker nodes (at the same mount location). Currently, creating and cloning users' conda environments in some cases takes a lot of time (some of them specify anaconda as a dependency, so an environment can be as big as 2.5 GB with hundreds of packages). All those worker nodes also have root conda environments that live on each worker too (in another location, but also consistent across all servers). If conda allowed cloning that root environment (which is what users will be cloning) using symlinks back to it, the amount of work required to create an environment would be much smaller. We expect it would speed up conda create --clone quite a bit in this scenario. It also has another nice side effect: worker servers would read most of that conda environment locally and not from the NFS share. That's okay for one server, but when tens of servers start reading a few GB from NFS, it easily adds up to significant network traffic.

Thank you!
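
Purely as an illustration of the requested behavior (this is not conda's implementation, and clone_with_symlinks is a hypothetical helper): mirror the source environment's directory tree and create a symlink for every file instead of copying it.

```python
import os

def clone_with_symlinks(src_prefix, dst_prefix):
    """Mirror src_prefix's directory tree, symlinking files instead of copying."""
    for dirpath, _dirnames, filenames in os.walk(src_prefix):
        rel = os.path.relpath(dirpath, src_prefix)
        target_dir = os.path.join(dst_prefix, rel) if rel != "." else dst_prefix
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # Each file in the clone points back at the local root env.
            os.symlink(os.path.join(dirpath, name),
                       os.path.join(target_dir, name))

# Hypothetical usage: clone the worker-local root env into a per-user env on NFS.
# clone_with_symlinks("/opt/conda", "/nfs/envs/user-env")
```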