Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENV is not thread safe with glibc #34726

Open
c42f opened this issue Feb 11, 2020 · 25 comments
Open

ENV is not thread safe with glibc #34726

c42f opened this issue Feb 11, 2020 · 25 comments

Comments

@c42f
Copy link
Member

c42f commented Feb 11, 2020

For some reason I was reading our ENV code, and I noticed we don't have a lock around our calls to getenv and setenv on linux, or for iterating the environment via the environ variable.

This makes ENV unsafe for use with multithreaded julia, as getenv and setenv are not mutually threadsafe in glibc. (See https://github.com/bminor/glibc/blob/master/stdlib/getenv.c and https://github.com/bminor/glibc/blob/master/stdlib/setenv.c. Unsurprisingly getenv is safe by itself, but mysteriously setenv is protected by a lock which getenv ignores! So you can mutually setenv safely, but nobody can presume to getenv elsewhere in the same multithreaded code.)

We can easily add some more locks on our side as mitigation, but unfortunately we can't really fix setenv without fixing glibc; a random C library we link against may decide to call getenv at any point. This definitely happens in practice and I've experienced it personally: OpenMathLib/OpenBLAS#716 (comment). There's a nice discussion on the rust issue tracker including this gem: rust-lang/rust#24741 (comment).

Some options:

  1. Mitigation: Put our own lock around getenv/setenv/environ access, and just hope that no C library we link against calls these functions itself. Add a big warning to the docs. This is the current rust approach though it's fragile.
  2. Avoidance: Ban using libc setenv entirely; create a shadow environment which is a copy of the system environment. This is the C# approach but with Julia's tradition of calling into C libraries I doubt that will work out (it creates surprises even in C#; see https://yizhang82.dev/set-environment-variable)
  3. Fix glibc??: Every portable language runtime has this exact problem, so people have tried fixing this in the past. In older times they were met with surprising hostility. The seemingly-current bug is marked suspended https://sourceware.org/bugzilla/show_bug.cgi?id=15607#c2 whatever that means.

For now the only easy / possible thing to do is to clearly option (1): add some locking and a big warning. If glibc could be fixed it could morph into a long term plan. More recent bugs suggest the glibc maintainers may possibly accept a patch.

As a side note, I feel we should consider removing withenv in the future because the shape of that API misrepresents reality. On the surface withenv appears to offer dynamic scoping, but ENV is global state. So withenv can never work reliably for concurrent code. That is, unless we took option (2) and avoid the C environment completely.

@ViralBShah
Copy link
Member

Does musl also have the same issue?

@c42f
Copy link
Member Author

c42f commented Feb 11, 2020

Does musl also have the same issue?

It certainly seems so. In fact there's even less locking:
https://git.musl-libc.org/cgit/musl/tree/src/env/getenv.c
https://git.musl-libc.org/cgit/musl/tree/src/env/setenv.c

I should point out that independently of any implementation this is a POSIX issue which doesn't require these functions to be threadsafe. See notes at http://man7.org/linux/man-pages/man3/setenv.3.html for example.

@c42f
Copy link
Member Author

c42f commented Feb 11, 2020

this is a POSIX issue

Having said that, I feel it's reasonable for libc implementations to mitigate this, because regardless of the posix standard, the way that C libraries use these functions in the wild has proven to be unreliable. But fixing this is not at all easy even inside glibc. As described here https://sourceware.org/bugzilla/show_bug.cgi?id=15607#c4 it may be necessary for glibc to leak some memory every time setenv is called because some other part of the process may be holding onto the result of a previous getenv.

Amusingly windows gets all this right and GetEnvironmentVariableW and friends just do what you'd expect and want with no fuss.

@ecsx1
Copy link

ecsx1 commented Feb 11, 2020

There's been some discussion on musl mailing list back in 2015 about getenv and thread-safety.
https://www.openwall.com/lists/musl/2015/03/04/11 cc @richfelker

@richfelker
Copy link

Putting locks around them does not make thread-unsafe functions thread-safe. A fairly large number of libc functions are specified to use the environment, and if you modify the environment, via setenv or direct manipulation of environ, any concurrent call to one of them produces a data race and thereby undefined behavior. musl is pretty strict about not introducing environment access into functions not specified to depend on the environment, but glibc is not, so it's hard to even make a viable approach of adding locks all the functions that might use the environment.

The right answer is simply that you don't modify the environment; it's effectively immutable input state of the program. To execute another program with changes to the environment, you use one of the exec/posix_spawn family functions that takes a complete new environment as an argument rather than editing your own environment first. If you have a language interpreter/runtime that allows logical modification to the environment (like the shell does), you don't actually modify your own environment, but keep a separate logical copy that's accessed under appropriate locks that you expose back to the hosted program and export to child processes when execing them.

The only problem this does not solve is a need to set environment variables that change behavior of library functions within the current process, like TZ.

@Keno
Copy link
Member

Keno commented Feb 12, 2020

Our process spawning already operates on copies of the environment, so that's not the issue (though as mentioned we do still have withenv and of course the ENV["FOO"] = "BAR" accessors that do call the underlying libc setenv). The big problem is the last issue mentioned:

The only problem this does not solve is a need to set environment variables that change behavior of library functions within the current process, like TZ.

C libraries (not just the libc) unfortunately often like to use the environment for configuration options that we dynamically need to change. Where we have influence over the C library in question, I think we should make it a policy for these libraries to have an API other than the environment for setting configuration options.

For other libraries that we don't control and that are unwilling to take such patches, I'm not sure what to do though. Perhaps the only thing we can do is put a big scary warning in the docs like Rust did.

There's also the question what to do with withenv. In theory, we could just have that be a task-local immutable overlay over the actual process environment. That would achieve the dynamically-scoped environment manipulation that I do think is probably useful for people using Julia for shell-like tasks. Of course at that point, we'd probably still need a way to modify the global environment (unsafe_setenv!?) for aforementioned miscellaneous C libraries, but perhaps that's a better situation in that it signals to users to avoid this interface if possible.

@c42f
Copy link
Member Author

c42f commented Feb 12, 2020

Thanks @richfelker for your perspective! You're right; the existence of direct access to environ and the fact that getenv returns a pointer to memory owned by libc (or the system) leaves no reasonable solution in sight for *libc. Either

  1. Using the result of getenv in the presence of setenv leads to a data race or
  2. setenv would have to permanently leak the memory from the previous environment.

We can somewhat-mitigate this from within julia by using a lock and copying the result of getenv into our own buffer before releasing the lock, but in no way does that fix the problem globally.

Also thanks for pointing out something which I hadn't considered; that setenv has an implicit data race with several libc functions. (The list for musl looks very short: https://wiki.musl-libc.org/environment-variables.html)

The only problem this does not solve is a need to set environment variables that change behavior of library functions within the current process, like TZ.

Exactly. I imagine this is why Julia's ENV currently uses setenv the way it does. We're now in a sticky situation because changing to an immutable environment model would leave no way to control libraries which rely on the environment. Perhaps we could backpedal a bit in future and expose an unsafe_setenv to change the C environment, but make ENV modify a copy of it.

If we'd prefer the leaky approach in julia, another nasty idea is to take matters into our own hands on linux and point environ to some memory we own. Of course then it's not clear what to do if some C library calls setenv.

@c42f
Copy link
Member Author

c42f commented Feb 12, 2020

There's also the question what to do with withenv. In theory, we could just have that be a task-local immutable overlay over the actual process environment. That would achieve the dynamically-scoped environment manipulation that I do think is probably useful for people using Julia for shell-like tasks. Of course at that point, we'd probably still need a way to modify the global environment (unsafe_setenv!?) for aforementioned miscellaneous C libraries, but perhaps that's a better situation in that it signals to users to avoid this interface if possible.

I'm feeling like this is the way to go. In addition, make ENV lookup also access that task-local immutable overlay of the process environment. At the moment, I feel like I see various julia libraries recommending ENV for configuration, and having syntax as simple and enticing as ENV["FOO"] = "BAR" as a surprising alias for please_create_a_data_race_with_some_random_c_library() doesn't seem great :-)

@c42f
Copy link
Member Author

c42f commented Feb 14, 2020

What do people think about compatibility here? I'm happy to do the work to fix this but I'm fairly sure we can't fix it completely without breaking compatibility. So we'll have to decide which trade off to take.

Here's several possible plans:

Plan 1

  • Make withenv add a Dict or possibly ImmutableDict{String,String} to task_local_storage (using :ENV_overlay as key?)
  • Make all ENV access first look that up before falling back to the global ccall to getenv. Including getindex and iteration.
  • Add locks around all the internal access to environ via getenv etc
  • Soft-deprecate setindex!(ENV) by adding a big warning discouraging it.
  • Add unsafe_setenv! and document that it's inherently unsafe, but is for configuring C libraries which insist on using environment variables for config.

This will avoid julia-only crashes but will do nothing for getenv data races in C libraries. It should keep most things working though there may be some edge cases with withenv and C libraries (which while they may work have no hope of ever working concurrently).

Plan 2

As in Plan 1, but essentially make ENV a normal Dict which is a copy of the environ at julia initialization. Add locks around it so that modifying it is threadsafe.

This will fix the data races and preserve normal within-julia use of ENV as it's done currently. However, it will break any code which is relying on C libraries seeing the updated ENV and those would have to transition to unsafe_setenv!. However doing that via Compat would be simple.

Plan 3

Make ENV an immutable copy of the environment. Encourage use of withenv (implemented as above) for convenience in launching processes. Provide unsafe_setenv! for desperate situations and discourage its use.

This may be conceptually clean but it will break so many scripts and startup files that I think it's Julia 2.0 material.

It has the benefit of preventing confusion for those who expect that ENV can be "truly" modified. On the other hand it would be less natural for simple imperative-style scripting which wants to do a lot of process spawning.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 14, 2020

musl is pretty strict about not introducing environment access into functions not specified to depend on the environment, but glibc is not, so it's hard to even make a viable approach of adding locks all the functions that might use the environment.

Perhaps I'm being naive, but this doesn't seem so bad to me. We don't use a whole lot of libc functions and we can use even fewer pretty easily. Auditing which glibc calls we use that depend on the environment doesn't seem too bad. I guess the bad/intractable thing is that others may well write Julia code that uses libc directly via ccall or indirectly via calls to C that use libc, none of which would do the appropriate locking.

@richfelker
Copy link

I guess the bad/intractable thing is that others may well write Julia code that uses libc directly via ccall or indirectly via calls to C that use libc, none of which would do the appropriate locking.

Exactly. If it's intended that programs be able to call C library functions, then the problem is not tractable.

@richfelker
Copy link

Regarding #34726 (comment), I'm not a julia user, but I would think plan 2 is the best solution.

@StefanKarpinski
Copy link
Member

Here's an alternative. Since calling setenv is rare and slow anyway, halt all threads but the currently executing one while doing it. I.e. not just holding a lock which other threads have to know about: shut them down entirely and only continue them after the environment is modified.

@vtjnash
Copy link
Member

vtjnash commented Feb 14, 2020

Yeah, we can't entirely avoid them: we ship with e.g. BLAS which uses them, so working around this is problematic. It seems like it'd have to be mitigated at the libc level, but only Win32 exposes a thread-safe API. As c42f said above, it seems that libc maintainers could have chosen to mitigate this, though not required by posix to do so, but have apparently generally opted not to.

I'd suggest a plan 4:
Eliminate withenv, rename setenv! to unsafe_setenv!, use (and improve) setenv for launching processes.

@yuyichao
Copy link
Contributor

Stopping the thread won’t work. You either cannot guarantee that the code will ever finish or you can’t know you didn’t interrupt in the middle of a critical region.

@richfelker
Copy link

@StefanKarpinski: Exactly what @yuyichao just said - that can't work unless you have a way to wait for each thread to reach a "quiescent" waypoint where you know it can't be in a critical section.

@tkf
Copy link
Member

tkf commented Feb 15, 2020

  • Make withenv add a Dict or possibly ImmutableDict{String,String} to task_local_storage (using :ENV_overlay as key?)

I suppose this breaks processes launched in sub-tasks?

withenv("FOO" => "BAR") do
    @sync @async run(`sh -c 'echo $FOO'`)
end

For this to work, I suppose we need something like Context Variables in Python (ref PEP 567)?

For running commands, why not just encourage the pure API setenv(cmd, pairs...)? Maybe it's worth adding a convenience API setenv(cmd, ::AbstractDict, ::Pair...) as it's rather inconvenient to modify existing environment variables using setenv.

@c42f
Copy link
Member Author

c42f commented Feb 17, 2020

It seems like it'd have to be mitigated at the libc level, but only Win32 exposes a thread-safe API.

I was thinking more about this, especially about what happens when a library calls the C standard getenv on windows instead of directly calling the Win32 API. Well of course libraries may be likely to do this, and of course (in hindsight) it doesn't work. So even on windows the average C library will have exactly the same bug. For example I found this old issue from 2014 OpenMathLib/OpenBLAS#394 ... and to my surprise it's @vtjnash reporting this exact issue to OpenBLAS.

As c42f said above, it seems that libc maintainers could have chosen to mitigate this, though not required by posix to do so, but have apparently generally opted not to.

It's interesting that FreeBSD seems to have chosen to leak memory in this case - see BUGS section in https://www.freebsd.org/cgi/man.cgi?query=setenv&sektion=3&manpath=freebsd-release-ports. Though the man pages don't say what happens when using direct access to environ.

I'd suggest a plan 4:
Eliminate withenv, rename setenv! to unsafe_setenv!, use (and improve) setenv for launching processes.

Right, withenv seems pretty redundant with setenv. But what should we do about our current attractive-but-dangerous support for C setenv via the syntax

ENV["A"] = "B"

I feel like this is widely used and will break a fair few things if it's removed during Julia 1.x.

Plan 2 gives us a way of not breaking most uses of this syntax (at the cost of a confusing decoupling between the Julia ENV and underlying C env).

@c42f
Copy link
Member Author

c42f commented Feb 17, 2020

I suppose this breaks processes launched in sub-tasks?
For this to work, I suppose we need something like Context Variables in Python (ref PEP 567)?

Yes, task_local_storage doesn't play nicely with @async at all, but this is "just another" case of it. So I'm kind of presuming any solution to that general problem will also fix withenv as a special case. (I definitely think we need to sort this out and I think we need "some kind" of context inheritance from the parent task. I've thought a bit about it but haven't written anything yet. I'm hoping we may have some type-based equivalent of that PEP ContextVar which is arranged so that users can access task local storage in a more type stable way. But that's a whole separate discussion we can have elsewhere.)

@tkf
Copy link
Member

tkf commented Feb 17, 2020

Yes, I agree that context variables support should be in a separate discussion. I rather wanted to mention that, because it would take some time to get context variables machinery, plan 1 is probably not the best solution.

Plan 2 gives us a way of not breaking most uses of this syntax (at the cost of a confusing decoupling between the Julia ENV and underlying C env).

How about having a deprecation period (a few minor releases) where ENV[k] = v still invokes unsafe_setenv! with a deprecation warning? During this period, users would do using Future: ENV to use plan 2 ENV.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 17, 2020

I don't believe that we can just deprecate ENV[k] = v without causing massive havoc in the ecosystem, so I'm trying hard to think of alternatives. There are basically three use cases for ENV[k] = v:

  1. communicate to Julia code in the same process that looks at ENV;
  2. spawn a subprocess with this changed environment;
  3. affect libraries via the libc environment.

We can make (1) threadsafe by maintaining our own shadow ENV dict that is separate from the libc environment. This shadow ENV can also be passed to subprocesses, which addresses (2) as well. It seems that (3) causes all the problems. Am I missing any cases?

@richfelker
Copy link

I think that pretty well covers it. I'm confused how (3) is actually a major issue though. Generally people don't do this (setenv) in C when calling libraries that use the environment to get defaults/settings, because it's not safe to do so there either. Instead, most libraries just use the environment for defaults and let you set the same parameters globally or in a context via some sort of runtime mechanism too. Using the environment for runtime configuration after process lifetime has started is anti-idiomatic in C to begin with.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 17, 2020

Right, so in that case, we could just break (3) without breaking every single other use of ENV in the ecosystem. When people complain about their C libraries no longer picking up changes to ENV, we can say "don't do that, use the library's configuration API"; and we can also have an explicitly thread-unsafe unsafe_flush_env() function that copies the shadow ENV back to libc's environment if someone really just wants to make this work and doesn't care about thread safety.

@richfelker
Copy link

Yes, I think that sounds very reasonable.

@tgross35
Copy link

Rust wound up making env::set_var unsafe starting with the 2024 edition, the discussion is mostly at rust-lang/rust#27970 (that is probably the most extensive discussion on this topic). The gist is that there is no way to lock environment in a way that is effective across ecosystems, and that libcs probably won't be fixed because POSIX doesn't specify this (and shouldn't need to). Instead, setenv should be discouraged in almost all circumstances, which is about the same conclusion in this thread.

My extrapolation is that if any libraries read environment for configuration that is expected to be set at runtime, it should be considered a bug in that library. Instead the library needs to manage and lock its own state, read the environment once for defaults, and then provide API to update the state. It sounds like OpenBLAS may fall into this category?

TZ seems to be the common exception here. Rich mentioned that there has been discussion about eventually replacing this but it sounds like this is unlikely to be usable anytime soon.

glibc also recently merged bminor/glibc@7a61e7f, which improves the situation for existing code but doesn't change that the "correct" thing is to just never write the environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants