Teach dune make's jobserver API #4331
Conversation
will be useful for the job server Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
Great! Just to be clear, it is adding dune as a user of the API (e.g. dune is a sub-process of make), not the other way around.
For the other way around, i.e. dune sharing parallelism with the commands it starts, there is one pitfall: if a command crashes without releasing its tokens, then those tokens are lost forever and parallelism is reduced for the rest of the build. That's something to be careful about in this PR as well: does Dune make sure to always release its tokens when it is a client?
Yeah, although it would be nice if dune was a server as well.
Not necessarily. When the process that dune launched terminates, we can always release any unreleased tokens.
Yeah, that's a good point. Let me double check.
But how do you do that?
FTR, we stumbled upon this while implementing the protocol in jenga, and AFAIK it is not possible to recover the lost tokens, as the protocol doesn't tell you who is holding them. We looked this up and found out that they hit the same problem in Rust, but didn't find a solution. It seems to be a problem with the protocol itself. So for this reason I'm a bit nervous to expose this to commands run by dune directly. At the same time, it would be nice to share parallelism when you call |
Proposition |
But what if you have several such actions running in parallel? Ideally you want to give tokens to the action that wants more tokens, but there is no way of knowing that. What I had in mind was something more lightweight: we add an opt-in and document the fact that such actions have to be careful to always give back their tokens. We can assume that make is safe, given that this protocol was designed for recursive invocations of make.
I think the opt-in is interesting and works well with what I propose.
I see. So the issue is that whenever a token is read from the fifo, we don't know who the reader is. I wonder if we should just tweak the protocol to fix this issue, and use this new protocol to communicate with opam. Dune could still respect make's protocol, but would prefer its own when available. The main goal for me with this feature is to have something that works with |
let with_job { read; write; jobserver } ~f =
  let open Fiber.O in
  let* job =
    Task_queue.task_exn read ~f:(fun () -> Jobserver.wait_next jobserver)
We shouldn't take a token for the very first command. Indeed, when a client starts, it always has at least one token, otherwise it wouldn't be running. When the client wants to do two things in parallel, it should use its current token for one of the things.
Another way to see the problem: imagine the job server has a single token. It starts a dune process, which is holding this token. Now dune wants to spawn a process and tries to take a token: it will wait forever because there is no more token.
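As a rough illustration of this rule, here is a minimal OCaml sketch (illustrative only, not the PR's Fiber-based code; all the type and function names are made up): the first concurrent job reuses the implicit token the client was started with, and only additional jobs read a byte from the jobserver pipe, so a single-token jobserver can never deadlock the client on its first command.

type token =
  | Implicit            (* the token the client already holds at startup *)
  | From_pipe of char   (* a byte read from the jobserver's read end *)

type t =
  { read_fd : Unix.file_descr
  ; write_fd : Unix.file_descr
  ; mutex : Mutex.t
  ; mutable implicit_in_use : bool
  }

let create ~read_fd ~write_fd =
  { read_fd; write_fd; mutex = Mutex.create (); implicit_in_use = false }

let acquire t =
  Mutex.lock t.mutex;
  if not t.implicit_in_use then begin
    (* First concurrent job: don't touch the pipe at all. *)
    t.implicit_in_use <- true;
    Mutex.unlock t.mutex;
    Implicit
  end
  else begin
    Mutex.unlock t.mutex;
    (* Additional jobs block until a token becomes available. *)
    let buf = Bytes.create 1 in
    if Unix.read t.read_fd buf 0 1 <> 1 then failwith "jobserver pipe closed";
    From_pipe (Bytes.get buf 0)
  end

let release t = function
  | Implicit ->
    Mutex.lock t.mutex;
    t.implicit_in_use <- false;
    Mutex.unlock t.mutex
  | From_pipe c -> ignore (Unix.write t.write_fd (Bytes.make 1 c) 0 1)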
  in
  Fiber.finalize f ~finally:(fun () ->
      let+ res =
        Task_queue.task write ~f:(fun () -> Jobserver.Job.release job)
We don't need to wait for the token to be written here. We should just queue it, signal the writer thread and move on.
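For illustration, a rough sketch of that queue-and-signal scheme using the threads library (hypothetical names, not the Task_queue used in the PR): releasing a token only pushes it onto a queue and signals a condition variable, and a single dedicated thread is the only one that ever blocks on the pipe.

type releaser =
  { queue : char Queue.t
  ; mutex : Mutex.t
  ; nonempty : Condition.t
  }

let create_releaser write_fd =
  let t =
    { queue = Queue.create ()
    ; mutex = Mutex.create ()
    ; nonempty = Condition.create ()
    }
  in
  let (_ : Thread.t) =
    Thread.create
      (fun () ->
        let buf = Bytes.create 1 in
        while true do
          Mutex.lock t.mutex;
          while Queue.is_empty t.queue do
            Condition.wait t.nonempty t.mutex
          done;
          let token = Queue.pop t.queue in
          Mutex.unlock t.mutex;
          (* Only this thread ever blocks on the jobserver pipe. *)
          Bytes.set buf 0 token;
          ignore (Unix.write write_fd buf 0 1)
        done)
      ()
  in
  t

(* Queue the token, wake the writer thread, and return immediately. *)
let release t token =
  Mutex.lock t.mutex;
  Queue.push token t.queue;
  Condition.signal t.nonempty;
  Mutex.unlock t.mutex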
@@ -0,0 +1,13 @@
open Stdune |
doc please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a URL pointing to the jobserver spec.
@@ -0,0 +1,23 @@
open Stdune |
some doc here as well please
@@ -172,6 +177,7 @@ end = struct
  | Job_completed of job * Unix.process_status
  | Signal of Signal.t
  | Rpc of Fiber.fill
  | Jobserver of Fiber.fill
Rpc and Jobserver seem to be exactly the same. What about replacing them both with a single Thread_safe_ivar event?
That seems overkill if you ask me, like trying to somehow wrap an unsafe protocol into a safe one. I think we should just use the protocol the way it was intended to be used and accept that it relies on commands doing the right thing. Now, there is another problem that requires some discussion: when a client starts, it always holds at least one token. Let's say this client wants to do X1 and X2 in parallel. For X1, it will simply reuse its current token. And before starting X1, it will spawn a separate thread that will wait for a token and do X2. Now let's say that X1 finishes before X2 has had a chance to get its token. What do you do?
Option 2 is what we did inside Jane Street. However, it requires being able to reliably cancel a system call, which is pretty tedious. @aalekseyev wrote a small library for it; we could reproduce it in Dune. But I'm curious what other implementations are doing. From what I know, Rust implemented the jobserver protocol, so it might be worth checking. If everyone is going for option 1, perhaps with some mitigation, I feel like we should do the same to make our lives easier. Plus, if they do that, whether dune does it or not won't make a big difference in the end.
Ah, well there is also the idea of having opam set up a single workspace and run dune only once, which would lead to even faster build times.
If you use |
Using |
No command will be able to do the right thing against a |
My overall thoughts on the jobserver are:
Can you point me to the protocol definition? The code of GNU make seems to set it non-blocking and use pselect.
It's said there: https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html If you put the pipe in non-blocking mode, it's going to be non-blocking for everyone, and some clients might not expect that.
I guess the implementation trumps the docs. I have to admit I never actually looked at the implementation. Indeed, the cancellation is much simpler if we require the fd to be non-blocking.
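To make that concrete, here is a small hypothetical helper (not from this PR) for reading a token from a pipe that may be in non-blocking mode: on EAGAIN/EWOULDBLOCK it waits for readability with Unix.select and retries, which is also the shape that makes cancellation simpler, since the wait could select on an extra cancellation fd at the same time.

let rec read_token fd =
  let buf = Bytes.create 1 in
  match Unix.read fd buf 0 1 with
  | 1 -> Bytes.get buf 0
  | _ -> failwith "jobserver pipe closed"
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EWOULDBLOCK), _, _) ->
    (* No token right now: block in select rather than in read, so the wait
       could be multiplexed with other fds (e.g. one used for cancellation). *)
    ignore (Unix.select [ fd ] [] [] (-1.));
    read_token fd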
Alright then: the jobserver comes from GNU Make, so let's treat GNU Make as the reference implementation. That means we have the full story for this PR, which is about making Dune a jobserver client. It'd be nice to send an email to the GNU Make devs though; maybe they are not aware that the doc is not in line with the code. @bobot, do you want to send this email? Happy to do it otherwise. BTW, I was thinking again about the loss-of-tokens issue, and it occurred to me that we can always turn a loss of tokens into a temporary loss by re-creating the jobserver in between builds in polling mode. This means that the build will temporarily be slower in some cases, but on the other hand we keep a simple design.
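A minimal sketch of that idea, assuming dune is the server for the commands it spawns in polling mode; the jobs - 1 token count and the '+' byte are conventions borrowed from GNU make and are assumptions here, not dune's actual behaviour. Because every build starts from a fresh pipe, a token lost to a crashed client only slows down the build in which it was lost.

type jobserver =
  { read : Unix.file_descr
  ; write : Unix.file_descr
  }

let create ~jobs =
  let read, write = Unix.pipe () in
  (* '+' is the byte GNU make traditionally uses; the protocol only asks
     clients to write back whatever byte they read. *)
  let token = Bytes.make 1 '+' in
  (* One token per extra job: every client keeps one implicit job. *)
  for _i = 1 to jobs - 1 do
    ignore (Unix.write write token 0 1)
  done;
  { read; write }

let destroy t =
  Unix.close t.read;
  Unix.close t.write

(* Hypothetical polling loop: one fresh jobserver per build, so lost tokens
   never survive past the current build. *)
let rec watch ~jobs ~wait_for_change ~run_build =
  let js = create ~jobs in
  run_build js;
  destroy js;
  wait_for_change ();
  watch ~jobs ~wait_for_change ~run_build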
You could send the mail, thank you.
I haven't got a response from the GNU make developers yet, but I skimmed through the mails related to the jobservers and TBH that doesn't make me very confident in using pipes for sharing parallelism:
My overall impression is that using pipes for this was a nice idea, but it has many issues in practice and might not be worth the trouble. @rgrinberg I know that you care about this for |
We just have the compilation of a clang plugin (part of Frama-Clang) that will be executed by dune, so indeed that direction. Another way to give more than one core to the subprocess: we could use a static |
That could work. Another idea would be to make Dune understand the Makefile language. A bit ambitious, but I'd be keen to try that during a Dune retreat :)
This passed me by (I blame UK lockdown 3, though I blame it for most things!). Is this still hoped for Dune 3.0? Two reasons for asking: the Windows implementation of the jobserver is really easy (it's just a named semaphore; in fact, it's not immediately clear to me why they don't allow that for the POSIX jobserver as well, but perhaps GNU make still runs on a lot of very non-POSIX-y platforms). More importantly, I did a bit of hacking on the train this weekend with opam and prototyped jobserver support for opam (it's working correctly with two |
Opam creating the jobserver is a nice case (I believe it was discussed already back in the opam 1.0 days :D). Just for fun, an article from LWN, "The edge-triggered misunderstanding", about a modification of the pipe handling in Linux which had bad consequences for |
Speaking of fun edge cases, we saw a case where Linux in some cases decides that [write] should block even though there's very little data in the internal buffer of the pipe (even 1 byte can be enough). This breaks the jobserver model (makes it deadlock). You can work around this by reading from the pipe periodically and writing again. I wonder how make deals with that.
This PR fixes #2647 btw
At this point I have seen so many problems with this protocol that we wouldn't be doing anyone a favour by implementing it. Dune or commands spawned by Dune would just start freezing in many situations, and it would be a pain to debug. Dune would just become less reliable software. Let's stop there.
Does dune expose its jobserver in some way, so that I could write an external shim that interfaces between dune's jobserver and make's jobserver?
@JasonGross how do you plan to work around the issues pointed out by Jeremie if you implement the protocol?
@rgrinberg I'm interested in a best-effort approach that doesn't tie dune to the limitations of make's jobserver protocol, but instead has a separate process that interoperates between the jobservers of dune and make. My use-cases are:
My primary desideratum is that if the overall build would succeed (given enough cores and memory) with the parallelism modes uncoupled, then the shim should not cause the build to fail / hang / deadlock. The most important issue to me seems to be:
This is easy to fix: the shim interoperating between make's jobserver and dune's should have a dedicated thread for writing to the pipe on the make side, and interaction on the dune side should not block on this thread.
@JasonGross note that for both of your use cases, alternative solutions are planned. In the first case, we will eliminate make completely; in the second, document managers should take care of that, and it is beyond what dune can do.
Since this is technically a breaking change, I thought we might as well do it in 3.0
I went for a very simple implementation that reuses the background thread task queue from csexp_rpc. There's probably a better way to do this, however.