-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GNU jobserver posix client support #2474
base: master
Are you sure you want to change the base?
Conversation
3dc8b5e
to
6280cdd
Compare
I forgot I have to adapt to a windows build even if I'm not going to support it on windows... |
c1c6829
to
8530799
Compare
The CI for Windows is happy now, but it would be nice to have a tester for Windows... |
src/jobserver.h
Outdated
/// The number of currently acquired tokens, or the jobserver status if negative. | ||
/// Used to verify that all acquired tokens have been released before exiting, | ||
/// and when the implicit (first) token has been acquired (initialization). | ||
/// -1: initialized without a token | ||
/// 0: uninitialized or disabled | ||
/// +n: number of tokens in use | ||
int token_count_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to point out this concept and ask whether or not this usage of an int is too unusual or non-standard.
This coincides with the last line of the constructor Jobserver::Jobserver() and the value of capacity in CanRunMore() using the absolute value function.
It's pretty easy to rework this. I just happened to have this idea first, to use the token number in place of Enabled() when there actually are no tokens yet.
8530799
to
da3903d
Compare
found and fixed some minor mistakes, added some commit tags |
da3903d
to
dd128f2
Compare
sorry I didn't realize that there were "builder" constructors for the test suite, I built and ran the test suite this time. |
oops... I went too fast and everything is still "green" on segfault... |
dd128f2
to
7afdfb7
Compare
|
The |
ah, I see what you mean, it errored on readability... but the other thing I have doubts about
|
7afdfb7
to
ea8efb0
Compare
the bool now defaults to false, and set true with a simple if instead of a ternary, like in the rest of the project 👍🏼 |
I believe that all the minor issues caught by the CI are handled now... |
ea8efb0
to
db72b31
Compare
some simplification: I was reading the Google Style Guide and saw this
so changes in the last push are:
Then I realized that I no longer need to create a Jobserver object for
then finally, another opportunity to save lines:
|
db72b31
to
b820f3c
Compare
updated commit message |
@jhasse can we run the CI workflow again? |
b820f3c
to
363f209
Compare
small organization update:
|
363f209
to
947e29f
Compare
src/jobserver-posix.cc
Outdated
Jobserver::Jobserver() { | ||
assert(!Enabled()); | ||
|
||
// Return early if no makeflags are passed in the environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move all parsing logic to a separate static method that can be unit-tested properly. Also be aware that the GNU Make documentation states explicitly: Be aware that the MAKEFLAGS variable may contain multiple instances of the --jobserver-auth= option. Only the last instance is relevant. Hence this should be implemented properly (and tested).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, see also to implement the following requirement:
Your tool may also examine the first word of the MAKEFLAGS variable and look for the character n. If this character is present then make was invoked with the ‘-n’ option and your tool may want to stop without performing any operations
From https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, see also to implement the following requirement:
Your tool may also examine the first word of the MAKEFLAGS variable and look for the character n. If this character is present then make was invoked with the ‘-n’ option and your tool may want to stop without performing any operations
From https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html
I wish they didn't mention this in that part of the manual. It's completely out of scope for the jobserver. The jobserver client should not be responsible for determining whether ninja does a "dry run" even though it happens to be parsing MAKEFLAGS which can have many different flags for many different reasons...
This should be implemented in a separate commit (and separate PR) and directly with the BuildConfig object instead of involving the Jobserver objects. Traditionally, ninja does not rely on the environment for any of it's configuration flags, so that's another conversation as well.
src/jobserver-posix.cc
Outdated
// Tokenize string to characters in flag_, then words in flags_. | ||
while (flag_char_ < strlen(makeflags)) { | ||
while (flag_char_ < strlen(makeflags) && | ||
!isblank(static_cast<unsigned char>(makeflags[flag_char_]))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isblank() is locale-dependent, which leads to surprises and is generally slow. It is easier to just compare with ' '
and ' t'
in this case.
src/jobserver-posix.cc
Outdated
|
||
#include "util.h" | ||
|
||
Jobserver::Jobserver() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend splitting this PR into multiple commits, i.e.:
-
One that adds the Jobserver class, and its Posix implementation + appropriate unit-tests for it. Also try to make the class as independent from the rest of Ninja as possible (e.g. to not call Warning() or Info() in the parser function, leave that to clients).
-
One that adds usage of the class to build.cc / ninja.cc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend splitting this PR into multiple commits, i.e.:
In my opinion, the first commit in should be functional so that other projects can easily pull the patch while waiting for a release, and also for the "bisect rule", so each individual checkout is functional on it's own. I'm planning on Windows implementation and tests to be separate commits while trying to keep this one small (except for the new files)...
if (!jobserver_fifo_) | ||
Warning("pipe closed: %d (mark the command as recursive)", rfd_); | ||
else | ||
Fatal("failed to read from jobserver: %d: %s", rfd_, strerror(errno)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to fallback gracefully to the usual mode when this happens instead?
In general, it is better to avoid calling Fatal() in methods like these, because these conditions cannot be properly unit-tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the remaining calls to Fatal() are for the following cases:
- MAKEFLAGS from environment told ninja that a fifo object needs to have file descriptors open for it. The open() syscall is ran and errored.
- File descriptors have been opened to the fifo object, but reading from it has failed not due to blocking. This likely means that the fifo object has been deleted while we have file descriptors that point to nothing, or maybe a problem with the filesystem itself.
- File descriptors have been opened to the fifo object, but writing to it has failed, again suggesting that the fifo object has been deleted while we have file descriptors that point to nothing or a problem with the filesystem.
They can become warnings, but I think they protect execution from continuing when something is extremely wrong. It should be exceedingly rare for the program to step into these Fatal() calls and they are probably out of scope for unit tests anyway, but I'll let you decide.
7b35dc7
to
4e19445
Compare
big update... summary:
|
4e19445
to
0da771e
Compare
smaller update, this might be considered mergeable now. summary:
|
/// It must be called for each successful call to Acquire() after the command | ||
/// even if subprocesses fail or in the case of errors causing Ninja to exit. | ||
/// Ninja is aborted on write errors, and otherwise calls always succeed. | ||
virtual void Release(unsigned char*) {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain what the pointer being passed as argument here should point to? E.g. is it ok to call this with a pointer that points to a value that was never the result of a previous Acquire() code (as suggested by your Clear() function implementation). In which case, what is the default value to be used (also apparently \0
from the code, but you should make that clear in the documentation). Also will the function modify the pointed value or not?
Ideally, users of the API should not have to guess these details by looking at the source code.
I suggest writing a dedicated move-only Token class, even if trivial, to better encapsulate these semantics.
/// A wrapper for token values acquired or released to the pool.
/// A default instance has no value, and is used to indicate that
/// no token is available.
struct Token {
// Default constructor builds a value-less token.
Token() = default;
// Explicit constructor for a Token with a value from the pipe.
explicit Token(uint8_t value) : value_(static_cast<int>(value)) {}
// Move operations are allowed.
Token(Token&& other) noexcept : value_(other.value_) { other.value_ = -1; }
Token& operator=(Token&&) noexecpt = default;
// Copy operations are forbidden.
Token(const Token&) = delete;
Token& operator=(const Token&) noexcept = default;
/// Returns true if this instance contains a value received from the pipe.
bool HasValue() const { return value_ != -1; }
/// Return underlying value. It is a runtime error to call this method
/// if HasValue() returns false.
uint8_t GetValue() const {
assert(HasValue());
return static_cast<uint8_t>(value_);
}
int value_ = -1;
};
Then you can have Acquire()
return a Token by value, and Release()
take a token by value, and forget about pointers entirely, e.g.:
/// Try to acquire a token from the pool. A value-less token instance is returned
/// if no token is available.
virtual Token Acquire() { return Token(); }
/// Release a previously acquire token to the pool. Does nothing if the
/// token argument has no value.
virtual void Release(Token token) {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote comments in the header to reflect the change in functionality and new returns and arguments.
Making a whole class just to manage the value of the tokens seems complicated and overkill to me. It makes the function declarations read better, but the Token class itself is not very readable and adds a lot of object constructs for not much benefit...
Is there any benefit to avoiding pointers here? or avoiding char
?
What if we just made a simple typedef for the tokens?
I'm very confident that a jobserver will never return a NUL char unless it's to indicate "no tokens available" just as I am using it to mean. I'm also pretty confident that Ninja will never receive anything other than a '+'
when being used with Make, probably even with forks of Make...
0da771e
to
0100109
Compare
applied some review comments summary:
|
@digit-google for these comments from you, would you consider any of them critical that I must do? Or perhaps keeping things simple may be better at this point? Presence of |
0100109
to
da179c9
Compare
|
Oh, I realize I was completely wrong in one of my earlier comments where I stated that Ninja would not pass file descriptor to command sub-processes. It actually totally does, so using (Turns out that posix_spawn() doesn't close heritable file descriptors as I mistakenly thought). |
No problem, I had to double-check because I wasn't sure myself, but at least documentation on file descriptors is pretty clear. There is a lot of confusion about how |
The core principle of a jobserver is simple: before starting a new job (edge in ninja-speak), a token must be acquired from an external entity as approval. Once a job is finished, the token is returned to represent a free job slot. In the case of GNU Make, this external entity is the parent process which has executed Ninja and is managing the load capacity for all subprocesses which it has spawned. Introducing client support for this model allows Ninja to give load capacity management to it's parent process, allowing it to control the number of subprocesses that Ninja spawns at any given time. This functionality is desirable when Ninja is part of a bigger build, such as Yocto/OpenEmbedded, Openwrt/Linux, Buildroot, and Android. Here, multiple compile jobs are executed in parallel in order to maximize cpu utilization, but if each compile job in Ninja uses all available cores, the system is overloaded. This implementation instantiates the client in real_main() and passes pointers to the Jobserver class into other classes. All tokens are returned whenever the CommandRunner aborts, and the current number of tokens compared to the current number of running subprocesses controls the available load capacity, used to determine how many new tokens to attempt to acquire in order to try to start another job for each loop to find work. Jobserver related functions are defined as no-op for Windows pending Windows-specific support for the jobserver. Co-authored-by: Martin Hundebøll <martin@geanix.com> Co-developed-by: Martin Hundebøll <martin@geanix.com> Signed-off-by: Martin Hundebøll <martin@geanix.com> Signed-off-by: Michael Pratt <mcpratt@pm.me>
da179c9
to
9d67962
Compare
|
} | ||
|
||
// --jobserver-auth=<val> | ||
for (size_t n = 0; n < flags.size(); n++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these loops look like they're missing break;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lack of break
is intentional, so only the last matching value is the one that takes effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like it would be better to reverse the loop then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure it can be reversed, but since this will only be run once per execution of ninja it's not performance critical, so I lean towards the simplest way to write it with less lines and no brackets, etc...
while (flag_char < strlen(makeflags)) { | ||
while (flag_char < strlen(makeflags) && | ||
!isblank(static_cast<unsigned char>(makeflags[flag_char]))) { | ||
flag.push_back(static_cast<unsigned char>(makeflags[flag_char])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if emplace_back and no cast is better here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure exactly what the difference is, I would have to read about it...
the cast is not strictly necessary, it's just protection against invalid characters that I ran into while reading about tokenizing strings.
I saw something about using a std::vector<uint8_t>
instead of std::string
but that might be confusing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apparently after optimization, push_back
and emplace_back
are exactly the same, unless you are doing fancy constructors for the elements in the vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... no cast is better here.
@neheb I got it from the example here
@digit-google @neheb would this be better? to make the range explicitly equal to
|
a rework of #2450 supporting all versions of GNU Make, but without Windows support
(I'm not able to test for Windows, and I have doubts with proposed Windows support)
resolves #1139 for posix systems
thanks to @hundeboll for much of the work with this newer implementation
ping @jhasse @digit-google
significant differences:
referencespointers to the jobserver class into other classesSIZE_MAX