-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide resiliency against inputs changing during the build #1943
Provide resiliency against inputs changing during the build #1943
Conversation
54b3de6
to
8bee1e5
Compare
8bee1e5
to
0b16783
Compare
The CMake test suite fails several tests with this change:
@jdrouhard I'll contact you externally about reproducing the failures. |
0b16783
to
956d1ce
Compare
All CMake Tests pass with this branch, now. The failing tests were due to a slight hole in a regression test I added earlier in #1935. Generator rules should continue to record outputs' mtimes as they are on disk when a command finishes instead of using the time when the command is started. This branch now makes generator rules use the existing behavior instead of the new behavior. @bradking feel free to double check this. I ran the entire CMake test suite, including all Qt5 tests. |
@jdrouhard the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering what the performance implications of touching a file are? I'm also scared about edge-cased like NFS mounted filesystems.
Shouldn't one file be enough? I know that this will result in unnecessary rebuilds if an input is changed while its edge hasn't started, but better safe than sorry (and it would bring clarity to the performance implication of this PR).
I think one file might be a good idea. First of all we can create it next to .ninja_deps and .ninja_log (a directory we definitely have write access to). Second we might re-use it as a locking mechanism: Currently it's possible to run ninja multiple times in a directory resulting in all sorts of bugs. What do you think?
Another thought: What about the hash approach Evan suggested? While this would be a HUGE change, it might be something we want to do in the future anyway, see #1459
src/build_test.cc
Outdated
ASSERT_EQ(1u, fs_.files_created_.count("out 2.rsp")); | ||
ASSERT_EQ(1u, fs_.files_created_.count("out 3.rsp")); | ||
ASSERT_EQ(1u, fs_.files_created_.count(".ninja_tmp")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this is an implementation detail and shouldn't be part of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the file is created by the virtual file system during the test; putting the assert here makes it clear why there were 3 files created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, I would rather remove the explicit check for ASSERT_EQ(files_created + 2, fs_.files_created_.size());
. This unit test shouldn't care.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll remove it.
src/build_test.cc
Outdated
ASSERT_EQ(1u, fs_.files_removed_.count("out 2.rsp")); | ||
ASSERT_EQ(1u, fs_.files_removed_.count("out 3.rsp")); | ||
ASSERT_EQ(1u, fs_.files_removed_.count(".ninja_tmp")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should definitely be part of the test though. Maybe fs_ should have a method to count files and we should check that everything was cleaned up.
@@ -198,7 +198,7 @@ TEST_F(DiskInterfaceTest, MakeDirs) { | |||
EXPECT_EQ(0, fclose(f)); | |||
#ifdef _WIN32 | |||
string path2 = "another\\with\\back\\\\slashes\\"; | |||
EXPECT_TRUE(disk_.MakeDirs(path2.c_str())); | |||
EXPECT_TRUE(disk_.MakeDirs(path2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Too bad our CI doesn't run clang-tidy on Windows (yet).
For any edge, it only touches a single file. I don't believe there will be much performance impact here since the edge command will be writing its outputs to the disk when it runs anyway.
As we've discovered in the previous PRs #1753 and #1932 we really do need to track the beginning of each edge separately. However, I can change it so the temp file we use lives next to Using it as a locking mechanism could also come in handy if we decide to clean the file up at the end of the run. I'm not going to tackle that in this PR, though.
This would be a massive change that would likely break many many things. There are many things to consider with a hash approach, and in general, using a hash instead of mtime has its own set of problems that are just different than mtimes. At least the approach in this PR makes it pretty resilient and is a definite improvement over current functionality. |
@jdrouhard there is a section of the |
What if the edge doesn't have any outputs? Also the edge writes to its outputs, you create, stat and delete a file - depending on the filesystem this might actually take longer. Just speculation on my part, too, we should really have a good way to back this up with a benchmark, see #1733.
If I understood it correctly it definitely makes things tricky, but not impossible. Can you elaborate? |
If the edge doesn't have any outputs then it doesn't create or stat the temp file. For the edges that do produce outputs, I am updating the PR to simply use a sentinel
I'm not sure what the alternative is, record the same mtime for every single output whose command was run during the course of the build? |
956d1ce
to
8d191a9
Compare
@bradking tested with Qt 5.15. restat rules must also restat the outputs when the command finishes (along with generator rules). Let me know what else you'd like to see. |
@jdrouhard could you try out my unit test from #1941, wether this goes in or that PR, I would like the unit test to be added to protect the hardlinking scenario that is causing me trouble currently in my builds. |
Thanks. With that fix, this PR as of commit |
8d191a9
to
3104540
Compare
@dmoody256 I can try your test on my branch. I'd like to see this get merged as is, and then we can change your PR to just be a new regression test, if that works for you? |
@jdrouhard sure thats fine, i'll try to follow up in my PR |
@jhasse this should be good to go. If you need more time to look it over, maybe you could at least add a label for the next ninja release? |
@jhasse ping |
This has been discussed at length in those other PRs I mentioned. I'm pretty sure you were part of those discussions. In any case, the build log needs to remember an mtime it can use to compare against on subsequent runs. Where do we source that mtime? It has to come from something at the time the build command begins or we lose the "anchor" time for knowing whether something was changed since the inputs were read by whatever command produced the output. If we get it from the most recent mtime, then we have to know the most recent input's mtime when the build command starts. That's #1753 and #1936. To know the most recent input's mtime is a performance bottleneck since we have to stat every known input in order to find that mtime. And if we don't know all the inputs, it won't be an accurate time anyway. Currently ninja doesn't track anything like that at all, it simply records the output's mtime after the command finishes. Like I said, this attempt (most recent input mtime) was tried in those PRs. You were willing to give them a shot, I'm asking you give this PR the same as it solves the problem better. Let's just let it hit master do it gets more exposure for awhile. |
Yes, but only because we always compare against that mtime. Couldn't ninja use the output mtime instead if it was comparing against an (at the time of the previous run) unknown input? |
You are now suggesting a build log version bump where we record both the output's actual mtime and a maybe-incorrect most recent input mtime in the build log? We would also have to be able to identify which of those two we should use based on context we don't have. Somehow we'd have to track whether the known input set changed for a particular output since the last run, and that gets ambiguous (not possible) with restat and generator rules which create inputs for other outputs, including order only dependencies that create inputs for other edges when they run. Doing it as you suggest is actually a more intrusive and scary change for a solution that wouldn't even provide the same safety as this PR. |
src/build.cc
Outdated
// XXX: this will block; do we care? | ||
for (vector<Node*>::iterator o = edge->outputs_.begin(); | ||
o != edge->outputs_.end(); ++o) { | ||
if (!disk_interface_->MakeDirs((*o)->path())) | ||
return false; | ||
if (build_start == -1) { | ||
disk_interface_->WriteFile(tmp_mtime_file_path_, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which command updates the mtime in RealDiskInterface::WriteFile
? Couldn't fwrite
just do nothing if it was passed an empty buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwrite will update the file to be empty if it is passed an empty buffer, it can't simply decide to do nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if it is already empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will still write the empty string to the file and update the mtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this documented / specified somewhere?
Wouldn't matter if we had real-world tests we could run on all systems :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.cppreference.com/w/c/io/fwrite
While it's ultimately up to the filesystem implementation to update the mtime if a file is written to, fwrite() does guarantee that the file is written to. We specifically take advantage of using the filesystem for acquiring an mtime because we compare this acquired time to the input mtimes on subsequent runs, which are presumably residing on the same filesystem (or one with times that are closely in sync).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If size or count is zero, fwrite returns zero and performs no other action.
Sounds like it doesn't write to the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, it's not fwrite()
that causes the mtime to be updated, it's the fopen(path, "w")
call that does so. The fact that we're writing 0 bytes is actually an optimization that means the fwrite()
call itself is a no-op (like you say).
fopen()
creates the file if it does not exist, or deletes (overwrites) the file if it already exists. Either way, mtime is updated.
} | ||
|
||
// Dirty if the output is older than the input. | ||
if (!used_restat && most_recent_input && output->mtime() < most_recent_input->mtime()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output->mtime()
is the command_start_time_
of the previous run, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, output->mtime()
is the actual mtime of the file on disk when this build was started (it's calculated the first time the output is considered when generating the build graph).
src/build_test.cc
Outdated
ASSERT_EQ(1u, fs_.files_created_.count("out 2.rsp")); | ||
ASSERT_EQ(1u, fs_.files_created_.count("out 3.rsp")); | ||
ASSERT_EQ(1u, fs_.files_created_.count(".ninja_tmp")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, I would rather remove the explicit check for ASSERT_EQ(files_created + 2, fs_.files_created_.size());
. This unit test shouldn't care.
I'm reluctant to merge this PR because I think it's an invasive change to how ninja works - after this we won't be comparing output mtimes anymore. As you notice I haven't fully understood all the edge cases, too, and don't want to do a revert again. There's also the problem that we don't have any good performance testing in place so the implications are probably small. Hopefully. For every use-case? Another thing that would need to change, but it's a bigger issue of ninja: The unit tests only use the Last but not least I always thought about adding a locking mechanism to ninja, so that two runs don't interfere with each other. Maybe we could re-use Another idea to think about would be to re-use
Why wouldn't it be possible? |
I understand. I know it's a big change, but with any good improvements like this, there's a bit of risk involved. It solves a number of issues that have been opened related to this problem and definitely makes ninja more resilient to real word usages. In this case, it will be still be comparing output mtimes, but what it compares against is sometimes the time the command that produced it began, and other times is exactly what it is now (mtime of the output when it finished which is the case for restat and generator rules).
This is true. The only way to gauge the true performance impact is to let it be used in as many real world situations as it can before a release is made with the change in it. If you have suggestions for how we can increase the confidence level in the performance impact, I'm all ears. My anecdotal evidence is from a Linux perspective where I daily run builds with 4000 edges that each have hundreds of inputs. There's no discernible performance impact vs current master.
Agreed. We can unit test all day long but if a real file system doesn't behave exactly the way a test author assumes, the unit tests won't capture real world behavior. System tests like these are typically more brittle as they will rely on factors outside the control of the test, but having some would be a good step forward in any case.
That seems like a good change. Maybe a follow up to this PR after it's merged?
Without actually trying this change, I can't say for certain, but after my experience with the various attempts to solve this problem, I know that it's not quite so simple as tracking the inputs for an edge. There are some unique rules that can exist that don't really mesh with simply checking whether inputs have changed. It might be possible. I do think l, however, that this change would be even more invasive than what this PR does, and would only sometimes help the resiliency. |
Thank you for being so patient with me :) Sorry it all takes so long!
Fix #1733 and let a script run on CI (although I'm not how sure how consistent performance is during a run on GitHub Actions - between runs it probably is comparable). |
3104540
to
a746bb1
Compare
No worries, thanks for seeing this through. I've been trying to get this problem fixed for about 5 years so I'm glad there's finally traction for it.
Are you making #1733 a prerequisite for getting this PR merged, or can we merge this, assuming we can get to #1733 before the next release? |
ping @jhasse -- any other comments before you get this merged in? |
a746bb1
to
1bb760f
Compare
1bb760f
to
02c0064
Compare
This has been thoroughly tested by the full cmake test suite and we've been using this patch at my place of work for nearly a year with zero issues. I know you wanted to release a new version before merging this so it could have maximal time in HEAD for testing, but it doesn't seem like a release is coming soon. Can we get this merged so we can get more exposure? I don't think this PR is going to have any discernible performance impact. |
02c0064
to
ba7a4f8
Compare
ba7a4f8
to
6a9edea
Compare
6a9edea
to
40d5f17
Compare
Looks like 1.11.0 was released recently. Congrats on the release! Looks like a great one. I rebased this PR on current master. Could we get it merged soon to let this bake as long as possible before the next release? Thanks @jhasse! |
Yes :) Lets get the filename right first though. |
I don’t have a preference whatsoever on the lock file name. Which do you prefer? |
Let's use In a future PR we would then write the PID to this file to check if it's a zombie file, right? I don't have any experience in how to do directory locking :D |
…he build runs When an edge starts to run, create a temporary lock file in the build directory, stat it, and cache its mtime. When the command finishes, use the temporary lock file's mtime from when the edge started running as the mtime that is recorded in the build log for each of the edge's output(s). Subsequent runs will use that as the mtime for the output(s). This provides robustness against inputs changing while the command itself is running. If an input is changed, the subsequent run will detect the output as dirty since its recorded mtime reflects when the build command began, not when the output was actually written to disk. Generator and restat rules are exempt from this and will continue to record their actual mtime on disk at the time the command finished in the build log (unless the restat rule cleans the output). This avoids potential infinite loops when the generator rule touches input dependencies of the output(s) or a restat rule intentionally changes implicit dependencies of its output.
40d5f17
to
a2b5e6d
Compare
Done.
On Linux at least, this is pretty easy. We can make ninja acquire a lock on this file before running, which would block other ninja processes from running concurrently. I'm not as familiar with the windows side of things but I'm sure something similar can be achieved. |
Some projects implement logic that invokes the build within a build (nested invocation). I'm not saying that's a good pattern, but I've seen that in real world projects more than once. It's often used to run some kind of test or utility task during the build. For such logic, if ninja blocks another ninja process from running in a build directory, then such child ninja invocations on the same build will never get a chance to run and will appear to hang. |
Good point. We could provide a flag to disable the check for such setups (and for speed). |
This is a better, faster approach than #1753 for providing robustness against inputs changing during the build. It fixes #1162 and #1940.
When recording the outputs' mtimes when the command finishes, use the
remembered temporary file's mtime from when the edge started for
recording in the build log. Subsequent runs will use that as the mtime
for the output.
This provides robustness against inputs changing while the command
itself is running. If an input is changed, the subsequent run will
detect the output as dirty since its recorded mtime reflects when the
build command began, not when the output was actually written to disk.