Avoid rebuilding a project when cwd changes #4788

Merged 2 commits on Dec 12, 2017

@alexcrichton alexcrichton commented Dec 7, 2017

This commit is targeted at solving a use case which typically comes up during CI
builds -- the target directory is cached between builds but the cwd of the
build changes over time. For example the following scenario can happen:

  1. A project is compiled at /projects/a.
  2. The target directory is cached.
  3. A new build is started in /projects/b.
  4. The previous target directory is restored to /projects/b.
  5. The build starts, and Cargo rebuilds everything.

The last piece of behavior is indeed unfortunate! Cargo's internal hashing
currently isn't that resilient to changing cwd and this PR aims to help improve
the situation!

The first point of too-much-hashing came up with Target::src_path. Each
Target was hashed and stored for all compilations, and the src_path field
was an absolute path on the filesystem to the file that needed to be compiled.
This path then changed over time when cwd changed, but otherwise everything else
remained the same!

This commit updates the handling of the src_path field to simply ignore it
when hashing. Instead the path we actually pass to rustc is later calculated and
then passed to the fingerprint calculation.
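
As a rough illustration of "ignore it when hashing" (a minimal sketch, not the actual Cargo code): the wrapper name `NonHashedPathBuf` comes up later in this review, but its fields, the simplified `Target` struct, and the `hash_of` helper here are assumptions made up for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Illustrative wrapper: stores a path but contributes nothing to a hash.
#[derive(Clone, Debug, PartialEq, Eq)]
struct NonHashedPathBuf {
    path: PathBuf,
}

impl Hash for NonHashedPathBuf {
    fn hash<H: Hasher>(&self, _state: &mut H) {
        // Intentionally empty: the absolute path changes when the project
        // directory moves, so it must not influence the hash.
    }
}

/// Simplified stand-in for a build target (hypothetical fields).
#[derive(Hash)]
struct Target {
    name: String,
    src_path: NonHashedPathBuf,
}

/// Helper (assumed for this example) that hashes any value to a u64.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

With this shape, two `Target` values that differ only in `src_path` (say `/projects/a/src/lib.rs` versus `/projects/b/src/lib.rs`) hash identically, while a change to `name` still changes the hash.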

The next problem this fixes is that the dep info files were augmented after
creation to have the cwd of the compiler at the time to find the files at a
later date. This, unfortunately, would cause issues if the cwd itself changed.
Instead the cwd is now left out of dep-info files (they're no longer augmented)
and instead the cwd is recalculated when parsing the dep info later.

The final problem that this commit fixes is actually an existing issue in Cargo
today. Right now you can actually execute cargo build from anywhere in a
project and Cargo will execute the build. Unfortunately though the argument to
rustc was actually different depending on what directory you were in (the
compiler was invoked with a path relative to cwd). This path ends up being used
for metadata like debuginfo which means that different directories would cause
different artifacts to be created, but Cargo wouldn't rerun the compiler!

To fix this issue the matter of cwd is now entirely excluded from compilation
command lines. Instead rustc is unconditionally invoked with a relative path
if the path is underneath the workspace root, and otherwise it's invoked as an
absolute path (in which case the cwd doesn't matter).
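
A minimal sketch of that rule, assuming a free function named `rustc_path_arg` (a hypothetical name for this example) that takes the workspace root and an absolute source path:

```rust
use std::path::{Path, PathBuf};

/// Returns the path to hand to rustc plus, when that path is relative,
/// the directory it is relative to. Hypothetical helper for illustration.
fn rustc_path_arg(ws_root: &Path, src: &Path) -> (PathBuf, Option<PathBuf>) {
    match src.strip_prefix(ws_root) {
        // Under the workspace root: pass a root-relative path.
        Ok(rel) => (rel.to_path_buf(), Some(ws_root.to_path_buf())),
        // Outside the workspace root: fall back to the absolute path.
        Err(_) => (src.to_path_buf(), None),
    }
}
```

The point of the design is that the first element no longer depends on where the workspace lives: a project at `/projects/a` and the same project moved to `/projects/b` both yield `src/lib.rs`.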

Once all these fixes were added up it means that now we can have projects where
if you move the entire directory Cargo won't rebuild the original source!

Note that this may be a bit of a breaking change, however. This means that the
paths in error messages for cargo will no longer be unconditionally relative to
the current working directory, but rather relative to the root of the workspace
itself. Unfortunately this is moreso of a feature right now rather than a bug,
so it may be one that we just have to stomach.

Closes #3273

@rust-highfive

r? @matklad

(rust_highfive has picked a reviewer for you, use r? to override)

Member

@matklad matklad left a comment

The implementation looks good to me, and I think "purge CWD, use paths relative to a fixed position" is the right approach, but I am not sure that the "workspace root heuristic" is a good enough implementation. It's certainly a heuristic because some workspace members may be outside of the workspace root, and some non-members may be inside it. Not sure if we need a 100% correct implementation here though.

if fs_try!(f.read_until(0, &mut cwd)) == 0 {
    return Ok(None)
}
let cwd = util::bytes2path(&cwd[..cwd.len()-1])?;
Member

Am I correct that new versions of Cargo will fail to parse old depinfo, which is actually OK, because it'll cause only a rebuild, and not a build failure?

Member Author

Hm I'm not actually sure what will happen but the hashes are all changing so I don't think newer cargo will use the same files from older cargo.

let ws_root = cx.ws.root();
let src = unit.target.src_path();
assert!(src.is_absolute());
match src.strip_prefix(ws_root) {
Member

I am not too happy about this solution, because there's no guarantee that workspace members reside below workspace root (see also #4787 (comment) :) ).

I think the ideal behavior here is to use all explicit workspace members which lie outside of the root package as anchors for paths as well. Not sure if we need to support this right from the start though.

fn hash<H: Hasher>(&self, _: &mut H) {
// ...
}
}
Member

A slightly more elaborate design here would be to introduce a notion of an explicitly relative path and track paths relative to the workspace root and to other workspace members outside the root. Sort of like

use std::path::PathBuf;

struct PathRoots {
  // a number of anchors, such as current workspace dir, a particular workspace member, user's home directory,
  // CARGO_HOME, etc. This structure is "global" for cargo and is stored in Workspace/config.
  roots: Vec<PathBuf>,
}

struct RelativePath {
  root_idx: usize, // index in the global PathRoots
  rel_path: PathBuf,
}

impl RelativePath {
  // Rebuild the full path by joining the relative part onto its anchor.
  fn to_path(&self, roots: &PathRoots) -> PathBuf {
    roots.roots[self.root_idx].join(&self.rel_path)
  }
}

Not sure that we need this, but this can cope with workspaces which are not entirely under a workspace root. (Or perhaps it was a mistake to allow non-subdirectory members?).

@alexcrichton
Member Author

@matklad thanks for taking a look! It's true yeah that workspace members outside the workspace root do pose a problem. That being said I wasn't really sure how we'd handle them at all :(

I originally had a much more complicated patch which entirely refactored the src_path field into a "root" and an "extra" like you mentioned (only hashing the extra, not the root) but that ended up not working for a whole number of reasons.

So thinking about it... Right now the workspace root comes up in a few locations:

  • When we invoke rustc we give it a relative or non-relative path. This is actually pretty important for debuginfo and backtraces and such (so everything doesn't show up as src/lib.rs).
  • When we parse and manage the dep-info files

I think it's a pretty reasonable assumption to assume all rust files for a crate are within the same folder, so I think we could pretty easily adapt the dep-info files to only list files relative to the package root rather than the workspace root (basically do some serious postprocessing).

For invocations of rustc though I'm not sure what we could do. I think (especially due to the existing bug in Cargo) we need to pick a root path to have and to me the workspace root seems the most obvious, but it's true that it then means you have to rename the entire workspace at once and if you have a crate outside of a workspace you basically just can't rename it. (or we could figure out a scheme where we pass things like ../foo/src/lib.rs to rustc, aka figuring out relative paths with ..)

What do you think? Maybe we should fix up dep-info parsing to work without using ws.root() and leave the "what path do we provide to rustc" for later?

@matklad
Member

matklad commented Dec 7, 2017

Hm, the part about rustc is interesting from the IDE point of view: we need to linkify stack traces, so it's important that stack traces use absolute paths or relative paths with a known root. Semantically, the proper solution seems to be to root everything at the workspace, and use paths exactly as the user specified them.

So, if you write members = ["../foo", "/home/workspace/bar"], ../foo and /home/workspace/bar end up in the compiled binary.

> I think it's a pretty reasonable assumption to assume all rust files for a crate are within the same folder, so I think we could pretty easily adapt the dep-info files to only list files relative to the package root rather than the workspace root (basically do some serious postprocessing).

Hm, depinfo is produced by the compiler, so in theory solving the problem with rustc should automatically solve the depinfo problem?

How hard is it to implement the .. approach?

@matklad
Member

matklad commented Dec 7, 2017

> I think it's a pretty reasonable assumption to assume all rust files for a crate are within the same folder, so I think we could pretty easily adapt the dep-info files to only list files relative to the package root rather than the workspace root (basically do some serious postprocessing).

This assumption is also pretty important for IDEs, because we need to know which stuff to index, so +1 to making it an official and documented requirement :)

Though, I can imagine someone doing weird stuff like include_str!("/etc/passwd"). We should probably not flat out break for such cases, but weird stuff like not doing a rebuild looks fine to me.

@alexcrichton
Member Author

@matklad ok I've amended the first commit and pushed up another one. This should reduce the reliance on ws.root() except in one location. Everything dealing with fingerprints/dep-info doesn't deal with ws.root() at all and should now be improved to be resilient to renames.

The one usage of ws.root() is purely about how we actually invoke rustc itself. Let's say we have a project like:

foo/rust/Cargo.toml // <- workspace root
foo/rust/bar/Cargo.toml
foo/third-party/more-rust // <- workspace member

With this PR as-is if you rename the rust directory to a different name (like rust2) then Cargo should avoid rebuilding everything. Where this PR as-is falls down, however, is if you rename the foo directory (let's say that's the overarching git repo for the whole project). If that happens then Cargo will rebuild everything because the more-rust crate is compiled as rustc /path/to/foo/third-party/... and renaming /path/to/foo would cause the output artifact to change.

I believe the fix for this would be to change how Cargo invokes rustc, instead invoking rustc ../third-party/more-rust/... which makes it possible at all for Cargo to avoid recompilation when the foo directory is renamed.

I also believe that such a fix would purely involve local modification to this logic. I think though that this may be rare enough that we may want to punt on this for now?
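
For reference, the `..`-style path could be computed roughly like this. This is only a sketch under assumptions: the function name `relative_from` is made up, both inputs are taken to be absolute and already normalized (no `.` or `..` components, no symlink handling), and it is not code from this PR.

```rust
use std::path::{Component, Path, PathBuf};

/// Expresses `target` relative to `base`, climbing with `..` where needed.
/// Assumes both paths are absolute and already normalized.
fn relative_from(base: &Path, target: &Path) -> PathBuf {
    let base: Vec<Component> = base.components().collect();
    let target: Vec<Component> = target.components().collect();

    // Length of the shared leading prefix.
    let common = base
        .iter()
        .zip(target.iter())
        .take_while(|(a, b)| a == b)
        .count();

    let mut out = PathBuf::new();
    // One `..` for every component of `base` below the shared prefix.
    for _ in common..base.len() {
        out.push("..");
    }
    // Then descend into the remainder of `target`.
    for component in &target[common..] {
        out.push(component.as_os_str());
    }
    out
}
```

With the layout above, `relative_from("/path/to/foo/rust", "/path/to/foo/third-party/more-rust/src/lib.rs")` gives `../third-party/more-rust/src/lib.rs`, which stays the same when `/path/to/foo` is renamed.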

Does that make sense? Curious what you think!

assert!(src.is_absolute());
match src.strip_prefix(ws_root) {
Ok(path) => (path.to_path_buf(), Some(ws_root.to_path_buf())),
Err(_) => (src.to_path_buf(), None),
Member

Returning None here means that rustc will use a pretty much arbitrary cwd, is that right? Maybe it's better to lock this down to ws_root even in this case, just to get a bit more determinism? Then we can return just PathBuf here and avoid that comment about hashing only .0.

Member Author

Certainly plausible! I was originally thinking that this wouldn't matter but I think with plugins nowadays it could matter for sure. I'll change.

That being said though I don't think we can avoid hashing the second field of the return value because ws_root changes over time (if the whole dir is renamed) but we don't want that to cause recompiles.

Member

> That being said though I don't think we can avoid hashing the second field of the return value because ws_root changes over time (if the whole dir is renamed) but we don't want that to cause recompiles.

The idea is that if we lock down cwd, then we don't need to return a pair here, we can return just the path.
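
Roughly, "locking down cwd" could look like the following sketch. The helper name `rustc_command` and its parameters are assumptions for illustration; only the standard library's `std::process::Command` API is used, and nothing is spawned here.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (but does not run) a rustc invocation whose working directory
/// is always set explicitly, so the source argument can stay a plain path.
fn rustc_command(cwd: &Path, src_arg: &Path) -> Command {
    let mut cmd = Command::new("rustc");
    // Fix the working directory instead of inheriting whatever directory
    // the user happened to run the build from.
    cmd.current_dir(cwd).arg(src_arg);
    cmd
}
```

Because the working directory is always pinned, the caller only needs to compute and hash a single path rather than a (path, optional cwd) pair.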

/// when it was invoked.
///
/// The serialized Cargo format will contain a list of files, all of which are
/// relative if they're under `root`, or absolute if they're elsewhere.
Member

Hm, weren't depinfo files, produced by Cargo, used by some other tools? #3557

So, this change will break such clients :(

Member Author

Oh sure!

I think though (at least the intention) that these files are just those in .fingerprint, so the ones like target/debug/foo.d should still be present unmangled from rustc itself.

@matklad
Member

matklad commented Dec 11, 2017

Yeah, so I think anchoring to the package root in most places is the correct thing to do, but I still have a couple of questions (left inline as well):

  1. Looks like dep-info files have external clients, so we can't just change the format as we see fit?
  2. Let's always lock down rustc cwd to some value?
  3. Implementation-wise, I'm not really thrilled by the NonHashedPathBuf. If it's simple enough to just use relative paths for targets, I would probably have gone that way.

@alexcrichton
Member Author

For the last point about NonHashedPathBuf I actually did try to go down that road initially (it actually required a huge amount of refactoring in toml/targets.rs) but it ended up not panning out unfortunately. When we construct a Target I don't think we know enough about the workspace to store workspace-relative paths is sort of the tl;dr at this point.

@matklad
Member

matklad commented Dec 12, 2017

> I don't think we know enough about the workspace to store workspace-relative paths is sort of the tl;dr at this point

Targets should be package relative, and not workspace relative. In theory, we do have necessary information:

  1. either the target has an explicitly declared path = /foo/bar/baz.rs in Cargo.toml, in which case we should use this path as is (it can be either relative or absolute, and we should not change that).

  2. or this is an implicit target, which we search by checking ./src/lib.rs, etc. In such case, we should be able to get the relative path out.

But I totally understand that it might be hard to actually implement.

So yeah, I think it might be worth it to look into package-relative paths for targets and locking down cwd for rustc, but, otherwise, r+ at will, this looks great! :)

This commit alters the format of the dependency info that Cargo keeps track of
for each crate. In order to be more resilient against directory renames and such
Cargo will now postprocess the compiler's dep-info output and serialize into its
own format. This format is intended to primarily list relative paths *to the
root of the relevant package* rather than absolute or relative to some other
location. If paths aren't actually relative to the package root they're still
stored as absolute, but there's not much we can do about that!
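
A minimal sketch of that postprocessing idea, assuming two hypothetical helpers (`to_stored` and `from_stored`, names invented for this example) and leaving out the on-disk serialization entirely:

```rust
use std::path::{Path, PathBuf};

/// Converts a dependency path reported by the compiler into the stored
/// form: package-relative when possible, otherwise left absolute.
fn to_stored(pkg_root: &Path, dep: &Path) -> PathBuf {
    dep.strip_prefix(pkg_root)
        .map(Path::to_path_buf)
        .unwrap_or_else(|_| dep.to_path_buf())
}

/// Resolves a stored path against the package root as it is *now*.
/// `Path::join` returns its argument unchanged when that argument is
/// absolute, so absolute entries pass through untouched.
fn from_stored(pkg_root: &Path, stored: &Path) -> PathBuf {
    pkg_root.join(stored)
}
```

A file recorded while the package lived at `/projects/a` resolves correctly after the package moves to `/projects/b`, whereas something like `/etc/passwd` stays absolute in both directions.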
@alexcrichton
Member Author

Oh sure yeah targets are package relative but we still don't want to hash the src_path field in Target (no matter what we do). That field actually has no bearing on the output artifact as it's never factored in. What actually matters, the path to rustc, is the one we hash.

It was after I ended up realizing that when I jettisoned the ~500 lines of changes to get the refactoring done as it ended up not being used in the end :(

In any case, thought I already did it, but just pushed up the changes for cwd from the package in non-workspace situations

@matklad
Member

matklad commented Dec 12, 2017

@bors r+

@bors
Contributor

bors commented Dec 12, 2017

📌 Commit f688e9c has been approved by matklad

@bors
Contributor

bors commented Dec 12, 2017

⌛ Testing commit f688e9c with merge 4005bc4...

bors added a commit that referenced this pull request Dec 12, 2017
Avoid rebuilding a project when cwd changes

@bors
Contributor

bors commented Dec 12, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: matklad
Pushing 4005bc4 to master...

kngwyu added a commit to kngwyu/flycheck that referenced this pull request Mar 7, 2018
This commit fixes flycheck#1397.
By the recent change of cargo(rust-lang/cargo#4788),
flycheck can't detect the error file.
So I changed to use 'workspace_root' in 'cargo metadata's outputs,
as working directory.
If the user's cargo is a bit old and there's no 'workspace_root',
flycheck-rust-manifest-directory search a directory with 'Cargo.toml',
as ever.
fmdkdd pushed a commit to kngwyu/flycheck that referenced this pull request Mar 13, 2018
Due to a recent change to cargo [1], in a workspace setting (multiple
crates) filenames are now relative to the workspace root, whereas
previously they were relative to the crate.

This commit uses `cargo metadata` to figure out the workspace root, if
it exists.  If it doesn't, it fallbacks on the manifest directory.

[1]: rust-lang/cargo#4788
aeggenberger pushed a commit to aeggenberger/flycheck that referenced this pull request Mar 14, 2018
Due to a recent change to cargo [1], in a workspace setting (multiple
crates) filenames are now relative to the workspace root, whereas
previously they were relative to the crate.

This commit uses `cargo metadata` to figure out the workspace root, if
it exists.  If it doesn't, it fallbacks on the manifest directory.

[1]: rust-lang/cargo#4788
@ehuss ehuss added this to the 1.24.0 milestone Feb 6, 2022
Successfully merging this pull request may close these issues.

Renaming the project root dir invalidates compilation cache