
Implement an equivalent to CCACHE_BASEDIR #35

Open
Tracked by #4
luser opened this issue Nov 30, 2016 · 28 comments
Comments

@luser
Contributor

luser commented Nov 30, 2016

CCACHE_BASEDIR allows ccache to get cache hits for the same source files stored at different paths, which is useful for developers building from different source trees:
https://ccache.samba.org/manual.html#_configuration_settings

We want this for two reasons: to get cross-branch cache hits on our buildbot builds, which have the branch name in their build directory, and to get cache hits from the S3 cache for local developers' builds.
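For illustration, the basedir idea boils down to a small path rewrite before hashing: paths under the base directory are made relative, so two checkouts at different locations produce the same hash input. This is only a sketch with hypothetical paths, not sccache's or ccache's actual implementation:

```rust
// Sketch: strip a configured basedir so equivalent paths hash the same.
// (Hypothetical paths; real basedir handling also covers preprocessor
// output and include directories.)
fn strip_basedir(path: &str, basedir: &str) -> String {
    match path.strip_prefix(basedir) {
        Some(rest) => rest.trim_start_matches('/').to_string(),
        None => path.to_string(),
    }
}

fn main() {
    let a = strip_basedir("/builds/branch-a/src/main.c", "/builds/branch-a");
    let b = strip_basedir("/builds/branch-b/src/main.c", "/builds/branch-b");
    assert_eq!(a, b); // both "src/main.c" -> identical hash input
    println!("{}", a);
}
```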

@posix4e
Contributor

posix4e commented Dec 18, 2016

Can't we always strip the cwd by default and make this an override? Maybe via yet another env var.

@gumpt

gumpt commented Apr 10, 2017

I am looking at this seriously, and after doing some cleanup of my newbie Rust, it should be ready for a pull request tonight.

@jwatt
Contributor

jwatt commented Nov 2, 2017

Do out-of-source object directories present a problem, given the way the Firefox source is currently built? Say I have SCCACHE_BASEDIR=/home/jwatt and the following directory structure:

home
  jwatt
    src
      mozsrc1
      mozsrc2
    obj
      mozsrc1-debug
      mozsrc2-debug

When compiling /home/jwatt/obj/mozsrc1-debug/layout/svg/SVGImageContext.cpp (or rather, the unified file that includes SVGImageContext.cpp), the compiler is invoked from /home/jwatt/obj/mozsrc1-debug/layout/svg with the include path -I/home/jwatt/src/mozsrc1/layout/svg. If that path is rewritten to the relative path -I../../../../src/mozsrc1/layout/svg, we are no better off, since the mozsrc1 component in the path will still prevent a cache hit when the same file is built in mozsrc2.

To avoid that it would seem like sccache would need to actually cd to SCCACHE_BASEDIR, rewrite the paths to be relative to that directory, and invoke the compiler from there (if that's possible without causing knock-on issues).
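The problem described above can be made concrete in a few lines: even after making the include paths relative to SCCACHE_BASEDIR, the tree name survives in the result, so the hash inputs still differ between the two source trees (paths taken from the comment above):

```rust
// Illustration: relative rewriting alone does not erase the tree name.
fn main() {
    let basedir = "/home/jwatt";
    let inc1 = "/home/jwatt/src/mozsrc1/layout/svg";
    let inc2 = "/home/jwatt/src/mozsrc2/layout/svg";
    let rel1 = inc1.strip_prefix(basedir).unwrap();
    let rel2 = inc2.strip_prefix(basedir).unwrap();
    // The rewritten paths still name mozsrc1 vs mozsrc2,
    // so the hashes still differ and there is no cache hit.
    assert_ne!(rel1, rel2);
}
```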

@ncalexan
Member

In #104 (comment), @glandium said,

Furthermore, due to how sccache is used for Firefox builds, it would be better to have several base directories (an arbitrary number of them).

That seems incompatible with @jwatt's approach in #35 (comment).

@glandium, can you explain what you were thinking, and comment on @jwatt's approach?

@MarshallOfSound mentioned this issue Apr 16, 2018
@magreenblatt

Regarding @ncalexan's comment above:

I'm involved in a project that builds both locally for developers and using automated builders and we would like to share the cache among all users.

On Windows we currently pass full include paths via the compiler command line. Automated builders download Visual Studio and the Windows SDK as an archive and extract it to a specific directory. Individual developers often have Visual Studio and the Windows SDK installed in the default system locations. Our source code is then downloaded in a completely separate directory.

For example, a value like this on the automated builders:

"-imsvcc:\\users\\User\\cache\\vs-professional-15.5.2-15063.468\\vc\\tools\\msvc\\14.12.25827\\atlmfc\\include"

Would be equivalent to this on a developer machine:

"-imsvcC:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Professional\\VC\\Tools\\MSVC\\14.12.25827\\atlmfc\\include"

Being able to specify multiple basedir values would allow us to share the cache without requiring everyone to install files in the same locations.
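The "multiple basedir values" idea could look like a list of prefix mappings applied before hashing, where every matching prefix collapses to one stable placeholder. A minimal sketch with hypothetical paths and a made-up `<MSVC>` placeholder (nothing here is sccache's actual API):

```rust
// Sketch: replace the first matching prefix with a stable placeholder
// so differently-installed toolchains hash identically.
fn remap_prefix(path: &str, mappings: &[(&str, &str)]) -> String {
    for (from, to) in mappings {
        if let Some(rest) = path.strip_prefix(from) {
            return format!("{}{}", to, rest);
        }
    }
    path.to_string()
}

fn main() {
    let mappings = [
        ("/opt/ci/msvc", "<MSVC>"),     // automated builder layout
        ("/home/dev/msvc", "<MSVC>"),   // developer machine layout
    ];
    let ci = remap_prefix("/opt/ci/msvc/atlmfc/include", &mappings);
    let dev = remap_prefix("/home/dev/msvc/atlmfc/include", &mappings);
    assert_eq!(ci, dev); // both "<MSVC>/atlmfc/include"
}
```

A real implementation would also have to deal with Windows case-insensitivity and backslashes, which this sketch ignores.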

@martinm82

On Linux you could solve that quite easily by keeping your build tools (including sccache) in a Docker image and mounting the sources into the container at the same location on all your machines.

@niosHD

niosHD commented May 29, 2018

To add another use case to this discussion: I am currently experimenting with building our internal libraries and tools with conan, and I just realized that effectively using sccache in this environment would need support for something like multiple base directories or path ignore patterns.

In short, when building a package foo/1.2.3@user/channel with conan, the source is located in $CONAN_HOME/data/foo/1.2.3/user/channel/source and the build directory is $CONAN_HOME/data/foo/1.2.3/user/channel/build. Given that these paths also affect the compiler flags (i.e., include paths), I only get cache hits when I build exactly the same package and version in the same namespace. Using $CONAN_HOME/data/<name>/<version>/<user>/<channel> as the base directory would enable cache hits across different versions of the package.

More than one base directory/ignore pattern is needed in the common case where the package additionally depends on other conan packages. For example, when foo/1.2.3@user/channel depends on bar/4.5.6@otheruser/otherchannel, additional include and link paths to $CONAN_HOME/data/bar/4.5.6/otheruser/otherchannel/package/<someSHA>/ are provided to the compiler. Again, these paths have to be ignored in order to enable sccache hits.
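Given the conan layout described above ($CONAN_HOME/data/&lt;name&gt;/&lt;version&gt;/&lt;user&gt;/&lt;channel&gt;/...), the normalization could collapse the variable components into a package placeholder before hashing. A sketch under that assumed layout (the `<pkg:...>` placeholder is invented here):

```rust
// Sketch: collapse conan's name/version/user/channel path components so
// different versions of the same package hash identically.
fn collapse_conan(path: &str, conan_home: &str) -> String {
    if let Some(rest) = path.strip_prefix(conan_home) {
        let parts: Vec<&str> = rest.trim_start_matches('/').split('/').collect();
        // Expect data/<name>/<version>/<user>/<channel>/...
        if parts.len() > 5 && parts[0] == "data" {
            return format!("<pkg:{}>/{}", parts[1], parts[5..].join("/"));
        }
    }
    path.to_string()
}

fn main() {
    let home = "/home/dev/.conan";
    let a = collapse_conan("/home/dev/.conan/data/foo/1.2.3/user/channel/source/x.c", home);
    let b = collapse_conan("/home/dev/.conan/data/foo/2.0.0/user/channel/source/x.c", home);
    assert_eq!(a, b); // both "<pkg:foo>/source/x.c"
}
```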

@doronbehar

Question: was it ever considered to use sha256 sums of the files instead of their full paths? I know this is a different approach than the one taken in the ccache project, but I think it's much more reliable. The idea is that, in the same way that every file path is mapped to an object in the cache, its sha256sum could be used instead.

@luser
Contributor Author

luser commented Jun 24, 2019

The primary problem is that we hash the commandline arguments to the compiler, and full source paths often wind up there. We did take a change in #208 to fix one issue around this, where full paths wound up in the C compiler's preprocessor output. I suspect we could fix this fairly easily for most cases nowadays--it'd mostly just mean filtering commandline arguments that contain full paths.
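The filtering mentioned here could be as simple as dropping path-carrying flags from the argument list before it is hashed (since the referenced files are hashed by content separately). A minimal sketch, using `-I` as an assumed example of a path-carrying flag:

```rust
// Sketch: drop path-carrying flags (here just -I) from the hash input.
// The headers they point at are covered by the preprocessor output hash.
fn filter_path_args(args: &[&str]) -> Vec<String> {
    args.iter()
        .filter(|a| !a.starts_with("-I"))
        .map(|a| a.to_string())
        .collect()
}

fn main() {
    let a = filter_path_args(&["-O2", "-I/home/alice/src/include", "-std=c++17"]);
    let b = filter_path_args(&["-O2", "-I/home/bob/src/include", "-std=c++17"]);
    assert_eq!(a, b); // identical hash input despite differing include paths
}
```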

@doronbehar

So does it mean that when sccache is wrapping GCC, the path arguments it is given are never absolute, and when it wraps rustc it may or may not be given absolute paths?

My suggestion is that, whether the paths are absolute or not, we can read the files given as input (assuming we can know the caller's working directory in case the paths are relative) and compute a sha256 sum of all of them. Then we can map this sha256 hash to the cached output, instead of mapping the command line arguments to the output. This way, whenever there's a match in the hash, the cache could be used.

I'm no Rust developer and I'm not familiar at all with the internals of this project so maybe my idea will be hard to implement but I hope my explanation is good enough.

BTW, implementing this idea would mean that users' existing caches will no longer be valid, so in a certain sense it'll be backwards incompatible, but not in usage.
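The content-addressed idea above can be sketched in a few lines: hash what the inputs contain, not where they live. Here std's DefaultHasher stands in for SHA-256 (which is not in the standard library); this is only an illustration of the keying scheme, not sccache code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch: derive the cache key from input file *contents*.
// DefaultHasher is a stand-in for SHA-256 here.
fn content_key(inputs: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for data in inputs {
        data.hash(&mut h);
    }
    h.finish()
}

fn main() {
    // The same bytes stored at two different paths get the same key.
    let src: &[u8] = b"int main() { return 0; }";
    let k1 = content_key(&[src]);
    let k2 = content_key(&[src]);
    assert_eq!(k1, k2);
}
```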

@luser
Contributor Author

luser commented Jun 25, 2019

@doronbehar It's a bit more complicated than that and I'd suggest you read the existing implementation first. We are already using the input files as part of the hash. For C/C++ compilation the hash_key function is the most interesting part:

sccache/src/compiler/c.rs

Lines 482 to 511 in 5855673

/// Compute the hash key of `compiler` compiling `preprocessor_output` with `args`.
pub fn hash_key(compiler_digest: &str,
                language: Language,
                arguments: &[OsString],
                extra_hashes: &[String],
                env_vars: &[(OsString, OsString)],
                preprocessor_output: &[u8]) -> String
{
    // If you change any of the inputs to the hash, you should change `CACHE_VERSION`.
    let mut m = Digest::new();
    m.update(compiler_digest.as_bytes());
    m.update(CACHE_VERSION);
    m.update(language.as_str().as_bytes());
    for arg in arguments {
        arg.hash(&mut HashToDigest { digest: &mut m });
    }
    for hash in extra_hashes {
        m.update(hash.as_bytes());
    }
    for &(ref var, ref val) in env_vars.iter() {
        if CACHED_ENV_VARS.contains(var.as_os_str()) {
            var.hash(&mut HashToDigest { digest: &mut m });
            m.update(&b"="[..]);
            val.hash(&mut HashToDigest { digest: &mut m });
        }
    }
    m.update(preprocessor_output);
    m.finish()
}

For Rust compilation the generate_hash_key function is what you want:

sccache/src/compiler/rust.rs

Lines 926 to 1129 in 5855673

fn generate_hash_key(self: Box<Self>,
                     creator: &T,
                     cwd: PathBuf,
                     env_vars: Vec<(OsString, OsString)>,
                     _may_dist: bool,
                     pool: &CpuPool)
                     -> SFuture<HashResult>
{
    let me = *self;
    let RustHasher {
        executable,
        host,
        sysroot,
        compiler_shlibs_digests,
        #[cfg(feature = "dist-client")]
        rlib_dep_reader,
        parsed_args:
            ParsedArguments {
                arguments,
                output_dir,
                externs,
                crate_link_paths,
                staticlibs,
                crate_name,
                crate_types,
                dep_info,
                emit,
                color_mode: _,
            },
    } = me;
    trace!("[{}]: generate_hash_key", crate_name);
    // TODO: this doesn't produce correct arguments if they should be concatenated - should use iter_os_strings
    let os_string_arguments: Vec<(OsString, Option<OsString>)> = arguments.iter()
        .map(|arg| (arg.to_os_string(), arg.get_data().cloned().map(IntoArg::into_arg_os_string)))
        .collect();
    // `filtered_arguments` omits --emit and --out-dir arguments.
    // It's used for invoking rustc with `--emit=dep-info` to get the list of
    // source files for this crate.
    let filtered_arguments = os_string_arguments.iter()
        .filter_map(|&(ref arg, ref val)| {
            if arg == "--emit" || arg == "--out-dir" {
                None
            } else {
                Some((arg, val))
            }
        })
        .flat_map(|(arg, val)| Some(arg).into_iter().chain(val))
        .map(|a| a.clone())
        .collect::<Vec<_>>();
    // Find all the source files and hash them
    let source_hashes_pool = pool.clone();
    let source_files = get_source_files(creator, &crate_name, &executable, &filtered_arguments, &cwd, &env_vars, pool);
    let source_files_and_hashes = source_files
        .and_then(move |source_files| {
            hash_all(&source_files, &source_hashes_pool).map(|source_hashes| (source_files, source_hashes))
        });
    // Hash the contents of the externs listed on the commandline.
    trace!("[{}]: hashing {} externs", crate_name, externs.len());
    let abs_externs = externs.iter().map(|e| cwd.join(e)).collect::<Vec<_>>();
    let extern_hashes = hash_all(&abs_externs, pool);
    // Hash the contents of the staticlibs listed on the commandline.
    trace!("[{}]: hashing {} staticlibs", crate_name, staticlibs.len());
    let abs_staticlibs = staticlibs.iter().map(|s| cwd.join(s)).collect::<Vec<_>>();
    let staticlib_hashes = hash_all(&abs_staticlibs, pool);
    let creator = creator.clone();
    let hashes = source_files_and_hashes.join3(extern_hashes, staticlib_hashes);
    Box::new(hashes.and_then(move |((source_files, source_hashes), extern_hashes, staticlib_hashes)|
                             -> SFuture<_> {
        // If you change any of the inputs to the hash, you should change `CACHE_VERSION`.
        let mut m = Digest::new();
        // Hash inputs:
        // 1. A version
        m.update(CACHE_VERSION);
        // 2. compiler_shlibs_digests
        for d in compiler_shlibs_digests {
            m.update(d.as_bytes());
        }
        let weak_toolchain_key = m.clone().finish();
        // 3. The full commandline (self.arguments)
        // TODO: there will be full paths here, it would be nice to
        // normalize them so we can get cross-machine cache hits.
        // A few argument types are not passed in a deterministic order
        // by cargo: --extern, -L, --cfg. We'll filter those out, sort them,
        // and append them to the rest of the arguments.
        let args = {
            let (mut sortables, rest): (Vec<_>, Vec<_>) = os_string_arguments.iter()
                // We exclude a few arguments from the hash:
                //   -L, --extern, --out-dir
                // These contain paths which aren't relevant to the output, and the compiler inputs
                // in those paths (rlibs and static libs used in the compilation) are used as hash
                // inputs below.
                .filter(|&&(ref arg, _)| {
                    !(arg == "--extern" || arg == "-L" || arg == "--out-dir")
                })
                // A few argument types were not passed in a deterministic order
                // by older versions of cargo: --extern, -L, --cfg. We'll filter the rest of those
                // out, sort them, and append them to the rest of the arguments.
                .partition(|&&(ref arg, _)| arg == "--cfg");
            sortables.sort();
            rest.into_iter()
                .chain(sortables)
                .flat_map(|&(ref arg, ref val)| {
                    iter::once(arg).chain(val.as_ref())
                })
                .fold(OsString::new(), |mut a, b| {
                    a.push(b);
                    a
                })
        };
        args.hash(&mut HashToDigest { digest: &mut m });
        // 4. The digest of all source files (this includes src file from cmdline).
        // 5. The digest of all files listed on the commandline (self.externs).
        // 6. The digest of all static libraries listed on the commandline (self.staticlibs).
        for h in source_hashes.into_iter().chain(extern_hashes).chain(staticlib_hashes) {
            m.update(h.as_bytes());
        }
        // 7. Environment variables. Ideally we'd use anything referenced
        // via env! in the program, but we don't have a way to determine that
        // currently, and hashing all environment variables is too much, so
        // we'll just hash the CARGO_ env vars and hope that's sufficient.
        // Upstream Rust issue tracking getting information about env! usage:
        // https://github.com/rust-lang/rust/issues/40364
        let mut env_vars: Vec<_> = env_vars.iter()
            // Filter out RUSTC_COLOR since we control color usage with command line flags.
            // rustc reports an error when both are present.
            .filter(|(ref k, _)| k != "RUSTC_COLOR")
            .cloned()
            .collect();
        env_vars.sort();
        for &(ref var, ref val) in env_vars.iter() {
            // CARGO_MAKEFLAGS will have jobserver info which is extremely non-cacheable.
            if var.starts_with("CARGO_") && var != "CARGO_MAKEFLAGS" {
                var.hash(&mut HashToDigest { digest: &mut m });
                m.update(b"=");
                val.hash(&mut HashToDigest { digest: &mut m });
            }
        }
        // 8. The cwd of the compile. This will wind up in the rlib.
        cwd.hash(&mut HashToDigest { digest: &mut m });
        // Turn arguments into a simple Vec<OsString> to calculate outputs.
        let flat_os_string_arguments: Vec<OsString> = os_string_arguments.into_iter()
            .flat_map(|(arg, val)| iter::once(arg).into_iter().chain(val))
            .collect();
        Box::new(get_compiler_outputs(&creator, &executable, &flat_os_string_arguments, &cwd, &env_vars).map(move |mut outputs| {
            if emit.contains("metadata") {
                // rustc currently does not report rmeta outputs with --print file-names
                // --emit metadata the rlib is printed, and with --emit metadata,link
                // only the rlib is printed.
                let rlibs: HashSet<_> = outputs.iter().cloned().filter(|p| {
                    p.ends_with(".rlib")
                }).collect();
                for lib in rlibs {
                    let rmeta = lib.replacen(".rlib", ".rmeta", 1);
                    // Do this defensively for future versions of rustc that may
                    // be fixed.
                    if !outputs.contains(&rmeta) {
                        outputs.push(rmeta);
                    }
                    if !emit.contains("link") {
                        outputs.retain(|p| *p != lib);
                    }
                }
            }
            let output_dir = PathBuf::from(output_dir);
            // Convert output files into a map of basename -> full path.
            let mut outputs = outputs.into_iter()
                .map(|o| {
                    let p = output_dir.join(&o);
                    (o, p)
                })
                .collect::<HashMap<_, _>>();
            let dep_info = if let Some(dep_info) = dep_info {
                let p = output_dir.join(&dep_info);
                outputs.insert(dep_info.to_string_lossy().into_owned(), p.clone());
                Some(p)
            } else {
                None
            };
            let mut arguments = arguments;
            // Always request color output, the client will strip colors if needed.
            arguments.push(Argument::WithValue("--color", ArgData::Color("always".into()), ArgDisposition::Separated));
            let inputs = source_files.into_iter().chain(abs_externs).chain(abs_staticlibs).collect();
            HashResult {
                key: m.finish(),
                compilation: Box::new(RustCompilation {
                    executable: executable,
                    host,
                    sysroot: sysroot,
                    arguments: arguments,
                    inputs: inputs,
                    outputs: outputs,
                    crate_link_paths,
                    crate_name,
                    crate_types,
                    dep_info,
                    cwd,
                    env_vars,
                    #[cfg(feature = "dist-client")]
                    rlib_dep_reader,
                }),
                weak_toolchain_key,
            }
        }))
    }))
}

@jcar87

jcar87 commented Jun 26, 2019

@luser thanks for pointing to the hash_key function in c.rs, as I'm also looking into this.

From a limited test run that I've performed, the contents of the arguments variable passed to hash_key don't have paths in them (such as -I, or the path to the input file).

Instead, I can see that on the function call:

sccache/src/compiler/c.rs

Lines 270 to 277 in 5855673

let key = {
    hash_key(&executable_digest,
             parsed_args.language,
             &parsed_args.common_args,
             &extra_hashes,
             &env_vars,
             &preprocessor_result.stdout)
};

only "common args" are passed. When debugging my case, I can see that these are flags such as warning flags, the optimization level, the -std C++ standard, and so on. None of these contain paths, although obviously that could be the case for the project I'm compiling.

Digging a bit deeper, I can see that the flags with paths in them wind up as preprocessor flags. And the input file itself is also handled differently. Am I correct to assume the following:

    1. The input file itself is hashed, rather than its location, so its location is not relevant for the purposes of hashing.
    2. The flags used for preprocessing are not hashed, but rather the preprocessed output is. In which case, I suspect things like __FILE__ macros expanding can end up resulting in a different hash when the same sources are in different places?

@MMcKester

Hi, I am wondering what the status of SCCACHE_BASEDIR is. I have two machines compiling the same project under different paths and was wondering how I can avoid cache misses.

@glandium
Collaborator

One way to avoid the problem altogether is to pass flags to your compiler that normalize file paths. e.g. -fdebug-prefix-map with GCC or --remap-path-prefix with rustc.

@MMcKester

But aren't the command lines hashed in sccache? The paths would still be different.

You mean like this?
sccache clang -c test.cpp --remap-path-prefix...

@MMcKester

Is there any update to this?

@monsdar

monsdar commented Sep 9, 2020

Another user wanting to use sccache with Conan reporting in, thanks @niosHD for doing some groundwork!

I guess a working (easy to activate?) Conan integration for sccache could be a very useful feature for many users. As Conan uses often-changing absolute build paths, but always in the same manner, it should be possible to implement this in a standard way.

Conan is also getting traction, so a feature like that could be a win-win for both tools.

@froydnj
Contributor

froydnj commented Sep 9, 2020

But aren't the command lines hashed in sccache? The paths would be still different

Presumably you'd also have to add special handling in sccache for those arguments so that, e.g., for -fdebug-prefix-map=OLD=NEW, you'd effectively wind up hashing -fdebug-prefix-map=NEW.
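The special handling suggested here could be a pre-hash rewrite of the argument, keeping only the NEW side of the mapping. A sketch of that idea (hypothetical paths; sccache does not currently do this):

```rust
// Sketch: rewrite -fdebug-prefix-map=OLD=NEW so only NEW is hashed.
fn hashable_form(arg: &str) -> String {
    if let Some(rest) = arg.strip_prefix("-fdebug-prefix-map=") {
        if let Some((_old, new)) = rest.split_once('=') {
            return format!("-fdebug-prefix-map={}", new);
        }
    }
    arg.to_string()
}

fn main() {
    let a = hashable_form("-fdebug-prefix-map=/home/alice/src=/src");
    let b = hashable_form("-fdebug-prefix-map=/home/bob/src=/src");
    assert_eq!(a, b); // both hash as "-fdebug-prefix-map=/src"
}
```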

Conan is also getting traction, so a feature like that could be a win-win for both tools.

Mozilla currently has very little interest in a feature like this, so Mozilla is unlikely to do development on this particular issue.

@Allen-Webb

I gave --remap-path-prefix a try and it causes everything to be a cache miss, likely because of this code:

take_arg!("--remap-path-prefix", OsString, CanBeSeparated('='), TooHard),

I think one possible way to solve this issue would be to implement path prefix support for sccache:

  1. Require the path prefix mapping to be reversible (two mappings shouldn't collide) for caching.
  2. Include the prefix mapping as needed in metadata for build reproducibility, but exclude the mapping from any hashes.
  3. Any paths included in hashes should be filtered through the prefix mapping.
  4. The prefix mapping would need to apply to outputs as well as inputs; this may require support/changes in rustc.
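Step 1 above is straightforward to check: a mapping is reversible when no two source prefixes map to the same target, so the original path can always be recovered. A minimal sketch of that invariant:

```rust
use std::collections::HashSet;

// Sketch of step 1: a prefix mapping is "reversible" iff all targets
// are distinct, so the rewrite can be undone unambiguously.
fn is_reversible(mapping: &[(&str, &str)]) -> bool {
    let mut seen = HashSet::new();
    mapping.iter().all(|&(_, to)| seen.insert(to))
}

fn main() {
    assert!(is_reversible(&[("/home/a", "/src"), ("/opt/tc", "/toolchain")]));
    // Two prefixes colliding on "/src" would make paths unrecoverable.
    assert!(!is_reversible(&[("/home/a", "/src"), ("/home/b", "/src")]));
}
```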

@luser
Contributor Author

luser commented Jan 15, 2021

FWIW, the distributed compilation code already uses --remap-path-prefix internally to make sure that paths in the compiler outputs match what would have been produced if the compilation had been run locally:

dist_arguments.push(format!("--remap-path-prefix={}={}", &dist_path, local_path));


@glance-

glance- commented May 22, 2023

We use --remap-path-prefix in all of our projects so we always produce reproducible builds. It's unfortunate that this prevents sccache from being useful.

@khssnv

khssnv commented Oct 9, 2023

Could anyone share how to use --remap-path-prefix to avoid the absolute paths problem when using sccache with local disk storage? Thanks!

@jeffparsons

Sccache would be really helpful for local builds of Rust projects, especially when you have a great many with common dependencies (🙋‍♂️) and especially in a team environment (🙋‍♂️).

Unfortunately the lack of this feature makes that quite painful in practice. So I'd love to better understand its status. In particular, is it stalled because:

  • There's just nobody with time to implement it who is also invested in this use case?
  • There are difficult design questions that don't yet have clear answers?
  • There's some kind of technical or philosophical blocker to it being accepted by the primary maintainers?

I'm just having trouble figuring out which one (or N) it is.

Thanks! 🙇‍♂️

@chevdor

chevdor commented May 8, 2024

Rust projects, especially when you have a great many with common dependencies (🙋‍♂️) and especially in a team environment (🙋‍♂️).

While sccache is awesome, there are a few caveats with Rust and it is not all 🦄 unfortunately.
I think it is good to be aware of those before hoping it solves all your problems (sorry 🤗).

In a nutshell, unless you take good care of aligning things, sccache will likely not work as you'd expect if:

  • members of your team use different OSes
  • members of your team have different user accounts (I bet that is the case...)
  • even using the same OS, it will likely NOT work unless you take special measures, on macOS for instance (i.e. creating a hard link and letting Rust use that location for all members)
  • you use a mix of rustc versions, in which case sccache will not help

That's a lot of caveats, and in the meantime the Rust compiler has improved a lot. Nowadays, unless you have a very specific env that ticks all the boxes above, you will likely be better off not using sccache.
I am saying this based on tests I made in 2021, testing locally, using macOS, with Redis, minio and memcached as backends.

That may explain the low activity.

In the end, if sccache works for you, great; if not, you can also check out cargo remote, which can also help for teams.

@jeffparsons

In a nutshell and unless you take good care of aligning things, sccache will likely not work as you'd expect if:

In my case all these things are in fact aligned except for user accounts and the other reasons for paths to differ. Is there a problem with having different user accounts other than it making paths different?

As for differing paths, I had thought that is what this whole issue is about. Perhaps I misunderstood?

I'm thinking of making my own rustc wrapper that takes care of setting up an environment for sccache (files in a consistent location, maybe even doing a chroot or something if necessary) that ensures a cache hit whenever it should be possible. It might end up being the simplest path forward, as it sidesteps the questions about what is the correct way to handle all the messy details of path remapping.

@chevdor

chevdor commented May 8, 2024

In my case all these things are in fact aligned except for user accounts and the other reasons for paths to differ. Is there a problem with having different user accounts other than it making paths different?

Yes, this is why I mentioned it :)
The cache won't hit: e.g. Alice has her projects under $HOME/foo and Bob has his under $HOME/foo, and $HOME differs between them.
On Linux, it is rather simple (technically) if you can align all users on keeping their projects in something like /path/to/projects and not $HOME/some/path.

On macOS it is less trivial, since macOS does NOT let users create hard links, especially not at the root, unless you do some trickery that can be dangerous (i.e. temporarily brick your disk, ask me how I know...).

I'm thinking of making my own rustc wrapper ...

That sounds interesting and I would love to hear about the journey. I definitely do NOT want to discourage you, just to set expectations, so you know that a few things that sound trivial... will not be. I am not saying you cannot get it to work if indeed your team is not too "wildly spread" regarding how the envs are set up.

the simplest path forward,...

One of the main issues you will face is that Alice and Bob will not always use the same Rust version. They may even both be on nightly, but Alice updates in the morning and Bob in the evening. That's however still an easy and good case: Alice will build up the cache and Bob will benefit.

Issues arise if, for whatever reason, those users decide to use different versions, and it is hard to force a team (presumably working on several projects...) to use a unified and synchronized version of Rust. Not impossible, and some tools can help, but it is often not the case by default.

@jeffparsons

That sounds interesting and I would love to hear about the journey.

Well, I guess if I have any success you're bound to hear about it. 😀 (But yes, I would report back here as well.) I have some experience building vaguely related sorts of tools, so I expect complications but am also confident I could find ways around them. It's more a question of getting around to it... (very limited time right now)

Rust version

In my case this is simple — pretty much everything is in a monorepo with a single pinned version that gets bumped by a bot (via a pull request) when a new stable Rust release comes out. We used to have some exceptions, but I'm pretty sure we deleted the last one recently. In this regard at least, you could say that we are playing on easy mode due to the rigid homogeneity of our dev environments.
