use an additional, project-local copy of dependency trees #14283

Closed
Tracked by #14265
andrewrk opened this issue Jan 13, 2023 · 19 comments
Labels
  • enhancement: Solving this issue will likely involve adding new logic or components to the codebase.
  • proposal: This issue suggests modifications. If it also has the "accepted" label then it is planned.
  • zig build system: std.Build, the build runner, `zig build` subcommand, package management

@andrewrk
Member

Extracted from #14265.

Terminology clarification:

  • A project is a directory of files, uniquely identified by their hash. Dependencies can export any number of artifacts and packages.
  • A dependency is a directed edge between projects. A project may depend on any number of projects. A project may be a dependency of any number of projects.
  • A package is a directory of files, along with a root source file that identifies the file referred to when the package is used with @import.
  • An artifact is a static library, a dynamic library, an executable, or an object file.

Currently, zig puts all fetched dependencies in the global zig cache, like this:

$GLOBAL_ZIG_CACHE/p/$DEPENDENCY_HASH/*

Then, dependencies are used directly from this directory, and shared among all projects.

This proposal is for zig build to additionally copy each dependency from the global cache into a project-local directory, like this:

$PROJECT_ROOT/zig-deps/$DEPENDENCY_NAME/*

A transitive dependency would look like this:

$PROJECT_ROOT/zig-deps/$NAME1/zig-deps/$NAME2/*

This would be similar to the node_modules directory from npm.

Motivations:

  • whether a build requires network access is independent from the state of the global cache system
  • it would be possible to wipe the global cache without forcing projects to re-fetch their dependencies. Similarly, adding GC or LRU eviction to the global cache would never delete dependencies that a particular project still needs.
  • it would be possible to wipe a project's dependencies without wiping the global cache
  • it is easier to find dependencies by name, and locally patch them to test changes
  • temporary patches to dependencies would affect only one project; not the entire system globally
  • it would become an option to commit the zig-deps directory into source control, or to distribute a tarball that includes the dependencies
  • better compile errors when the lines point to dependencies; instead of getting a hash in the file name, you get the package name

Downsides:

  • multiple copies of things on disk, wasting disk space
  • somebody's going to suggest symlinking and all sorts of complicated stuff to go along with it
  • an additional directory alongside zig-out and zig-cache: zig-deps.

Open question: where to store the hash? It's nice to use the dependency name instead of the hash for the directory name, but it does leave the problem of how zig build should detect whether a dependency needs to be updated or not. It can always recompute hashes, but it should not be recomputing hashes on every zig build. Ideally, it would be only one open() call to open the directory of a dependency and find out whether the desired hash is present or not.
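
A minimal sketch of one possible answer to the open question, assuming a convention that is not part of zig: an empty marker file, named after the hash, is written inside the dependency directory once the copy is complete. The marker prefix and the helper names below are hypothetical. Checking whether the desired hash is present then costs a single open() on the marker path.

```python
import os

# Hypothetical marker prefix; not something zig defines.
MARKER_PREFIX = ".zig-dep-hash-"


def record_hash(dep_dir: str, dep_hash: str) -> None:
    """Replace any stale marker with one for dep_hash (run after copying)."""
    for entry in os.listdir(dep_dir):
        if entry.startswith(MARKER_PREFIX):
            os.remove(os.path.join(dep_dir, entry))
    with open(os.path.join(dep_dir, MARKER_PREFIX + dep_hash), "w"):
        pass  # the file is intentionally empty; only its name matters


def is_up_to_date(dep_dir: str, wanted_hash: str) -> bool:
    """Probe with one open(): success means the wanted hash is installed."""
    try:
        fd = os.open(os.path.join(dep_dir, MARKER_PREFIX + wanted_hash), os.O_RDONLY)
    except FileNotFoundError:
        # Either the directory is missing or it holds a different hash.
        return False
    os.close(fd)
    return True
```

The marker only records which hash was installed. It does not notice local edits to the dependency, so forcing a full re-hash would still need a separate path.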

@andrewrk added the enhancement, proposal, and zig build system labels Jan 13, 2023
@andrewrk added this to the 0.11.0 milestone Jan 13, 2023
@deflock

deflock commented Jan 13, 2023

Is it correct that the idea is:

  • to have a global store/cache with flat dependencies under their hashes (with no transitive dependencies, except when they are committed into the VCS)
  • in zig build, to build a tree of all top-level dependencies plus all their transitive dependencies, refetch any that are missing, then recreate this tree in the local directory using copies/symlinks, and only then start building?

@nektro
Contributor

nektro commented Jan 13, 2023

This proposal is for zig build to additionally copy each dependency from the global cache into a project-local directory, like this:

$PROJECT_ROOT/zig-deps/$DEPENDENCY_NAME/*

A transitive dependency would look like this:

$PROJECT_ROOT/zig-deps/$NAME1/zig-deps/$NAME2/*

This approach for local dependencies is problematic, particularly on Windows, and the problem is very noticeable in the node ecosystem when using the default npm package manager, since that folder layout is how it sets up node_modules. Windows has a very short MAX_PATH length of only 260 characters, so a folder or file created with a longer path can become impossible to move, rename, delete, or otherwise operate on.

a flat $NAME-$HASH would likely be better to avoid that
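
To make the path-length concern concrete, here is a small sketch that compares the nested layout from the proposal with the flat $NAME-$HASH layout. The project root, dependency names, and hash used in the example are made up for illustration; only the 260-character limit comes from the comment above.

```python
import ntpath  # Windows path rules, usable on any platform

MAX_PATH = 260  # the classic Windows limit mentioned above


def nested_path(project_root, chain):
    """zig-deps/$NAME1/zig-deps/$NAME2/... as in the proposal."""
    parts = [project_root]
    for name in chain:
        parts += ["zig-deps", name]
    return ntpath.join(*parts)


def flat_path(project_root, name, dep_hash):
    """zig-deps/$NAME-$HASH, the flat alternative."""
    return ntpath.join(project_root, "zig-deps", f"{name}-{dep_hash}")


if __name__ == "__main__":
    # Hypothetical chain: 20 levels of transitive dependencies.
    deep = nested_path(r"C:\proj", ["dependency-name"] * 20)
    flat = flat_path(r"C:\proj", "dependency-name", "1220abcd")
    print(len(deep) > MAX_PATH, len(flat) > MAX_PATH)
```

The nested path grows with every level of transitive dependency, while the flat path has the same length no matter how deep the dependency sits in the graph.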

@ikskuh
Contributor

ikskuh commented Jan 13, 2023

a flat $NAME-$HASH would likely be better to avoid that

Yep, I agree on that. A flat namespace would support deduplication, make it clearer which packages are actual dependencies, and keep you aware of how many deps you actually have.

Duplicate names can be resolved by appending the hash, even if this would make it a bit weird for the user to debug. Another option would be to name transitive duplicates ${primary_dep}-${secondary_dep}, but only on conflict.

@motiejus
Contributor

an additional directory alongside zig-out and zig-cache: zig-deps.

Why not store the deps in the local zig cache? It is ephemeral anyway, and can be viewed as a cache (because they can be recomputed/re-downloaded).

multiple copies of things on disk, wasting disk space

This is not a problem on some modern filesystems (definitely on btrfs and xfs, there may be more) due to copy-on-write. As of some recent coreutils even cp does reflinking by default (equivalent to cp --reflink, but with graceful fallback if the FS does not support it).

Reflinking is not the default mode in zig (I have that in my backlog), but it will become the default at some point.

@andrewrk
Member Author

Why not store the deps in the local zig cache? It is ephemeral anyway, and can be viewed as a cache (because they can be recomputed/re-downloaded).

because of this:

it would become an option to commit the zig-deps directory into source control, or to distribute a tarball that includes the dependencies

@silversquirl
Contributor

One option to reduce disk space without symlinking (which causes all sorts of other issues and largely removes the benefit of this proposal) is to use hard links.

This is only possible on some operating systems, and only if the project and global cache are on the same drive, but it would solve the disk space problem with basically no added complexity.
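
A rough sketch of the hard-link idea with a copy fallback, for the cases where linking is not possible (different drive, unsupported filesystem). The function names are hypothetical and this is not how zig copies files; it only illustrates the fallback logic.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst; fall back to a regular copy if that fails."""
    try:
        os.link(src, dst)
        return "link"
    except OSError:
        # Cross-device link, unsupported filesystem, existing dst, etc.
        shutil.copy2(src, dst)
        return "copy"


def materialize(cache_dir, local_dir):
    """Mirror cache_dir into local_dir, file by file."""
    for root, _dirs, files in os.walk(cache_dir):
        rel = os.path.relpath(root, cache_dir)
        target = os.path.normpath(os.path.join(local_dir, rel))
        os.makedirs(target, exist_ok=True)
        for name in files:
            link_or_copy(os.path.join(root, name), os.path.join(target, name))
```

One caveat: a hard-linked file shares its data with the copy in the global cache, so editing it in place changes both. That works against the "locally patch a dependency" motivation unless the editing tool replaces the file rather than writing into it.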

@deflock

deflock commented Jan 13, 2023

Npm with a flat "tree" has a problem with importing transitive dependencies: given project -> packageA -> packageB, a flat structure lets you import packageB directly in project even if you haven't added packageB as a dependency in project's build.zig.zon.
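
A small sketch of how a flat layout could still refuse undeclared imports, assuming the resolver consults each project's declared dependency list instead of just looking in the flat directory. The `declared` and `installed` mappings and the `resolve` function are hypothetical and stand in for whatever the build system actually tracks.

```python
def resolve(importer, name, declared, installed):
    """Return the install path for `name` only if `importer` declared it."""
    if name not in declared.get(importer, ()):
        raise LookupError(f"{importer} does not declare a dependency on {name}")
    return installed[name]


# project -> packageA -> packageB, all installed side by side (flat).
declared = {"project": {"packageA"}, "packageA": {"packageB"}}
installed = {
    "packageA": "zig-deps/packageA",
    "packageB": "zig-deps/packageB",
}
```

With these tables, `resolve("project", "packageA", declared, installed)` succeeds, while `resolve("project", "packageB", declared, installed)` raises even though packageB is present on disk.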

I have a question: what will my workflow be if I need to patch a transitive dependency? If the local tree is created during zig build, does that mean I need to run zig build first with the original dependency, and only after that can I modify the dependency and re-run zig build?

@marler8997
Contributor

One idea could be that zig stores the compressed archives in the global cache and only extracts them when installing to a local project. This would minimize disk space usage and help ensure the original contents of the dependency remain intact (a developer can't accidentally modify the contents of the files in the global cache). It also means there's no temptation or path to using "symlinks" to the global cache files.
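
A sketch of that flow, assuming a hypothetical cache layout of $CACHE/$HASH.tar.gz (zig does not store archives this way today) and a made-up function name. The archive in the cache is only ever read; the extracted files live solely in the project-local directory.

```python
import os
import tarfile


def install_from_archive(cache_dir, dep_hash, dest_dir):
    """Extract $cache_dir/$dep_hash.tar.gz into dest_dir and return dest_dir."""
    archive = os.path.join(cache_dir, f"{dep_hash}.tar.gz")
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Pythons: refuse absolute paths, links outside dest, etc.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return dest_dir
```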

@motiejus
Contributor

On IRC I asked whether we should store decompressed archives in the global cache, so we can use more efficient means to decompress the file (I mentioned copy_file_range).

It turns out the supposedly more efficient ways to extract the file are not actually more efficient. GNU tar wins (unsurprisingly, it is well optimized), followed by Andrew's stdlib implementation, which uses pread/pwrite. sendfile and copy_file_range are a bit slower, definitely not worth the added complexity.

$ hyperfine --export-markdown table.md -r 5 -w 1 -p 'rm -fr ffmpeg' 'tar -xf ffmpeg.tar' 'tar -xf ffmpeg.tar.gz' './std-tar ffmpeg.tar' './std-tar ffmpeg.tar.gz' './maybe-faster sendfile ffmpeg.tar' './maybe-faster copy_file_range ffmpeg.tar'
Benchmark 1: tar -xf ffmpeg.tar
  Time (mean ± σ):     840.9 ms ±  34.8 ms    [User: 14.4 ms, System: 813.4 ms]
  Range (min … max):   800.5 ms … 890.5 ms    5 runs
 
Benchmark 2: tar -xf ffmpeg.tar.gz
  Time (mean ± σ):     900.8 ms ±  17.9 ms    [User: 294.7 ms, System: 858.3 ms]
  Range (min … max):   882.3 ms … 921.4 ms    5 runs
 
Benchmark 3: ./std-tar ffmpeg.tar
  Time (mean ± σ):     863.5 ms ±  19.1 ms    [User: 5.7 ms, System: 844.4 ms]
  Range (min … max):   838.1 ms … 886.8 ms    5 runs
 
Benchmark 4: ./std-tar ffmpeg.tar.gz
  Time (mean ± σ):      1.391 s ±  0.122 s    [User: 0.404 s, System: 0.974 s]
  Range (min … max):    1.179 s …  1.486 s    5 runs
 
Benchmark 5: ./maybe-faster sendfile ffmpeg.tar
  Time (mean ± σ):      1.089 s ±  0.049 s    [User: 0.005 s, System: 1.072 s]
  Range (min … max):    1.048 s …  1.143 s    5 runs
 
Benchmark 6: ./maybe-faster copy_file_range ffmpeg.tar
  Time (mean ± σ):      1.162 s ±  0.044 s    [User: 0.006 s, System: 1.092 s]
  Range (min … max):    1.121 s …  1.235 s    5 runs
 
Summary
  'tar -xf ffmpeg.tar' ran
    1.03 ± 0.05 times faster than './std-tar ffmpeg.tar'
    1.07 ± 0.05 times faster than 'tar -xf ffmpeg.tar.gz'
    1.29 ± 0.08 times faster than './maybe-faster sendfile ffmpeg.tar'
    1.38 ± 0.08 times faster than './maybe-faster copy_file_range ffmpeg.tar'
    1.65 ± 0.16 times faster than './std-tar ffmpeg.tar.gz'
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `tar -xf ffmpeg.tar` | 840.9 ± 34.8 | 800.5 | 890.5 | 1.00 |
| `tar -xf ffmpeg.tar.gz` | 900.8 ± 17.9 | 882.3 | 921.4 | 1.07 ± 0.05 |
| `./std-tar ffmpeg.tar` | 863.5 ± 19.1 | 838.1 | 886.8 | 1.03 ± 0.05 |
| `./std-tar ffmpeg.tar.gz` | 1390.6 ± 122.5 | 1178.6 | 1486.2 | 1.65 ± 0.16 |
| `./maybe-faster sendfile ffmpeg.tar` | 1088.6 ± 48.7 | 1047.6 | 1142.7 | 1.29 ± 0.08 |
| `./maybe-faster copy_file_range ffmpeg.tar` | 1161.7 ± 43.8 | 1121.2 | 1235.0 | 1.38 ± 0.08 |

Files:
std-tar.txt
maybe-faster.txt

Built with:

for f in maybe-faster.zig std-tar.zig; do zig build-exe -lc -OReleaseFast $f; done

@LordMZTE
Contributor

zig_cache

@andrewrk modified the milestones: 0.11.0, 0.12.0 Jul 20, 2023
@dravenk
Contributor

dravenk commented Mar 15, 2024

zig_cache

😂😂😂

zig fetch --global-cache-dir vendor --save https://github.com/andrewrk/mime/archive/refs/tags/1.0.0.tar.gz 

I recently started using zig and am investigating how to use a vendor directory to store all dependencies, like composer or cargo. I'm not sure if this is the right way to do it.

@dravenk
Contributor

dravenk commented Mar 15, 2024

zig fetch --global-cache-dir vendor xxx.tar.gz
zig build --global-cache-dir vendor

🤔🤔🤔 This does not build using vendor.

@andrewrk
Member Author

The zig-cache directory is intended to be excluded from source control.

@moosichu
Contributor

moosichu commented Jun 1, 2024

I've created a PR for this: #20150. At the moment, though, the package hash is just stored "as-is" in zig-deps. I think this has some limitations, and having thought about it, I'm thinking of implementing something like:

  • If the package has a manifest, use the manifest name for it in zig-deps:
    • If there is already a package with that name in zig-deps (but a different version), each folder (including the original conflicting one) is renamed to <manifest name>-<@version>, and if there are still hash differences, to <manifest name>-<@version>-<hash>.
  • If the package doesn't have a manifest, then we use the name provided in the build.zig.zon for those referenced directly by the root project. Otherwise we just use the hash for transitive dependencies, as there is no unique logical name that can be used.

This could potentially use a lock file or some such to keep track of things, with the ability to force hashes to be re-checked based on the folder contents of the packages.
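
The naming rules above could be sketched roughly as follows. This is an illustration only, not the code in the PR: the `Package` type and `folder_names` function are hypothetical, the input is assumed to contain each hash once, and the handling of a name clash involving a package without a manifest (append the hash) is an assumption the comment does not cover.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Package:
    pkg_hash: str
    manifest_name: Optional[str] = None
    version: Optional[str] = None
    zon_name: Optional[str] = None  # name in the root build.zig.zon (direct deps only)


def folder_names(packages):
    """Map each package hash to its folder name under zig-deps."""
    def base(p):
        # Manifest name first, then the root project's name for it, then the hash.
        return p.manifest_name or p.zon_name or p.pkg_hash

    by_base = Counter(base(p) for p in packages)
    by_versioned = Counter((base(p), p.version) for p in packages)

    names = {}
    for p in packages:
        b = base(p)
        if by_base[b] == 1:
            name = b  # no conflict: plain name
        elif p.manifest_name is None:
            name = f"{b}-{p.pkg_hash}"  # assumption: not specified in the comment
        elif by_versioned[(b, p.version)] == 1:
            name = f"{b}-{p.version}"  # same name, different version
        else:
            name = f"{b}-{p.version}-{p.pkg_hash}"  # same name and version
        names[p.pkg_hash] = name
    return names
```

Because every conflicting folder is renamed, including the one that was there first, the result does not depend on the order in which packages are discovered.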

I'm sure there are a fair few things I have failed to consider... but by addressing feedback on this I can hopefully construct a mergeable PR that solves all the problems that need to be solved.

I don't know if this is overcomplicating things relative to:

This proposal is for zig build to additionally copy each dependency from the global cache into a project-local directory, like this:

$PROJECT_ROOT/zig-deps/$DEPENDENCY_NAME/*

A transitive dependency would look like this:

$PROJECT_ROOT/zig-deps/$NAME1/zig-deps/$NAME2/*

This would be similar to the node_modules directory from npm.

But I do think the concern about how deep those paths could get on Windows is legitimate. And it would be nice to avoid copying packages locally within a repository at a minimum. (Although my stealth hot-take is that transitive dependencies are way more hassle than they are worth, and libraries should be really, really judicious about using them at all in the first place.)

@k0tran

k0tran commented Aug 16, 2024

Hi, I'm here with a real-world use case. I recently attempted to package zls for sisyphus. The thing is that sisyphus requires that the source tarball/git repo build without an internet connection. It would be nice to have something like cargo vendor or go mod vendor to add/update all dependencies at once and then proceed with an offline build.

@nektro
Contributor

nektro commented Aug 16, 2024

zig build --fetch will already do that, and is orthogonal to this feature request

edit: the issue was also that they were doing an HTTP request in their configure phase

@k0tran

k0tran commented Aug 17, 2024

zig build --fetch will already do that, and is orthogonal to this feature request

zig fetch (or zig build --fetch) uses the zig-cache directory. As Andrew said, it is not intended to be committed to the source tree. A zig-deps folder with all dependency names would be nice to have.

Though I admit it seems possible to build a zig project offline with --global-cache-dir, fetch and --system. I will open an issue in zls regarding the offline build.

@k0tran

k0tran commented Aug 17, 2024

After tinkering for a while, I found out that there is a flag, -Dversion_data_path, which can be set to a local langref.html.in. Thus zls can be built offline! Some reasonable names instead of hashes in the project tree would be nice though :)

@mlugg moved this to Proposals in Package Manager Aug 22, 2024
@andrewrk
Member Author

andrewrk commented Sep 7, 2024

Closing in favor of #20180. I think this use case is solved with a combination of that, plus some follow-up tooling, plus the --system flag that is already implemented.

@andrewrk closed this as not planned Sep 7, 2024