Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster incremental imports #46

Conversation

brentleyjones
Copy link
Contributor

  • "Only transfer units if they are newer" is an alternative implementation of Only rewrite unit files when outdated #32.
  • "Allow specifying direct paths to files to import" allows directly specifying which paths have changed for even more granularity. It also strides on unit paths instead of indexstores, which in our case (a single indexstore) allowed near full parallelism (4x faster for 4 cores, stride of 64 with 4568 files).

@DavidGoldman
Copy link
Contributor

Maybe we should remove the per-index store parallelism in favor of the unit path parallelism? Also, have you tested in making the stride boil down to N threads (e.g. if you have 20 files, 4 threads, stride = 20 / 4 =5)?

Also, I think the Only transfer units if they are newer feature should be optional (behind a setting), WDYT?

@brentleyjones
Copy link
Contributor Author

Maybe we should remove the per-index store parallelism in favor of the unit path parallelism?

For sure. It might make sense for that to be a subsequent PR, since the code around unit path parallelism depends on the fact that I'm passing in the direct paths.

Also, have you tested in making the stride boil down to N threads (e.g. if you have 20 files, 4 threads, stride = 20 / 4 =5)?

I did not test that yet.

Also, I think the Only transfer units if they are newer feature should be optional (behind a setting), WDYT?

Yeah, I could be behind that, as long as it was the default.

@DavidGoldman
Copy link
Contributor

DavidGoldman commented Jun 18, 2020

Maybe we should remove the per-index store parallelism in favor of the unit path parallelism?

For sure. It might make sense for that to be a subsequent PR, since the code around unit path parallelism depends on the fact that I'm passing in the direct paths.

Also, have you tested in making the stride boil down to N threads (e.g. if you have 20 files, 4 threads, stride = 20 / 4 =5)?

I did not test that yet.

Worth testing, have a feeling it might be the most optimal.

Also, I think the Only transfer units if they are newer feature should be optional (behind a setting), WDYT?

Yeah, I could be behind that, as long as it was the default.

How are you planning on using this? For our integration I was planning on having per-target/source index zip files, which we re-import (after extracting into a single index store dir) if they have changed. I guess this approach makes sense as long as you're having Bazel build this index locally (so the timestamps are set when it is built)?

@brentleyjones
Copy link
Contributor Author

How are you planning on using this? ... I guess this approach makes sense as long as you're having Bazel build this index locally (so the timestamps are set when it is built)?

We ensure that all of our indexes are in a single store dir (via https://github.com/target/rules_swift/commit/dc0c9ef0db74e24895d06cf27ade0321371e3ecf for incremental builds and copying them from individual targets if downloading from a cache). So for us the timestamps are correct since they are the result of a fresh build.

@DavidGoldman
Copy link
Contributor

How are you planning on using this? ... I guess this approach makes sense as long as you're having Bazel build this index locally (so the timestamps are set when it is built)?

We ensure that all of our indexes are in a single store dir (via target/rules_swift@dc0c9ef for incremental builds and copying them from individual targets if downloading from a cache). So for us the timestamps are correct since they are the result of a fresh build.

Ah I see, so you produce the index store as a side-effect, completely external to Bazel, even on CI? When you initially download from the cache you preserve timestamps on all of the unit/record files?

@brentleyjones
Copy link
Contributor Author

Ah I see, so you produce the index store as a side-effect, completely external to Bazel, even on CI?

We produce the indexes via Bazel, but that patch allows normal remote-cache downloading of the indexes, and when that happens we move those indexes to the same location the patch would put them (the moving is done completely external to Bazel though). The way the patch works though might be described as "completely external to Bazel" since Bazel doesn't track those indexes when building locally.

When you initially download from the cache you preserve timestamps on all of the unit/record files?

Hopefully the above explains that Bazel still downloads the indexes. I'm not sure how it determines timestamps in that case, but even if it's the current time, from the PoV of the build, that is correct, since it had to rebuild and the timestamps would be the current time in that case as well.

@DavidGoldman
Copy link
Contributor

DavidGoldman commented Jun 18, 2020

Ah I see, so you produce the index store as a side-effect, completely external to Bazel, even on CI?

We produce the indexes via Bazel, but that patch allows normal remote-cache downloading of the indexes, and when that happens we move those indexes to the same location the patch would put them (the moving is done completely external to Bazel though). The way the patch works though might be described as "completely external to Bazel" since Bazel doesn't track those indexes when building locally.

When you initially download from the cache you preserve timestamps on all of the unit/record files?

Hopefully the above explains that Bazel still downloads the indexes. I'm not sure how it determines timestamps in that case, but even if it's the current time, from the PoV of the build, that is correct, since it had to rebuild and the timestamps would be the current time in that case as well.

I see, I was thinking of doing something similar, with the optimization instead being we don't reimport per-target index stores unless they have changed (e.g. different timestamp on the zip itself). Then we don't need the optimization you're adding here although it could be useful if you're remapping a locally made index store.

It seems like your approach may be using folders though (tree artifacts) in which case the timestamps might be a bit different.

index-import.cpp Outdated Show resolved Hide resolved
@DavidGoldman
Copy link
Contributor

Ah I see, so you produce the index store as a side-effect, completely external to Bazel, even on CI?

We produce the indexes via Bazel, but that patch allows normal remote-cache downloading of the indexes, and when that happens we move those indexes to the same location the patch would put them (the moving is done completely external to Bazel though). The way the patch works though might be described as "completely external to Bazel" since Bazel doesn't track those indexes when building locally.

When you initially download from the cache you preserve timestamps on all of the unit/record files?

Hopefully the above explains that Bazel still downloads the indexes. I'm not sure how it determines timestamps in that case, but even if it's the current time, from the PoV of the build, that is correct, since it had to rebuild and the timestamps would be the current time in that case as well.

I see, I was thinking of doing something similar, with the optimization instead being we don't reimport per-target index stores unless they have changed (e.g. different timestamp on the zip itself). Then we don't need the optimization you're adding here although it could be useful if you're remapping a locally made index store.

It seems like your approach may be using folders though (tree artifacts) in which case the timestamps might be a bit different.

Does putting the up-to-date check behind a flag (enabled by default) seem reasonable then?

@brentleyjones
Copy link
Contributor Author

Does putting the up-to-date check behind a flag (enabled by default) seem reasonable then?

Yes. I can add a flag for it, unless @kastiglione thinks it shouldn't have one.

@keith
Copy link
Member

keith commented Aug 28, 2020

flag sounds fine

@brentleyjones
Copy link
Contributor Author

Now that I'm back from paternity leave I'll get around to this soon!

Brentley Jones added 2 commits November 18, 2020 08:32
This allows specifying a subset of an index store to import, if you know which files changed.
@segiddins
Copy link

ping @keith @brentleyjones

@brentleyjones
Copy link
Contributor Author

I'll poke around with this next week. Sorry for the delay.

@brentleyjones
Copy link
Contributor Author

The first commit has landed with #52. I'll make a new PR soon (sooner than last time 😉) for the remaining commit.

@jerrymarino
Copy link
Contributor

@brentleyjones I've added the ability to import indexes from a specific compilation, this is useful for managing a global index in Bazel with minimal overhead.

I also realized that we can use this approach to incrementally import units and records when they are compiled in the build system: as an aspect or reading in a BEP stream.

Since the unit contain a pointer to the record, it just pulls dep records for the unit.

@jerrymarino
Copy link
Contributor

#53 - this one should be able to use the other bits you've added for incremental too 👍

@brentleyjones
Copy link
Contributor Author

I think #53 covers the remaining part of this for now. If there is still desire for whatever was left here, let me know and I'll work on getting a new PR rolling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants