Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP registry implementation #10470

Merged
merged 1 commit into from
Mar 24, 2022
Merged

HTTP registry implementation #10470

merged 1 commit into from
Mar 24, 2022

Conversation

arlosi
Copy link
Contributor

@arlosi arlosi commented Mar 9, 2022

Implement HTTP registry support described in RFC 2789.

Adds a new unstable flag -Z http-registry which allows cargo to interact with remote registries served over http rather than git. These registries can be identified by urls starting with sparse+http:// or sparse+https://.

When fetching index metadata over http, cargo only downloads the metadata for needed crates, which can save significant time and bandwidth over git.

The format of the http index is identical to a checkout of a git-based index.

This change is based on @jonhoo's PR #8890.

cc @Eh2406

Remaining items:

  • Performance measurements
  • Make unstable only
  • Investigate unification of download system. Probably best done in separate change.
  • Unify registry tests (code duplication in http_registry.rs)
  • Use existing on-disk cache, rather than adding a new one.

@rust-highfive
Copy link

r? @ehuss

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 9, 2022
@jonhoo
Copy link
Contributor

jonhoo commented Mar 10, 2022

I would strongly recommend not benchmarking against raw.githubusercontent.com, mainly because it has a lot of throttling applied, so you won't get meaningful benchmark results. @Mark-Simulacrum set up an endpoint for me to benchmark against back then, so may be able to get you a new one to use this time around!

@jonhoo
Copy link
Contributor

jonhoo commented Mar 10, 2022

Oh, and just for reference, I dumped the benchmarking suite I used for the prototype implementation here: https://github.com/jonhoo/cargo-http-bench. You'll probably have to adjust paths/URLs to match your setup, but at least most of the test cases are there.

@Mark-Simulacrum
Copy link
Member

Mark-Simulacrum commented Mar 10, 2022

I would strongly recommend not benchmarking against raw.githubusercontent.com, mainly because it has a lot of throttling applied, so you won't get meaningful benchmark results. @Mark-Simulacrum set up an endpoint for me to benchmark against back then, so may be able to get you a new one to use this time around!

I think the old endpoint is still around (but obviously with stale data), though I forget the exact name -- maybe @jonhoo remembers :)

In any case, my recollection is that all I had to do was just copy crates.io-index to a place that's internet-accessible (much like the githubusercontent.com link); it may be that e.g. copying crates.io-index to a GitHub Pages site is good enough. I can also refresh and/or setup a new copy of the index on Rust's infrastructure (i.e., Cloudfront), but that might take at least a few days; it would be a one-off copy without significant investment work. I expect that's probably good for benchmarking in any case, though.

@jonhoo
Copy link
Contributor

jonhoo commented Mar 10, 2022

In that case I think the URL is already somewhere in the repo I linked above, and the benchmark might just work as is!

@jonhoo
Copy link
Contributor

jonhoo commented Mar 10, 2022

Looks like I also had a local nginx proxy for doing benchmarks with various throughout throttles. I'll post that config tomorrow!

@jonhoo
Copy link
Contributor

jonhoo commented Mar 10, 2022

Added the nginx config.

@Eh2406
Copy link
Contributor

Eh2406 commented Mar 10, 2022

Avoid storing raw HTTP data on disk

I just read all the comments on #8890. This recommendation was made hear #8890 (comment). It definitely sounds like a good follow up PR, especially on windows where the filesystem tends to be a bottle neck, but not a blocker to landing this PR.

Performance optimization around updating only single crates

This one can also be a follow up PR, if it is at all tricky.

@arlosi
Copy link
Contributor Author

arlosi commented Mar 11, 2022

I repeated @jonhoo's test to compare git vs the original (greedy) fetch implementation and the new resolver based one using the local nginx server limited to 20MBps and 150 rps.

The resolver-based implementation is significantly faster, primarily because it's fetching way fewer crates. The greedy fetch is fetching about 1000 crates for each test, while the resolver fetch is usually 100-200.

Both are faster than git. If we increase the rps limit, or reduce the bandwidth then both http implementations do even better compared to git.

Note that the git-timings aren't run for most of the test cases, because they'll always be the same.
image

@Eh2406
Copy link
Contributor

Eh2406 commented Mar 11, 2022

That looks really good! Thank you for the analysis!

@arlosi arlosi marked this pull request as ready for review March 11, 2022 19:49
@Eh2406
Copy link
Contributor

Eh2406 commented Mar 11, 2022

The PR description used to have text about how to test against raw.githubusercontent.com. However that will give 429 rate limiting errors. If it is used to wildly it will get GitHub mad at us. I removed it from the description so that it doesn't end up a permanent part of the git history.

For posterity here it is:


To test with crates.io, the following raw.githubusercontent.com source can be used. Note that this is not a proper CDN and doesn't support ETags or Last-Modified.

[source.crates-io]
replace-with = "crates-io-http"

[source.crates-io-http]
registry = "sparse+https://raw.githubusercontent.com/rust-lang/crates.io-index/master"

@bors
Copy link
Contributor

bors commented Mar 17, 2022

☔ The latest upstream changes (presumably #10482) made this pull request unmergeable. Please resolve the merge conflicts.

@Eh2406
Copy link
Contributor

Eh2406 commented Mar 18, 2022

Looking forward to either the code duplication being dealt with, or documentation about why it's not worth the effort. Then I will do a full review.

@arlosi
Copy link
Contributor Author

arlosi commented Mar 18, 2022

I'm working on an update to remove the duplication around caching and tests.

There are some differences in test output between git and http backends - specifically around errors such as network unavailability. I'll try to get them as close as possible, but there may be some tests that need to be split.

simple(cargo_stable);
}

fn simple(cargo: fn(&Project, &str) -> Execs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repeated boilerplate in these tests is acceptable. And far better than it was. And still kind of unfortunate. I don't know enough about proc macros, I wonder if we could have #[cargo_test(HTTP)] fn simple generate all three functions?

Definitely not something we have to fix now, but something I want to remember to discuss.

Copy link
Contributor

@Eh2406 Eh2406 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent a significant amount of time reading this over carefully today and I think it's ready to be merged. Especially given that it is adding to an unstable feature, we can always change or fix things.

But because of the importance of this change, I want to talk over at tomorrow's cargo meeting before we merge.

if ptr == 0 {
f(None)
} else {
// Safety: * `ptr` is only set by `set` below which ensures the type is correct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not actually the type I was worried about here. It's the lifetime. What prevents us using downloads after it's been freed? It took me trying to write a counter example to realize why this is safe. ptr is only set during the call back in set. The way we use it is a little confusing, because curl needs us to have "callbacks at a distance". But our callbacks are only called within the call back to set.

I don't know how to document this better, nor do I think it has to be done before merge.

src/cargo/sources/registry/http_remote.rs Show resolved Hide resolved
Copy link
Contributor

@epage epage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is less a review and me coming up to speed on cargo. None of this is critical, so if people can't get to these questions, thats ok.

For what I understand of the code, this change looks correct to me.

src/cargo/sources/registry/http_remote.rs Show resolved Hide resolved
Comment on lines 135 to +141
Ok(SourceId::new(SourceKind::Registry, url, None)?
.with_precise(Some("locked".to_string())))
}
"sparse" => {
let url = string.into_url()?;
Ok(SourceId::new(SourceKind::Registry, url, None)?
.with_precise(Some("locked".to_string())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding, how come sparse passes in the full string while registry passes in the trailing bit (url)? Do these show up somewhere that we have to maintain compatibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http registries are still SourceKind::Registry, so the sparse+ prefix is how it knows we're in http mode. Otherwise it couldn't tell them apart.

I may be able to make the "registry" one also pass the registry+ prefix for consistency -- as long as it doesn't show up somewhere external as you mentioned.

src/cargo/sources/path.rs Show resolved Hide resolved
Comment on lines +303 to +317
if let Some(index_version) = index_version {
trace!(
"local cache of {} is available at version `{}`",
path.display(),
index_version
);
if self.is_fresh(path) {
return Poll::Ready(Ok(LoadResponse::CacheValid));
}
} else if self.fresh.contains(path) {
debug!(
"cache did not contain previously downloaded file {}",
path.display()
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is some context I'm missing but the way this reads is a bit odd.

self.fresh.contains(path) is a subset of self.is_fresh. Why are we only doing a subset of the checks in the else and why are we still processing things (and logging that we don't have it) when its in self.fresh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.fresh only keeps track of which files have been downloaded in the current session, not the contents of the file.

The "cache did not contain previously downloaded file {}" is a case when the cache isn't working, so we go ahead and get the file again. This could possibly be made into an assert, since it shouldn't happen...

Comment on lines +364 to +369
if self.config.offline() {
return Poll::Ready(Err(anyhow::anyhow!(
"can't download index file from '{}': you are in offline mode (--offline)",
self.url
)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, do we need to also check self.config.frozen() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line below ops::http_handle(self.config)?; will do the appropriate check for frozen (and offline).

So for consistency, we should either remove this check, or add the frozen one.

}
}

mod tls {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my information, I see that src/cargo/core/package.rs also uses TLS for sharing state with downloads. How come cargo uses that instead of an Arc<Mutex<T>>?

Copy link
Contributor Author

@arlosi arlosi Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure. I think an Arc<Mutex<>> would also work here. This change is based on what was done in package.rs.

Comment on lines +407 to +418
if let Some(index_version) = index_version {
if let Some((key, value)) = index_version.split_once(':') {
match key {
ETAG => headers.append(&format!("If-None-Match: {}", value.trim()))?,
LAST_MODIFIED => {
headers.append(&format!("If-Modified-Since: {}", value.trim()))?
}
_ => debug!("unexpected index version: {}", index_version),
}
}
}
handle.http_headers(headers)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm understanding correctly, index_version for a registry is registry defined?

This was something that wasn't obvious to me and took some digging. I had assumed this was a schema version number rather than a data revision number.

Digging in some more, it seems like this is pre-established within cargo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_version is used for cache validation.

For git-based registries index_version is the commit id of the HEAD of the repo when the cache file was created. This allows cargo to skip talking to libgit2 when the git repo hasn't changed.

For http-based registries index_version is a string ETag: <etag value> or Last-Modified: <last-modified value> or Unknown (in that preference order) based on what the server sent when fetching the index. This allows cargo to send If-None-Match or If-Modified-Since headers and receive an HTTP 304 (not modified) when fetching metadata.

The cache in http mode also allows --offline mode to work (where the cache is used regardless of the index_version)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming may no longer match how we use it. If you can think of a better name now would be a great time to change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming may no longer match how we use it

The schema versioning added in https://rust-lang.github.io/rfcs/3143-cargo-weak-namespaced-features.html#index-changes is what led to my confusion. it also doesn't help that we are storing structured data in a string.

As for names, its hard to say.
Unfortunately, I can't come up with a clear winner. The best I can come up with is index_digest as that doesn't convey an Ord version scheme but a Eq identifier.

If you can think of a better name now would be a great time to change it.

For me, I disagree as making unrelated improvements to a PR can derail it and make it harder to review. Soon after this PR, yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do people like any of these names?
cache_key
cache_identifier
cache_identity
cache_version
cache_tag
index_cache_<one of the above options>

Copy link
Contributor Author

@arlosi arlosi Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also doesn't help that we are storing structured data in a string

That's true. I can probably make an improvement here.

@Eh2406
Copy link
Contributor

Eh2406 commented Mar 24, 2022

I am going to merge, it is all unstable. Big thanks to @arlosi for all the work!
@bors r+

We can always merge bug fixes and clean ups.

@bors
Copy link
Contributor

bors commented Mar 24, 2022

📌 Commit 412b633 has been approved by Eh2406

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 24, 2022
@bors
Copy link
Contributor

bors commented Mar 24, 2022

⌛ Testing commit 412b633 with merge 1366225...

@arlosi
Copy link
Contributor Author

arlosi commented Mar 24, 2022

Thanks for the review! I've been unavailable this week with limited internet access, so I'll get a follow up PR created early next week to address any open issues from this one.

@bors
Copy link
Contributor

bors commented Mar 24, 2022

☀️ Test successful - checks-actions
Approved by: Eh2406
Pushing 1366225 to master...

@bors bors merged commit 1366225 into rust-lang:master Mar 24, 2022
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 31, 2022
Update cargo

13 commits in 109bfbd055325ef87a6e7f63d67da7e838f8300b..1ef1e0a12723ce9548d7da2b63119de9002bead8
2022-03-17 21:43:09 +0000 to 2022-03-31 00:17:18 +0000
- Support `-Zmultitarget` in cargo config (rust-lang/cargo#10473)
- doc: Fix document url for libcurl format (rust-lang/cargo#10515)
- Fix wrong info in "Environment variables" docs (rust-lang/cargo#10513)
- Use the correct flag in --locked --offline error message (rust-lang/cargo#10512)
- Don't treat host/target duplicates as duplicates (rust-lang/cargo#10466)
- Unstable --keep-going flag (rust-lang/cargo#10383)
- Part 1 of RFC2906 - Packages can inherit fields from their root workspace (rust-lang/cargo#10497)
- Remove unused profile support for -Zpanic-abort-tests (rust-lang/cargo#10495)
- HTTP registry implementation (rust-lang/cargo#10470)
- Add a notice about review capacity. (rust-lang/cargo#10501)
- Add tests for ignoring symlinks (rust-lang/cargo#10047)
- Update doc string for deps_of/compute_deps. (rust-lang/cargo#10494)
- Consistently use crate::display_error on errors during drain (rust-lang/cargo#10394)
@ehuss ehuss added this to the 1.61.0 milestone Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants