Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace resolved field by hash #64

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Replace resolved field by hash #64

wants to merge 4 commits into from

Conversation

doxiaodong
Copy link


# Detailed design

Replace the `resolved` by a `hash` field.
Copy link
Member

@bestander bestander May 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think hash would not be enough.

resolved "https://registry.yarnpkg.com/abbrev/-/abbrev-1.1.0.tgz#d0554c2256636e2f56e7c2e5ad183f859428d81f"

should be converted into

  resolved "/abbrev/-/abbrev-1.1.0.tgz#d0554c2256636e2f56e7c2e5ad183f859428d81f".

Where https://registry.yarnpkg.com will be substituted by default and can be overridden with a setting in .yarnrc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it would require everyone to update their lockfiles to have the partial tarballs URLs, that may limit the usefulness of this feature.
How about having a config that would swap registries, e.g.

registry-replace "https://registry.yarnpkg.com=https://registry.npm.taobao.com;https://registry....=...."

It would allow people from remote areas seamlessly switch registries for existing projects

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bestander as noted in my motivation comment, retaining the registry URL within in a yarn.lock file could leak company repositories, so entirely removing them from a yarn.lock file would be nice.

Actually it would require everyone to update their lockfiles to have the partial tarballs URLs, that may limit the usefulness of this feature.

I can see that it would require intervention by each project to upgrade their yarn.lock files to follow the new approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you still release your yarn.lock to open source then you want the https://registry.yarnpkg.com domain to be present.
registry-replace "https://registry.yarnpkg.com...." setting could be used to replace the URLs at fetch time and your project could be configured not to leak artifactory domain

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why insn't the hash enough @bestander? Isn't the url completely inferred from repository+name+version?
From package.json, yarn can already find what is the correct url (based on the registry in the configuration), so it should be the same in this case?

Here the hash would be present to ensure the artefact is the right one only, not for localisation at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tarball URL is returned by npm backend https://github.com/yarnpkg/yarn/blob/master/src/resolvers/registries/npm-resolver.js#L189, Yarn is not constructing it.

I don't think URL reconstruction is justified

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a way, the whole point of this RFC to discuss do that: constructing the URL on the fly :)

When you say it is the npm backend constructing the URL, do you mean that IF we wanted to reconstruct it, we would need to store in the lockfile that it is a npm dependency so that we know we need the npm backend to reconstruct the full URL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True :)

I mean that npm is responsible for sending us the URL, for compatibility reasons we probably don't want to move this logic to Yarn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you don't need to move it in yarn, you simply need to ask npm again every time :)

I understand this is a tricky place to make changes, but personally, I feel like yarn is broken by design by the choice of storing urls directly in the lock file (and I'm curious to see how npm is going to tackle this with their new coming version).
As it currently is, it cannot scale, you can't use proxy caches easily, you potentially expose private information publicly, and so on.

The alternative of using a registry-rewrite is interesting too though, so maybe it will be simpler (in terms of retro-compatibility) to use that and keep the registry URL in the lockfile.
Having the URL in the lockfile also has the merit to ease the use of multiple repositories (not proxies of existing ones) side to side (even though I'm not sure how you can add a dependency with a different registry easily currently with yarn).
In that case the repository base-url (e.g., https://registry.npmjs.org) would act as a kind of identifier of a given repository, and users could configure mirrors (e.g., https://registry.npm.taobao.org) for these repositories (i.e, the rewrite you talked about) in their configuration.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that npm is responsible for sending us the URL, for compatibility reasons we probably don't want to move this logic to Yarn.

Okay, so now I see what you mean, and I agree that entirely removing the resolution URL from yarn.lock would mean that yarn would need to generate its own URLs internally, which has all the downsides of additional code for a spec yarn doesn't control.

If the URL must follow a specific spec, could the logic used to generate the value of dist.tarball be refactored out of npm and yarn, and shared?

The alternative of using a registry-rewrite is interesting too though, so maybe it will be simpler

Another thought that came to mind is what happens where there are multiple registries that need to be overwritten? I've been assuming a developer only needed to override a single registry, but what if a yarn.lock has references to multiple artifact registries? A developer would need to grep the yarn.lock file, enumerate all registries used, and write registry-rewrite configs for each.


In yarn.lock, the `resolved` field includes registry such as `https://registry.npmjs.org`.
In China, most developers will set it to `https://registry.npm.taobao.org` for speed; but it seems slow for travis-ci and circleci.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add as a motivation that the current approach leads to developers leaking their internal artifact repository sites to the public internet via yarn.lock if they have their company's artifact repository configured in a .npmrc or .yarnrc file.

# Detailed design

Replace the `resolved` by a `hash` field.
The `url` in `resolved` is unnecessary; keeping `hash` is enough.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get an example of what the resolved field would look like following this standard? (Kind of like your examples in yarnpkg/yarn#3330)


# Unresolved questions

No questions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an open question, I would like to ask how will this be rolled out to all Yarn using projects? Will Yarn replace the entire yarn.lock file? Will Yarn only use the new format for changed resolutions in the yarn.lock file?


Just set the registry before `yarn install` if you do not want to use `https://registry.npmjs.org`.
Or use `yarn install --registry=https://registry.npm.taobao.org`.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably need to emphasize on the documentation website that install will use the .yarnrc configured registry, the command line configured registry, or will fallback to the default.


# How We Teach This

Just set the registry before `yarn install` if you do not want to use `https://registry.npmjs.org`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference to the npm registry needs to be replaced with a reference to the Yarn registry - https://registry.yarnpkg.com

@bestander
Copy link
Member

We may want to split the discussion into several RFCs.
Feel free to spawn your own version

@szimek
Copy link

szimek commented May 12, 2017

In my company we had a bit different issue. It's a bit complicated ;)

We've got our own registry that stores our own private scoped packages and also acts as a proxy to the public npm/yarn registry for public packages. The problem is that developers could configure yarn in 2 ways:

  1. use our own registry for all packages
  2. use public npm/yarn registry for public packages and our private registry for our private scoped packages (this setting is useful if someone publishes public packages to public npm registry)

Then we noticed that this causes a mess in yarn.lock files, because after upgrading public packages, some developers had resolved urls pointing to the public registry and some to our private one. The second setting is better for our developers, but we'd prefer that our CI always used our own registry... With urls in yarn.lock file, it's not really possible.

@nighca
Copy link

nighca commented May 12, 2017

I think storing hash instead of the whole url is good for keeping yarn's logic simple & robust. Caching url may gain some performance benefit, but that forces yarn to deal with the situation that different registries exists in one project because that may happen.

Supporting multiple registries in one project is a huge challenge, considering different use cases in different team. I don't think yarn is ready to face it when yarn includes the field resolved in yarn.lock file.

IMP, we should abandon this feature (storing url including registry for each package) first. Then reconsider the use cases & support it in a more reasonable way. Of course keeping compatibility will be the hardest part. Maybe there have to be a breaking change.

Just for more discussion:)

@ghost
Copy link

ghost commented May 12, 2017

In my company we had a bit different issue. It's a bit complicated ;)

@szimek we have a similar setup at our company.

We have two internal registries, once that is purely a mirror of the public npm registry, and a second registry that contains just our company's scope, @mycompany (as an example).

We then have developers setup the following in their .npmrc file:

registry=http://artifactory.example.com/artifactory/api/npm/npm
@mycompany:registry=http://artifactory.example.com/artifactory/api/npm/npm-mycompany/

We decided against a virtual registry built on top of the other two (We are not sure whether we will ever create the single virtual registry).

because after upgrading public packages, some developers had resolved urls pointing to the public registry and some to our private one.

This happens at our company as well. We don't force developers to use the registries I listed above (except for the scoped registry, but that's the only place those are published).

but we'd prefer that our CI always used our own registry

Eventually we plan on taking the Facebook approach and isolating our CI environment from the internet, which means that any external references will fail those CI jobs.

@bestander
Copy link
Member

In terms of complexity I would not sign up on this drastic change yet.
There are many hidden drawbacks not listed in this RFC and code complexity and compatibility need to be considered.

I suggest focusing on specific use cases and find a way to address them with the existing structure instead of starting a significant revamp.

@OpenGG
Copy link

OpenGG commented Jun 2, 2017

Nay.

Changing resolved to hash will break backwards-compatibility.

@doxiaodong As to your issue - registry mirror config may be overridden by resolved in yarn.lock, it be addressed and improved by yarn implementation. I think changes to RFC may not be necessary.

@ghost
Copy link

ghost commented Jun 2, 2017

registry mirror config may be overridden by resolved in yarn.lock

@OpenGG are you saying that we should manually replace registry URL paths in yarn.lock?

@OpenGG
Copy link

OpenGG commented Jun 2, 2017

@destroyerofbuilds No. That's a temporary hack with current yarn implementation.

NPM@5 introduces interesting changes related to this issue, I suggest we all give it a try before jumping to any conclusion.

If you generated your package lock against registry A, and you switch to registry B, npm will now try to install the packages from registry B, instead of A. If you want to use different registries for different packages, use scope-specific registries (npm config set @myscope:registry=https://myownregist.ry/packages/). Different registries for different unscoped packages are not supported anymore.
http://blog.npmjs.org/post/161081169345/v500

@bestander
Copy link
Member

Yep, that is what we need to do

@ghost
Copy link

ghost commented Jun 2, 2017

If you generated your package lock against registry A, and you switch to registry B, npm will now try to install the packages from registry B, instead of A.

Should that be pulled out into a separate RFC?

Lastly, how does the npm approach address item 2 in this RFC - the current approach leads to developers leaking their internal artifact repository sites to the public internet via yarn.lock if they have their company's artifact repository configured in a .npmrc or .yarnrc file.

@bestander
Copy link
Member

bestander commented Jun 2, 2017 via email

@ghost
Copy link

ghost commented Jun 2, 2017

:1. I would go ahead and start a new RFC.

👍 I like the idea of a small, targeted, RFC that helps to bring Yarn in-line with npm behavior that, honestly, seems quite reasonable.

How about still using public URLs in yarn.lock but if you have a private
repository in the config the actual requests will have a URL replaced
during runtime?

Please bare with me as I verbalize this a little to see if we're on the same page.

I have registry=http://artifactory.example.com/artifactory/api/npm/npm in my .yarnrc configuration file that points to my company's super secret internal artifact repository.

However, when Yarn generates the yarn.lock file, Yarn uses the default yarn registry URL?

Later, when I run yarn, Yarn extracts the registry configuration option from .yarnrc, and re-writes the registry URL using my configuration override?

So technically the fetches happen against the registry URL I've specified, even though the yarn.lock specifies other URLs?

That behavior, if I'm understanding it correctly, lines up with the behavior of npm mentioned by @OpenGG.

That seems intuitive, since npm@5 does most of that already, and the aforementioned new RFC would bring Yarn inline with that behavior already.

The only divergence is that npm@5 writes out the override from .npmrc to the package-lock.json file.

Therefore, the only change would be that the override in .yarnrc would never be applied to the default when writing out unscoped packages to yarn.lock. I assume, too, that would apply to scopes?

@bestander
Copy link
Member

bestander commented Jun 2, 2017 via email

@victornoel
Copy link
Contributor

And thus how do we support the use of multiple registry then? For example for packages deployed on a tertiary public repository?

One of the important point of the RFC was that we want the yarn.lock to contain coherent information, for example for community users of a company developed application.
Example:

  • if a package from the public yarn/npm registry is installed normally, its resolved field would contain the yarnpkg url (which could be overridden at runtime as discussed by @destroyerofbuilds just above)
  • but if a package is available only from a tertiary repository, you don't want to have the yarnpkg url in the yarn.lock, it doesn't make any sense, since the package does not exists on npm/yarnpkg.

So how do you differentiate them? I see some ideas:

  • using scoped registry as with npm (@myscope:registry=https://myownregist.ry/packages/) will make the registry be written in the yarn.lock, but then it would be not so coherent with the way the registry setting, as detailed by @destroyerofbuilds above, would work. So it's not a good solution IMHO.
  • using a parameter to the add command and/or in the version field of the package.json (something like version@url for example, in a way it is a bit similar to using urls directly or github url, but with yarn resolution involved in the process)
  • having another configuration in the user settings file to map some package and/or scope to a specific repository, to be used as-is in the lock file.

For the record, this is how maven tackles this situation (and maven has a lot of this kind of use with multiple repository containing different packages):

  • you only record the name and version of a package to identify it (and there is no lock file because there is only fixed exact versions used)
  • you have a set of repository configured, and they are all queried for the package until one is found (so it means there is not hash or whatever to ensure the downloaded package is actually the same as before, this is often a problem with these use case with maven!)
  • you also have a mechanism to specify mirrors: you map a repository (by its id, because repositories have id, but it would be the same to refer to it via its url) to a mirror, and maven will replace at runtime the url of the first by the second (a bit like the currently discussed mechanism by @destroyerofbuilds above)

If we were to make a parallel to the present discussion:

  • we have repository urls written in the lock file, so we know for each package what is its original canonical repository as well as its hash
  • we would need a way to map a repository (expected to be present in a lock file) to another one: the use case is mirrors
  • we would need a way to specific a repository for a package or a set of packages (scope): the use case is tertiary repositories

This is just some food for thought for whoever write another RFC on the matter, not a fully thought solution :)

@bestander
Copy link
Member

great points, @victornoel!

@BYK BYK self-assigned this Sep 29, 2017
@BYK BYK removed their assignment Oct 6, 2017
@Gudahtt
Copy link
Member

Gudahtt commented Aug 19, 2018

This discussion is a bit fractured (here, here, here, and here), so I've attempted to summarize my interpretation of it here.

Purpose of the resolved field

As far as I can tell, this field serves two purposes:

  • caching the URL for the package, to avoid having to calculate it again
  • caching the exact version of the package (i.e. the hash) as reported by the registry

We still need the latter functionality of course. It's not clear to me whether we need the URL or not. There was a comment here suggesting we could recalculate the URL at runtime, but keeping the cached URL may have performance advantages.

Problems with the current resolved field

  1. The registry in the lockfile doesn't always match the registry currently in use.
    • This results in changes to the lockfile after an install
    • This forces Yarn to resolve that version all over again, which slows down the installation process and could produce an inconsistent result
  2. It's not clear which registry should be used
    • In most cases, the registry specified by the current user's registry setting should be used, falling back to the default registry
    • In some cases, packages should be downloaded from alternate registries (e.g. private packages)
  3. The lockfile shows which registry was used by the user or system that generated it
    • This is misleading and confusing
    • This could leak information that users would prefer not to share

npm's solution

  • They effectively ignores the registry specified in the resolved field. This solves the 1st problem.
  • They use the registry configured by the current user, falling back to the default registry. They also allow you to use alternate registries for specific scopes. This solves the 2nd problem
  • The 3rd problem remains unaddressed.

Yarn's solution

In the short term (i.e. pre-v2.0), the approach taken by npm seems like the best route to take. PR #84 addresses that, but I think we should adjust it to match npm's solution more closely. I might put up a separate PR to discuss that further.

If we're willing to consider making a breaking change, then we should remove the registry from the resolved field. Removing it would simplify and shorten the lockfile, eliminate the potential for confusion, and and avoid leaking potentially secret information. It would solve the 3rd problem mentioned above.

I can't think of any downsides to doing this, aside from the fact that it would be a breaking change. There are breaking changes to the lockfile being considered anyway (see here). If those changes go ahead, it would be worthwhile making this change as well. If not, then I'm not sure it's worth the effort.

@Gudahtt
Copy link
Member

Gudahtt commented Aug 20, 2018

After further consideration, adopting the behavior of npm v5 might be a breaking change after all (as mentioned here).

What I had in mind was for Yarn to do as npm v5 does, which is ignore the registry specified in the resolved field. The registry used would be the one specified in .yarnrc, or the default registry if that was not set.

Yarn already supports custom registries and even scoped registries via .yarnrc. If they were used over the yarn.lock entries, our problem would be solved. That would be a lot simpler to use than the proposed solution, which is to add an additional override-registries config setting.

Unfortunately, it is possible that people are relying upon the current behavior. The most common situation in which you'd want to use an alternate registry for a select group of packages is for internal packages, which should be scoped under an organization name. Scoped registries address that use case quite well. But if anybody out there is relying upon alternate registries for non-scoped packages, that behavior would break if we were to make .yarnrc registry override the registry in the resolved field.

That is, Yarn would have to drop support for alternate registries for non-scoped packages, which are currently supported.

This situation is a bit muddy, because I don't know that support for alternate registries for non-scoped packages was intended, or whether it works that way incidentally. I also don't know if anybody would care if that behavior was lost, whereas it is abundantly clear that a resolution to this problem would be appreciated.

@arcanis
Copy link
Member

arcanis commented Aug 20, 2018

If we're willing to consider making a breaking change, then we should remove the registry from the resolved field. Removing it would simplify and shorten the lockfile, eliminate the potential for confusion, and and avoid leaking potentially secret information. It would solve the 3rd problem mentioned above.

Yes! That would be great. It's actually already planned for the 2.0, but noone started working on it yet: yarnpkg/yarn#5892

Scoped registries address that use case quite well. But if anybody out there is relying upon alternate registries for non-scoped packages, that behavior would break if we were to make .yarnrc registry override the registry in the resolved field.

For scoped packages, I think the expectation is that those users would just have to share the same configuration. They can easily do this by checking in a yarnrc in the repository that would only contain the scrope registries.

For non-scoped packages, it never was supported in the first place (after all the lockfile has a big "DO NOT EDIT" notice at the very top) and I don't think we should try to support it. Especially since this feature is scheduled to be released with a major version bump.

@Gudahtt
Copy link
Member

Gudahtt commented Aug 20, 2018

For non-scoped packages, it never was supported in the first place (after all the lockfile has a big "DO NOT EDIT" notice at the very top) and I don't think we should try to support it.

Fantastic

Edit: Though... this could be accomplished without editing the lockfile. Installing a package, changing the registry, then installing another package could result in a lockfile with alternate registries for non-scoped packages.

That seems like a bit of an edge case though.

Especially since this feature is scheduled to be released with a major version bump.

Removing the registry from the lockfile is scheduled for then. I was hoping we could let the registry in .yarnrc override the one in the lockfile sooner than that, as a non-breaking change.

@arcanis
Copy link
Member

arcanis commented Aug 20, 2018

I was hoping we could let the registry in .yarnrc override the one in the lockfile sooner than that, as a non-breaking change.

We already have kinda-breaking changes in master (the new integrity field which, without being breaking per-se, will cause a bunch of changes into most lockfiles), so I think it would be better for the next release to be the 2.0, ideally.

@Gudahtt
Copy link
Member

Gudahtt commented Aug 21, 2018

it would be better for the next release to be the 2.0, ideally.

Well, that's exciting! That leaves us with little time to update these RFCs. I've just put up a PR to rewrite the existing PR to more closely match npm@5's behavior: #99 .

As for this PR... I did find a downside to removing the registry from the resolved field. Or rather, I looked into why npm chose to preserve it.

They are using the resolved field to cache the tarball URL. The registry is an important part of that cache, because different registries might server the same tarball from different paths. The field is no longer useful as a cache without the registry.

It looks like npm requests metadata from the registry if the registry doesn't match the resolved field, meaning there's an extra round-trip involved in downloading the package. Extra round-trips are not good.

What is the impact to Yarn if we stop using resolved as a cache? Is that information cached somewhere else already perhaps? If not, perhaps we should make sure it is before removing it from the lockfile.

@arcanis
Copy link
Member

arcanis commented Aug 21, 2018

Hmm I'm not entirely sure I follow - we currently have these informations: registry configuration + resolved field (registry + url). Assuming the registries from the resolved fields should always match the registry configuration (or be updated to match it), then this information is strictly redundant, and even if we were to use it as a cache key (I don't think we do, iirc the cache key is based on the package name + package version) we would be able to reconstruct it at runtime.

Does that make sense?

@Gudahtt
Copy link
Member

Gudahtt commented Aug 21, 2018

After tracing the install process taken by Yarn, it looks like it does use that field as a cache. I can't see how it would construct it without asking the registry for metadata.

Here is an example of the URL we're discussing: https://registry.yarnpkg.com/@gulp-sourcemaps/map-sources/-/map-sources-1.0.0.tgz. This is stored in the resolved field, along with a hash.

In the resolve step of an installation, Yarn will collect a manifest for each dependency being requested. This manifest always includes a remote field, which itself has a reference field. This is the same URL as the one stored in the resolved field, and is later used by the fetcher during the fetch step.

The resolve step constructs a PackageResolver, which creates a PackageRequest for each dependency requested. This PackageRequest ends up constructing a registry resolver to gather the information needed to download the package from a registry (at least for registry requests). If the lockfile exists, the information stored there is used. Otherwise, a request is sent to the registry to obtain the package metadata, including the URL in question.

That is the round-trip that I am concerned we might be adding to each installation if we remove the field altogether, without caching that information elsewhere.

Edit: I should add that conceptually I still agree that the field should be removed. But assuming I'm right about there being a performance impact, we would have to accept that cost.

We could cache that information somewhere else instead; either as part of this change, or at a later date.

@Gudahtt
Copy link
Member

Gudahtt commented Aug 22, 2018

I've done a bit more investigation on the option of generating the tarball URL using the standard format (${registry}${packageNameWithScope}/-/${packageNameWithoutScope}-${packageVersion}.tgz).

It looks like this approach was taken by pnpm, but it led to bugs. Apparently, not all tarballs are served from this URL. Some tarballs have non-standard URLs for some reason (e.g. https://registry.npmjs.org/esprima-fb/-/esprima-fb-3001.0001.0000-dev-harmony-fb.tgz), and others are served from different paths altogether (as demonstrated here). Apparently private registries such as npm Enterprise use different URL path structures.

The solution taken by the pnpm team was to only cache the tarball URL if it differs from the standard format (details here). That solution works fine for most cases, if you don't switch registries. If you do, then you could end up trying to use a generated tarball URL to download from a registry that uses an alternate tarball path.

Ultimately, I don't see any straightforward way to remove the tarball URL from the lockfile without hurting performance. It serves to make the installation process significantly faster if your registry matches the registry used in the cached (which it does for most users, most of the time).

At best, we could omit the tarball URL in certain known cases (i.e. known registries like registry.yarnpkg.com or registry.npmjs.org) can calculate them instead, but only if we had a strategy for dealing with the anomalous cases. That would mean hard-coding behaviors for specific registries, and perhaps letting the user opt-in to using auto-generated URLs (e.g. an artifactory user could ask for URLs to be generated just like they are for npm Enterprise). None of those seem like great options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants