feat(datasource/julia-pkg-server): add Julia PkgServer
#29623
Conversation
`PkgServer`s are the primary package distribution system for Julia (the alternative being operating on the underlying Git repositories directly). This datasource enables retrieval of packages from registries through these `PkgServer`s. It supports multiple registries across `PkgServer`s. Julia-specific versioning is not yet implemented; it will be implemented as a follow-up. References renovatebot#6508.
This behavior is already tested at the `retrieveRegistryState` level. It was left over from when this was implemented directly at the datasource level.
lib/modules/datasource/julia-pkg-server/__fixtures__/ExtraHTTP.tar
// TODO: The tests for this functionality only require a single "chunk" to
// be read from the tarball to succeed. Hence, it would be sufficient to
// just process a single chunk directly as a string and return it. This
// does not hold in general and a (minimal) test requiring multiple chunks
// should be implemented
I have a tarball where this fails (obtained from the official PkgServer), but that tarball is over 7 MB and I did not want to include it as a fixture. I have tried to create a more minimal example where the simpler approach fails, but have so far been unable to do so.
How often do we need to fetch tarballs, if they can be large? And for how long can we cache them?
That is a very good question and one I'm not entirely sure about myself.
The tarballs depend on the state of the registry, so they change whenever a new package is published to a registry. This is done by issuing a PR to the repository hosting the registry. So a suitable cache duration highly depends on the expected rate of package releases, when the corresponding PRs are merged, and how soon new releases should be available to Renovate. Given that these variables will very much depend on the registry from which dependencies are pulled, I'm not entirely sure there will be a uniformly suitable answer.
That being said, the General registry (which is where all public, open-source packages tend to be published) sees a couple of releases per hour and "automatically merges" suitable PRs every 8 minutes. So at a minimum a cache duration of 8 minutes can be used, although in practice it is probably fine to cache these up to an hour, maybe even a bit longer.
If it's a single tarball per registry and not per package, then I think something like a 15-30 minute cache time is fine. If it were common for the tarball to be unchanged for longer periods of time, then we'd want to retain and use ETag or Last-Modified headers for subsequent requests, but if registries typically get updated frequently then there's little gain from that.
Yeah, it's a single tarball per registry. I believe the default `ttlMinutes` is 30 minutes, so I'll leave this at that.
`PkgServer`s support multiple "flavors", where "conservative" is the alternative currently available. That flavor tends to lag by a couple of hours and used to be the default mode. But it caused quite a few complaints from the community, related to CI specifically (see JuliaPackaging/PkgServer.jl#144 for more context), so I don't think we'd want to support it. If we ever do, I think we could update the caching strategy accordingly.
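For reference, a minimal sketch of how that would look with Renovate's `@cache` decorator; the key shape and import paths here are illustrative, not the final code:

// A sketch only, assuming Renovate's package-cache decorator; the cache key
// shape and relative import paths are illustrative.
import { cache } from '../../../util/cache/package/decorator';
import { Datasource } from '../datasource';
import type { GetReleasesConfig, ReleaseResult } from '../types';

export class JuliaPkgServerDatasource extends Datasource {
  static readonly id = 'julia-pkg-server';

  constructor() {
    super(JuliaPkgServerDatasource.id);
  }

  // Cache parsed releases for 30 minutes, matching the default ttlMinutes
  // discussed above.
  @cache({
    namespace: `datasource-${JuliaPkgServerDatasource.id}`,
    key: ({ registryUrl, packageName }: GetReleasesConfig) =>
      `${registryUrl}:${packageName}`,
    ttlMinutes: 30,
  })
  async getReleases(config: GetReleasesConfig): Promise<ReleaseResult | null> {
    // ...download the registry tarball and extract this package's versions...
    return null;
  }
}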
const tarballParser = new tarParse({
  filter: (path) => pathsToExtract.includes(path),
});

tarballParser.on('end', () => {
  if (Object.values(fileContentChunks).every(({ length }) => length > 0)) {
    const fileContents: FileContents<string> = {};
    for (const path of pathsToExtract) {
      fileContents[path] = Buffer.concat(fileContentChunks[path]).toString(
        'utf-8',
      );
    }

    resolve(fileContents);
  } else {
    resolve(null);
  }
});

tarballParser.on('entry', (entry) => {
  entry.on('data', (data) => {
    fileContentChunks[entry.path].push(data);
  });

  entry.on('error', reject);
});

tarballParser.end(tarball);
This extracts the necessary files in memory. It might be more straightforward to unpack to disk instead and work with some form of caching. I'd be happy to change it if that's better suited to Renovate's overall design.
It depends on how big these tarballs can be. ~10 MB should be OK, but bigger files can cause memory issues (OOM).
Extracting the files should work in an asynchronous, streaming fashion with this implementation, so memory should not be an issue here. I have seen this same pattern suggested for files that ran into gigabytes of data as well, for instance here, which uses the slightly higher-level `list` functionality. However, when downloading, the files are kept in memory, so the OOM would likely occur there. So the concern is still valid.
The tarballs for the largest registry currently come in at 7.7 MB, which is years' worth of data but close to the ~10 MB mark. I don't have statistics at hand for how fast this grows, but it's probably good for another couple of years at least. That being said, it probably doesn't hurt to prepare.
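As an aside, if buffering the download itself ever becomes a problem, the response could be piped straight into the parser. A rough sketch, where the stream source is assumed and not part of this PR:

import { pipeline } from 'node:stream/promises';
import { Parse as TarParse } from 'tar';

// Rough sketch, not from this PR: pipe the HTTP response directly into the
// tar parser so the full tarball never needs to be buffered in memory; only
// the chunks of the filtered entries are retained.
async function streamRegistryTarball(
  tarballStream: NodeJS.ReadableStream, // assumed: response stream of the tarball
  pathsToExtract: string[],
): Promise<void> {
  await pipeline(
    tarballStream,
    new TarParse({ filter: (path) => pathsToExtract.includes(path) }),
  );
}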
@rarkins we need to think about those big blobs. We cache every request in memory, so we have this blob twice.
renovate/lib/util/http/index.ts, line 186 in 26337ac:
(options.method === 'get' || options.method === 'head')
We probably should pass `memCache: false` for those files and only cache the smaller transformed data to the package cache.
@bauglir I think we should extract all deps from that file and add them to the package cache, so we can speed up additional lookups. We do something similar in the `rubygems` datasource.
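Something like this at the call site, assuming the http layer exposes the `memCache` option per the lines referenced above (names are illustrative):

import { Http } from '../../../util/http';

// Sketch (names illustrative): skip the in-memory HTTP cache for the large
// tarball download; only the smaller parsed result goes to the package cache.
async function downloadRegistryTarball(
  http: Http,
  registryTarballUrl: string,
): Promise<Buffer> {
  const { body } = await http.getBuffer(registryTarballUrl, {
    memCache: false,
  });
  return body;
}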
I think that makes sense. I did a quick test and it looks like parsing everything and putting that in an object results in ~800kB of memory being used (checked by capturing a heap snapshot after parsing all files in the tarball). So that should save some memory.
I'll take a stab at this. My plan would be something along the following lines. It looks like this can be a bit more straightforward than the `rubygems` use case as it doesn't have to hit multiple endpoints, etc. Am I correct in assuming that using the `@cache` decorator on the `getReleases` method, and then populating the cache under the same key with all the packages whenever that method gets executed, would be sufficient (while disabling caching for the downloaded tarballs)?
One thing I noticed in a quick experiment is that if I take this approach and Renovate needs to update multiple packages, it checks all of the `@cache` decorators before doing any `getReleases` calls. This then results in none of the caches being populated yet (on a fresh cache or after the TTL has expired) and the full retrieval still occurring multiple times. Subsequent runs then correctly hit the cache. Can I not use the `@cache` decorator this way, and should I do all the cache management directly within `getReleases`?
Most of my changes are for the docs or code comments.
In general the changes are:
- One sentence per line for docs files.
- Simpler words and phrases.
- Shorter sentences.
- Fix some typos.
This matches the Renovate Style Guide.
I'll let the maintainers do a technical review of your code and tests. 😉
* The final part of the path of a (fully specified) registry URL references
* the tree SHA of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA, it can be used
* as a caching mechanism.
Suggested change:
* The final part of the path of a (fully specified) registry URL references
* the tree SHA-1 of the state of the registry repository the package server
* is serving. Given the content-addressable nature of this SHA-1, it can be used
* as a caching mechanism.
I think Git still defaults to SHA-1 hashes? I think there's experimental support for SHA-256 now. Not sure if we need to be specific about the SHA-1/SHA-256/SHA-512 hashes here.
What happens if the datasource queries a repository that uses SHA-256 hashes? If we do not support those newer hashes, we should probably test for that, and show an error to the user.
The `PkgServer` implementation does not take this difference into account currently; it only supports SHA-1 states. Given that Renovate queries registries through the package server, I would expect this to initially get solved at the package server level.
If a SHA-256 "state" were added to a URL, it would fail to parse and be reported here.
Once package servers start supporting registry repositories using SHA-256, they will start reporting those URLs through their `/registries` endpoint, and currently this is indeed not tested for.
I think this raises a good question as to whether we should be enforcing strict formatting on the state. Either we should check it everywhere, or we can forgo the check altogether, which would just result in 404 errors when retrieving the registry, with the upside that it would not require changes to Renovate if the package server starts supporting different state formats.
Check input matches expectations
I think this raises a good question as to whether we should be enforcing strict formatting on the state.
It's probably best to check what Renovate "pulls in". So in general terms:
- Check input is valid
- Check input matches expected format
- Sanitize dangerous stuff
- Only then proceed with processing the input
SHA state check as general utility/function to be used by other Renovate code?
Checking if the state is SHA-1 vs SHA-256 vs SHA-512 sounds like something that other Renovate code could re-use? Maybe that should be a new file/function in a suitable place?
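For example, a hypothetical shared helper along these lines (the names are mine, not from the PR):

// Hypothetical helper, not part of this PR: classify a hex object ID by
// length. Git object IDs are 40 hex characters for SHA-1 and 64 for SHA-256.
export type GitShaFlavor = 'sha1' | 'sha256';

export function detectGitShaFlavor(state: string): GitShaFlavor | null {
  if (/^[0-9a-f]{40}$/.test(state)) {
    return 'sha1';
  }
  if (/^[0-9a-f]{64}$/.test(state)) {
    return 'sha256';
  }
  return null;
}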
To check or not to check?
I vote check
Either we should check it everywhere, or we can forgo the check altogether
I would vote check everywhere, but I'm not a maintainer. 😄
If we decide not to check, show a good error message
... or we can forgo the check altogether, which would just result in 404 errors when retrieving the registry ...
I guess that would look like this then?
- Renovate tries to create PR for SHA-256 registry package.
- Renovate shows generic 404 error.
- User thinks Renovate is broken.
I would prefer something like this:
- Renovate tries to create PR for SHA-256 registry package.
- Renovate shows 404 error, and explains that the SHA-256 bit is not supported yet.
- This probably prompts the user to ask the Renovate maintainers to add code to support the new hash.
Your upside is my downside
... with the upside that it would not require changes to Renovate if the package server starts supporting different state formats.
I don't agree with this. If the package server supports a new format, we probably need to account for that in the Renovate code. So it's best if we state explicitly what's supported, what's not, and have a way to get nudged when things break.
Maintainers make the decision here
I'll let the maintainers decide what to do with all this info above, and with your idea! 😉
Can we simply allow SHA-256, so we're future-proof for when the package server adds support for it? I think in the near future more and more new repos will use SHA-256 by default.
Thanks for the extensive input.
The state verification currently only applies to user-provided input. Given that a package server will directly report what URLs to use when asking it for the state of a registry, the URLs/states it provides are assumed to be correct (note that the package server hosts these URLs itself). If the package server turns out to provide wrong URLs, that will indeed result in 404s.
Currently, no validation is done on those returned URLs (but I do have a note locally on whether this is something we'd want to do, which I think is what you are suggesting). So I'm open to changing that.
Can we simply allow SHA-256?
I have added this in 1ffa7ae.
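For reference, the shape of that change (an illustrative pattern, not the literal diff in 1ffa7ae):

// Illustrative only: accept both SHA-1 (40 hex chars) and SHA-256 (64 hex
// chars) registry states.
const registryStatePattern = /^(?:[0-9a-f]{40}|[0-9a-f]{64})$/;

registryStatePattern.test('a'.repeat(40)); // true: SHA-1-sized state
registryStatePattern.test('a'.repeat(64)); // true: SHA-256-sized state
registryStatePattern.test('not-a-sha'); // false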
Co-authored-by: HonkingGoose <34918129+HonkingGoose@users.noreply.github.com> Co-authored-by: Rhys Arkins <rhys@arkins.net>
This uses `re2` under the hood which prevents certain DoS attacks. Co-authored-by: Rhys Arkins <rhys@arkins.net>
Co-authored-by: HonkingGoose <34918129+HonkingGoose@users.noreply.github.com>
…caching package releases Co-authored-by: Rhys Arkins <rhys@arkins.net>
Forgot to add in 455032c.
…r class for `re2`
is there no way to only download the required files instead of the full tarballs? 🤔
…ng the default registry Co-authored-by: Michael Kriese <michael.kriese@visualon.de>
Co-authored-by: Michael Kriese <michael.kriese@visualon.de>
Unfortunately not 😞 At least not when relying on package servers; that would require direct interaction with the underlying Git repositories. That is an alternative datasource I was planning on creating as well, but it supports different use cases (e.g. package servers enable access to registries for which one does not have access to the underlying repository) and this one seemed more straightforward to implement. I'm open to starting a discussion around whether package servers could start to provide direct access to these files, but that will likely be a long discussion and might not be possible given the design of the package server infrastructure.
This prevents binary data from having to be stored in the repository at the cost of a slightly higher test runtime. It makes it easier to inspect files that end up in the tarballs when testing.
We should start that discussion on the package server repo, so they hopefully consider it. It would be a huge win for Renovate.
lib/modules/datasource/julia-pkg-server/__fixtures__/ExtraHTTP/H/HTTP/Package.toml
…ion directly The datasource ID is defined in a central location to prevent circular dependencies when referring to it through the `JuliaPkgServerDatasource` class. For consistency it should primarily be referred to in this way.
…egistry URLs This consolidates the knowledge about the format of registry URLs a bit better in the codebase, making it easier to change the format if necessary. It primarily affects tests.
Primarily removes some duplication in tests while ensuring consistency.
…stry URLs This is not currently supported by `PkgServer` (which assumes SHA-1). This support is added for future-proofing, given that the underlying Git repositories may start to use SHA-256 instead to encode tree SHAs.
…failed HTTP requests Failed HTTP requests are also logged at the generic datasource level and from the base class. These additional warnings do not add much value.
Co-authored-by: Michael Kriese <michael.kriese@visualon.de>
I have started the discussion in JuliaPackaging/PkgServer.jl#210 and will be happy to follow up on simplifying things once/if that gets resolved.
Seems like no progress in the upstream issue.
I don't think we are. I also suspect that even if the upstream ask were resolved, it would still be a while. This has primarily been stuck on me: I got stuck a bit and haven't had the bandwidth over the last month or two to get unstuck. But I was planning on getting back to this over the next week or two. I'm sorry about the delay.
I have already done some experiments here and I'm relatively sure I can make this work. Some rudimentary tests suggest a much smaller memory/storage footprint when caching just the absolutely necessary information. Basically, I'm doing the extraction once and manually caching just the package version information instead of the tarballs and network requests. I did run into one situation I'll probably need some help with, but I can follow up on that on Slack. The gist of the issue is that the current way I have implemented the cache works properly for subsequent runs of Renovate, but if multiple packages need to be updated, the full cache update mechanism gets triggered for every package.
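Roughly, the manual caching I have in mind looks like this (a sketch: `extractAllReleases` and the import paths are placeholders, and I'm assuming Renovate's packageCache get/set utilities):

import * as packageCache from '../../../util/cache/package';
import type { ReleaseResult } from '../types';

// Placeholder for the single extraction pass: download the registry tarball
// once and parse the versions of every package in it.
declare function extractAllReleases(
  registryUrl: string,
): Promise<Record<string, ReleaseResult>>;

async function getCachedReleases(
  registryUrl: string,
  packageName: string,
): Promise<ReleaseResult | null> {
  const namespace = 'datasource-julia-pkg-server';
  const cached = await packageCache.get<ReleaseResult>(
    namespace,
    `${registryUrl}:${packageName}`,
  );
  if (cached) {
    return cached;
  }

  // Cache miss: extract once, then populate the cache for *every* package in
  // the registry so subsequent lookups are cheap.
  const allReleases = await extractAllReleases(registryUrl);
  for (const [name, releases] of Object.entries(allReleases)) {
    await packageCache.set(namespace, `${registryUrl}:${name}`, releases, 30);
  }
  return allReleases[packageName] ?? null;
}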
Changes
`PkgServer`s are the primary package distribution system for Julia (the alternative being operating on the underlying Git repositories directly, which I might contribute later but seemed like a more daunting task given the number of possible hosting platforms). This datasource enables retrieval of packages from registries through these `PkgServer`s. It supports multiple registries across `PkgServer`s (a single server can host multiple registries and the same registry can be hosted across different `PkgServer`s). Julia-specific versioning is not yet implemented; I will implement that as a follow-up.
Context
Documentation
How I've tested my work
I have verified these changes via: