Support *-sys packages larger than 10MB (somehow) #40
Comments
I thought about this a little when I was working out how to get binary dependencies for a Rust project. In the end, I decided that what the build script should try (note: this was in Python, before Cargo had build scripts):
I've always felt that just compiling the source is dicey as Windows doesn't have a C compiler by default. Since Rust no longer depends on GCC, you can't even assume that is present on Windows. Besides which, it basically ignores any version installed on the system, which might cause surprising behaviour ("but, I updated libsplang on my system to close the security vulnerability; how'd I get exploited?!", or "why can't prog-a and prog-b share files? They're both using libsplang!"). It might be worth having a standard
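The concern above about ignoring a system-installed copy can be illustrated with a small probe. This is only a sketch, not the build script described in this comment: it assumes the library ships a pkg-config file, and `LibSource`, `system_has_lib`, and `choose_source` are hypothetical names introduced for illustration.

```rust
use std::process::Command;

/// Where the native library should come from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum LibSource {
    /// A copy already installed on the system.
    System,
    /// The source tree bundled with the crate.
    Bundled,
}

/// Returns true if `pkg-config --exists <name>` exits successfully.
/// If pkg-config itself cannot be started (common on Windows), returns false.
fn system_has_lib(name: &str) -> bool {
    Command::new("pkg-config")
        .arg("--exists")
        .arg(name)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Prefer the system copy so that system-wide security updates take effect;
/// fall back to the bundled source only when no system copy is found.
fn choose_source(name: &str) -> LibSource {
    if system_has_lib(name) {
        LibSource::System
    } else {
        LibSource::Bundled
    }
}
```

A build script could call `choose_source("libsplang")` and compile the bundled source only when it returns `LibSource::Bundled`. That fallback still needs a C compiler, so it does not address the Windows problem raised above.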
Another possible route here would be to compress with
@DanielKeep: I'd use system packages for cld2, but it's not a very widely-packaged library. Plus, I need a build solution for Heroku, where I have no control over the installed libraries. lifthrasiir has just sent me emk/rust-cld2#1, which removes cld2's documentation, deletes some unused data tables, and strips comments from the source code (which substantially boosts compression performance). This gets the rust-cld2 crate under 10MB, at least for this version, though the recent update to the upstream project may break it. Is there any way to run a custom script during the packaging process? If not, maybe I need to fork cld2 and produce a stripped-down git repo. Or cache tarballs on S3, but I'm trying to avoid that. I'd love to find a good solution here.
@alexcrichton If a crate contains data whose inherent entropy exceeds 10MB, we are left with no choice but workarounds. In the particular case of cld2, the main source of excess entropy is a comment (with UTF-8-encoded words for each entry), and removing comments really helps, but the table itself already exceeds 10MB and no common general-purpose compressor can easily pack it. (My estimate is that the actual entropy is some 7 or 8MB, as about 40% of the data is somewhat correlated. But that correlation wouldn't be very easy to infer.)
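The comment-stripping step mentioned above can be sketched as a small state machine. This is an illustrative sketch, not the actual patch in emk/rust-cld2#1; `strip_comments` is a hypothetical name, and the sketch assumes plain C/C++ source without raw string literals, digit separators, or backslash line continuations inside `//` comments.

```rust
/// Removes `//` and `/* */` comments from C or C++ source text.
/// String and character literals are copied through unchanged.
/// Simplification (assumption): raw string literals, digit separators,
/// and line continuations inside `//` comments are not handled.
fn strip_comments(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '"' | '\'' => {
                // Copy a literal verbatim, honoring backslash escapes.
                out.push(c);
                while let Some(d) = chars.next() {
                    out.push(d);
                    if d == '\\' {
                        if let Some(e) = chars.next() {
                            out.push(e);
                        }
                    } else if d == c {
                        break;
                    }
                }
            }
            '/' if chars.peek() == Some(&'/') => {
                // Line comment: drop everything up to, but not including, the newline.
                while let Some(&d) = chars.peek() {
                    if d == '\n' {
                        break;
                    }
                    chars.next();
                }
            }
            '/' if chars.peek() == Some(&'*') => {
                chars.next(); // consume the '*' that opens the comment
                let mut prev = '\0';
                while let Some(d) = chars.next() {
                    if prev == '*' && d == '/' {
                        break;
                    }
                    prev = d;
                }
                // Emit one space so the tokens on either side stay separated.
                out.push(' ');
            }
            _ => out.push(c),
        }
    }
    out
}
```

Running something like this over the table files before packaging removes the per-entry comment text, which is the part described above as the main source of excess entropy; the table data itself is left untouched.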
@lifthrasiir we've got to draw the line somewhere in terms of package upload size, or otherwise it'll get out of hand. Some crates will always fall on the other side of the line (and this one may, for example).
Yeah, I can see there's an obvious tension between:
Then there are the semi-evil solutions, including breaking cld2 up into multiple packages by language detected, or some such. I'm going to try to figure out how these tables fit together, and see if I can find a clever solution.
Using @lifthrasiir's well-researched patch as a starting point, I've created a new git mirror of the upstream cld2 repository, stripped the comments as proposed, and built an There are a bunch of table files which aren't getting included in the current build, and I'll need to look into those later. So maybe we'll see this problem again in the future. But at least for now, for this one package, we appear to have a workable solution. Thank you to everybody who helped out, especially to @lifthrasiir for figuring out how to cut down the package size.
With the change I just merged, please contact help@crates.io to get the maximum size increased individually for a particular crate.
(Continuing a discussion started here.)
The cld2 library is a natural-language detection library from Google, and it does some pretty cool stuff. I've packaged it as two Rust libraries, cld2 and cld2-sys. But because the upstream cld2 library is packaged by very few Linux distributions, I've chosen to distribute the source code with the cld2-sys package and build it using the Rust gcc library. So far, so good: all this works quite nicely.
But I can't upload the package to crates.io because it contains statistical language models, and those models are just too big:
I can shrink this down somewhat (by omitting everything I don't need for the build), but I almost certainly can't get it under the 10MB limit. I can think of a couple of ways to address this issue:
*-sys packages will be larger than 10MB, and provide some way to override the limit selectively.
build.rs to download it. But this introduces a dependency on an outside data source that may go away.
Any thoughts on the best way to handle this? Thank you for your advice, and for a great package-management system!
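For anyone who wants to check a source tree against the limit before attempting an upload, a rough local estimate can be sketched as below. This is an assumption-laden illustration: `LIMIT_BYTES`, `dir_size`, and `fits_limit` are hypothetical names, 10MB is taken to mean 10 * 1024 * 1024 bytes, and the sum of uncompressed file sizes is only a rough proxy for the size of the packaged archive.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// The upload limit discussed in this issue, assumed here to be 10 MiB.
const LIMIT_BYTES: u64 = 10 * 1024 * 1024;

/// Recursively sums the sizes of all regular files under `dir`.
/// Symbolic links are not followed.
fn dir_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_size(&entry.path())?;
        } else if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}

/// Returns true if the uncompressed tree is no larger than the assumed limit.
fn fits_limit(dir: &Path) -> io::Result<bool> {
    Ok(dir_size(dir)? <= LIMIT_BYTES)
}
```

Because the packaged archive is compressed, a tree that fails this check may still fit once packaged, so the result is best read as a warning rather than a verdict.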