-
-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cyclic build dependency #11
Comments
Might be related to what #10 was talking about. |
@cuviper Perhaps you could also help since you wrote 9adb885
At no point in this process, do I point any script to the location of any data I downloaded from unicode.org. So I'm confused where the actual source data is coming from. |
Interesting. I didn't really think about build cycles here, but I can at least try to explain what's going on. The cycle doesn't seem problematic to me, but I don't know what your specific policies are here. First and foremost, pretty much every command in Every file generated by As for ucd-generate/scripts/generate.py Lines 17 to 20 in b56d7f3
Finally, the only data files generated by |
If the However, if the tables generated for tests/benchmarks are a problem, then that's a bit more thorny to resolve... |
Additionally, |
I personally prefer using shell scripts for this sort of thing. Python will work in more environments, but it's not clear how useful that it is for a rare code generation script. See also #11 for a bit more discussion.
This brings in the Unicode 12.1.0 update from regex. ucd-generate relies on this for its `dfa` and `regex` sub-commands. See also #11.
Also, another potentially confusing thing here: the script that |
Hm thanks for the info, it will take me some time to review it and see how it affects us. In general though, it may be wise to work towards getting rid of the pattern of "running ucd-generate and then embedding the output as source code" in some other crate (e.g. bstr that we were talking about earlier), because you lose the information about which version of ucd-generate generated that output and hence which version of Unicode you were using. So you have random crate A assuming Unicode 11 whereas random crate B assumes Unicode 12, which might cause very very strange and hard-to-debug bugs. |
Can you suggest an alternative? Pretty mich everything I've seen that uses Unicode tables does stuff like this. In many cases, it's just a Python script, but I wrote ucd-generate to standardize and centralize this stuff for the Rust ecosystem. If I can't do codegen then I'm not sure what else can be done. I also don't understand why this is suddenly an issue now. Is there a new policy? Nothing substantive has changed on my end for at least a year. |
An alternative could be to bundle all possible outputs of Nothing has changed on our end, it's just that whenever a new crate gets uploaded to Debian it has to be reviewed, and this time around the reviewer caught this issue. If the issue already existed in previous versions, it's probably just the case that previous reviewers overlooked it. There's 500+ rust crates in Debian now so it's reasonable that those volunteer reviewers overlooked some issues. We were also somewhat in a rush to bundle cargo for the previous Debian stable release, and this might have contributed to overlooking things like this. |
Yeah... That is not going to work. At all. Not only would all possible combinations of the data be waaaaaay too big, but much of data is purpose built for specific use cases. I also don't quite understand how your suggestion is materially different from the status quo. I mean, there is no way that I can see to break the cycle between regex/regex-syntax and ucd-generate. They feed into each other. Let's try a different tact. Can you explain why this setup is problematic for Debian? I frankly still don't get it. I know you've sent me links to stuff before, but they don't contain any context and are generally opaque to me. Like all that's happening here is translating data from a structured format into code or finite state machines. The result is committed so that builds do not need to regenerate it. There's nothing particularly exotic happening here, and it's pretty darn standard when working with Unicode. ICU does the same stuff, for example. So does the Rust standard library, Python and numerous other things. |
A softer alternative would be to:
Then it would be easy to see if your dependency tree was using different versions of Unicode. On top of that, Cargo would enforce acyclicity.
Well, how did you originally set up the cycle? It must have been not-a-cycle at some point in the past. [*]
The problem is this spaghetti relationship of implicit build-dependencies on unknown and potentially different versions of the Unicode spec. We need to check where the real source of data/code is coming from for copyright reasons, and this is hard when dependencies are not declared explicitly. The cyclicity is a secondary and less important issue. We can hand wave it away as "oh but it's such a small amount of data" but if it's such a small amount, it should also be possible to fix the issue easily due to my [*] point above. |
That seems a bit weird to me. With the exception of the Now, the
This would imply that every compilation of
How do you deal with this in other ecosystems? For example, RE2 just updated their tables to Unicode 12.1, and it effectively uses the same approach as
The Note that other than the
But they aren't build dependencies though. You don't need to generate the Unicode tables to build this stuff, because the Unicode tables are meant to be committed to source control. Other than that quibble, I'm not sure I see the spaghetti here. If what I'm doing is spaghetti, then what everyone else is doing is even worse, because they are just hacked together Python scripts:
If you look carefully at the Python scripts above, you can see that a lot of them clearly share some ancestry. What effectively happened is that the same Python script was copied around to the various Unicode crates, tweaked for a new purpose, and then used to generate code based on Unicode tables. This is exactly how Things got a bit muddled with the addition of the So I'm not sure what to do here. It sounds like what you want is clearer provenance, and it seems to me like there should be much simpler solutions than what you've proposed thus far. Brainstorming:
|
Also, I'd like to repeat that I'm still flying blind here. Is there a document that lays out the policy that the FTP masters are enforcing? (The extent of my ignorance is deep. I don't even know what an "FTP master" is.) |
Why
"committed to source control" is an undeclared dependency, on: A. an unknown version of ucd-generate What you're doing is like building some minified javascript, commiting it to source control, and using it as an opaque piece of data during the build. That will work around technical restrictions around dependencies and ordering, but the FOSS requirements of Debian policy (which are really just moderately-strong auditing requirements derived from legal requirements) still require us to track it, and the solution here in Debian is usually to (1) delete the generated files and (2) re-run the generation command. However this is not possible here due to the cycles. Your example about RE2 is different. We can do both (1) and (2) for RE2 and the output of RE2 is (to my knowledge) not used to build generated "source" code for other packages. Or if it is, I'd guess it's done in a way that could be replaced by any other regex implementation and is not specific to that version of Unicode. That's not the case with ucd-generate/bstr where the FSMs are very Unicode-specific and clearly depend on a particular versions of (A) and (B).
The hodge podge is bad but different from "spaghetti" which refers to cycles of dependencies that need to be manually untangled. I applaud your effort to factor away the hodge podge but it's important not to introduce other issues at the same time...
There is no document that goes into the situation in as much detail as I am doing, so I don't think you'll be convinced by any of the existing documents. "clearer provenance" basically sums it up however. For the purposes of this discussion you can read "FTP master" as "copyright auditor" that you might find in the legal department of a company. HowCyclesYour suggestion to move the
After looking at this code in a bit more detail, it would be more correct to read these names from Are there any more cycles? Wouldn't these two fixes break all of them? Provenance
With the caveat that the default download version simply equals the "new convention" version, this would be a more-or-less equivalent but manual version of the Re-generating in Provenance doesn't just mean "it is potentially theoretically possible for a human to spent a long time to figure out how everything was built", it means being able to do this cheaply and quickly, and that means being able to run it in CI automatically. edit (addenum): Since we're making |
That's correct. It will be mildly annoying to do this, but if it solves your problem I'm happy to do it. It also probably makes more sense in the grand scheme of things. The current situation was a marriage of convenience, since Note though that
Reading from I'll instead probably just change
I believe so, yes. So long as "cycle" here refers to a conceptual cycle. There will still be the cycle that
I don't see much reason to put this stuff into
Right, that makes sense. Why isn't a combination of shell scripts and better documentation enough for this? Running it in CI is also something I could potentially get on board with, but sometimes the codegen takes a long time, and I'm not sure what I'd do with the result. I suppose I could just run the codegen scripts and assert that there is no error. Maybe something more elaborate would diff the output with the pre-committed files.
Yes it does. We're talking minutes here. From the
Not everything takes this long. Simple Unicode tables are generally pretty quick, but it's still on the order of a few seconds. In addition to needing to compile
I grant that this sounds nice, and if I were otherwise okay with the |
The conceptual dependency already exists whether it's undeclared or declared, all I'm saying is there should at least exist an option (via features) to build it in the explicitly declared form. Cargo can support dependency chains like
Debian runs all builds in CI, and for generated code like
This is why I suggested using features to gate what |
What I'm hearing is that hooking ucd-generate up to build dependencies is a nice-to-have and not a requirement, so I'm not going to pursue that. There are tons of crates in the ecosystem that use code generation like this, even beyond Unicode stuff. The encoding and encoding_rs crates are examples. It seems to me like if you really need to be able to regenerate code like this, then you're going to need a bespoke system to do it anyway, unless you successfully convince everyone in the entire Rust ecosystem to do code generation through |
And then there are probably oodles more examples that use bindgen to generate Rust ffi bindings for C code. bindgen can be hooked up to run in build.rs, but plenty of people don't do that, inciuding myself. For example, the pcre2 crate has code generation from bindgen committed to the repo. |
These are all anti-patterns that make life harder for packagers that are trying to follow provenance best practises, and effectively "cheats" the crate into passing reproducible builds. Yes we are trying convince everyone in the entire Rust ecosystem to do code generation and everything-else-generation through It is a requirement for Debian policy and if the crate does not do it, it makes our lives harder and the crate slower to package for Debian. Due to limited volunteer resources, sometimes stuff in breach of this makes its way into the Debian archive, but this should not be interpreted as "Debian policy allows it". |
This whole ordeal has just been super frustrating. We obviously started with different definitions of what is "principled." Moreover, the entire Debian policy here has been terribly opaque to me, with requirements and nice-to-haves and processes being completely unclear. A lot of this stuff seems like it is just coming from out of thin air. For my part, other than the cycle issues, I generally think I am being principled about this. Plenty of companies have used my Rust code, so it has presumably passed through their legal departments without issue since this is the first I'm hearing of this provenance issue. So from my perspective, this is looking to me like Debian has a terribly inconvenient policy. I don't think you get to hide behind "that's what a good FOSS project does." There are plenty of good FOSS projects that don't do this. What you're asking for is both a lot of work and a complication in how things get built, in addition to long standing bugs in Cargo that would likely impact your suggested solution. I understand why doing this in build.rs would be nicer for you. I can appreciate why it is more principled, and even agree with that perspective. But principles are not the only thing that matters. |
Everything I've said so far are very natural derivations from (1) the definition that source code == preferred form of modification, and (2) "FOSS" refers to source code, not generated code. This is hardly "opaque" or "hiding". This for example is why the GPL explicitly refers to build scripts, because it would be a loophole if a user can look at the purported source code but can't easily build it into something executable. The Debian project is indeed more strict than other projects about it, including companies with legal departments. I am at a loss to why you think the obvious and natural suggestion that "generated code be actually-generated" by the end user for FOSS verification purposes "comes out of thin air". I don't know what you exactly mean by "plenty of good FOSS projects that don't do this". A project can have world-class software engineering and design in its code and core functionality, and still make mistakes in how it designs its build process especially when it comes to adhering to FOSS principles. It was (e.g.) a common mistake to check in autogenerated What I am suggesting is not that much work. I can submit a patch once the other stuff you mentioned previously is done, and you can see how much it is. This is a Debian "advice" document that explicitly mentions what I am saying, but the same ideas are repeated in slightly different forms in many other places. |
Thanks for linking to something. That helps.
As I've stated a few times, I am not opposed to better provenance tracking. I am simply strongly skeptical that adding build dependencies for core crates in the ecosystem on something that was built as a tool for maintainers (that is, On top of all of that, aside from the cycle problem, I don't see how the status quo is in any way contradictory of the policies you've described. It sounds to me like your I do appreciate your offer to submit a patch, but I not only think you're under-estimating the amount of work involved here (building the So basically, when I take a step back from all of this and look at it in the grander scheme of things, then we have on the one hand this (IMO, complex/bloated, somewhat of an impedance mismatch) choice to add the code generation step to I think the other part of the problem I've been having here is probably somewhat emotionally irrational, but it's of the form, "why are you picking on me now, when all these other examples of software packaged by Debian didn't have to go through this." I understand why, as you explained, those other pieces of software just slipped through the review process. But now I'm at a point where it seems like this policy isn't a strict requirement, and it feels to me like I'm having significant extra work foisted on to me because of a Debian policy that I largely think is unnecessary. Namely, for my projects, I do try to make sure that all auto-generated files can be re-generated. And in cases where I've failed there, I'm happy to take some steps to improve that. So there's no essential conflict in the values we share here; we're just differing on the extent to which this process is baked into tools. |
Also, AIUI, Cargo does not support binary dependencies, so using |
Thanks for looking into all of those details. They are indeed heavier than what I originally imagined with a
Not just easier but more transparent - when you're looking after 500 rust crates it is hard to manually review every one to see if it is using ad-hoc shell scripts as part of the build or not. If
To be clear, the second option does not directly satisfy the policy as-is, the Debian packager will have to add a crate-specific tweak to the Debian build, that deletes the generated code then calls the shell script to regenerate it. If you establish a convention for doing so, then this tweak could be automated on the Debian side. As for this remaining |
I personally prefer using shell scripts for this sort of thing. Python will work in more environments, but it's not clear how useful that it is for a rare code generation script. See also BurntSushi#11 for a bit more discussion.
This brings in the Unicode 12.1.0 update from regex. ucd-generate relies on this for its `dfa` and `regex` sub-commands. See also BurntSushi#11.
These have been moved to regex-cli and now use regex-automata 0.3: https://github.com/rust-lang/regex/blob/master/regex-cli/README.md#example-serialize-a-dfa This also breaks the cyclic dependency where updating to a new Unicode version for bstr required the following: * Run ucd-generate to update regex-syntax tables. * Publish new regex-syntax. * Update ucd-generate lockfile to bring in new regex-syntax. * Build new ucd-generate binary. * Run ucd-generate to update bstr regexes. Namely, that last step requires updating regex-syntax in order to propagate the Unicode updates into the regex engine. The new process is: * Run ucd-generate to update regex-syntax tables. * Build regex-cli (also in the regex crate repo). * Run regex-cli to update bstr regexes. So now we don't have to do this weird dance where we loop back around to build a new version of ucd-generate. ucd-generate does still depend on `regex` at the moment via `ucd-parse`, but this doesn't need updating when a new version of Unicode comes out. Still, I'm going to explore breaking that dependency as well via `regex-lite`. ucd-generate also still depends on `ucd-util` which also has Unicode data embedded into it. I'm going to look into fixing that by requiring the caller to pass in the data tables. Fixes #11
This breaks a dependency where `ucd-util` dependend on running ucd-generate to produce the JAMO_SHORT_NAME table. Instead, we now require the caller to provide the table. Fixes #11
These have been moved to regex-cli and now use regex-automata 0.3: https://github.com/rust-lang/regex/blob/master/regex-cli/README.md#example-serialize-a-dfa This also breaks the cyclic dependency where updating to a new Unicode version for bstr required the following: * Run ucd-generate to update regex-syntax tables. * Publish new regex-syntax. * Update ucd-generate lockfile to bring in new regex-syntax. * Build new ucd-generate binary. * Run ucd-generate to update bstr regexes. Namely, that last step requires updating regex-syntax in order to propagate the Unicode updates into the regex engine. The new process is: * Run ucd-generate to update regex-syntax tables. * Build regex-cli (also in the regex crate repo). * Run regex-cli to update bstr regexes. So now we don't have to do this weird dance where we loop back around to build a new version of ucd-generate. ucd-generate does still depend on `regex` at the moment via `ucd-parse`, but this doesn't need updating when a new version of Unicode comes out. Still, I'm going to explore breaking that dependency as well via `regex-lite`. ucd-generate also still depends on `ucd-util` which also has Unicode data embedded into it. I'm going to look into fixing that by requiring the caller to pass in the data tables. Fixes #11
This breaks a dependency where `ucd-util` dependend on running ucd-generate to produce the JAMO_SHORT_NAME table. Instead, we now require the caller to provide the table. Fixes #11
Hi, can you please clarify why ucd-util contains files generated by ucd-generate, but then ucd-generate depends on ucd-util? This makes it hard for us to package things in Debian where the FTP masters are being very strict about where data (and "source" code) is ultimately coming from.
Presumably the data in ucd-util was generated using some particular version of Unicode, which locks ucd-generate to that particular version of Unicode?
The text was updated successfully, but these errors were encountered: