Cyclic build dependency #11

Closed
infinity0 opened this issue Jul 20, 2019 · 24 comments · Fixed by #56

Comments

@infinity0

infinity0 commented Jul 20, 2019

Hi, can you please clarify why ucd-util contains files generated by ucd-generate, but then ucd-generate depends on ucd-util? This makes it hard for us to package things in Debian where the FTP masters are being very strict about where data (and "source" code) is ultimately coming from.

Presumably the data in ucd-util was generated using some particular version of Unicode, which locks ucd-generate to that particular version of Unicode?

@infinity0
Author

Might be related to what #10 was talking about.

@infinity0
Author

@cuviper Perhaps you could also help since you wrote 9adb885

  • ucd-generate (the top-level crate) depends on the lower crates ucd-*
  • these lower crates ucd-* contain autogenerated files, which are generated by scripts/generate.py, which calls ucd-generate, completing the build-dependency cycle

At no point in this process do I point any script to the location of any data I downloaded from unicode.org. So I'm confused about where the actual source data is coming from.

@BurntSushi
Owner

Interesting. I didn't really think about build cycles here, but I can at least try to explain what's going on. The cycle doesn't seem problematic to me, but I don't know what your specific policies are here.

First and foremost, pretty much every command in ucd-generate takes a Unicode data directory as an argument. The README has an example, which shows the "download the data" step: https://github.com/BurntSushi/ucd-generate#example The data comes straight from unicode.org and ucd-generate slurps it up as-is. (Note that there are some poor failure modes around this that I've been meaning to fix, see #9.)

Every file generated by ucd-generate contains a header with the command used to generate that file. For example: https://github.com/rust-lang/regex/blob/db27dcb38e4445131a4824d4a8c84c7d5ccdaa21/regex-syntax/src/unicode_tables/script.rs#L3
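
To make the provenance header concrete, here is a minimal sketch of pulling the recorded command back out of a generated file. It assumes the command sits on its own `//` comment line beginning with `ucd-generate `; the sample header, its path and its flag are illustrative only, and `recorded_command` is a hypothetical helper, not part of any crate mentioned here.

```rust
/// Returns the `ucd-generate ...` command recorded in a generated file's
/// leading comment block, if one is present.
///
/// Assumption: the command is on its own `//` comment line and starts
/// with the literal text `ucd-generate `.
fn recorded_command(source: &str) -> Option<&str> {
    source
        .lines()
        // Only scan the leading comment header, not the table body.
        .take_while(|line| line.trim_start().starts_with("//") || line.trim().is_empty())
        // Strip the comment marker and surrounding whitespace.
        .map(|line| line.trim_start().trim_start_matches('/').trim())
        // The first matching line is taken to be the command.
        .find(|text| text.starts_with("ucd-generate "))
}

/// Illustrative header only; the path and flag are made up for the example.
const SAMPLE: &str = "\
// DO NOT EDIT THIS FILE. IT WAS AUTOMATICALLY GENERATED BY:
//
//  ucd-generate script /tmp/ucd-12.1.0/ --chars
//
// ucd-generate is available on crates.io.

pub const BY_NAME: &[(&str, &[(char, char)])] = &[];
";
```

Calling `recorded_command(SAMPLE)` yields the command line, which an auditor could compare against the UCD directory and tool version they expect.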

As for scripts/generate.py, I forgot that even existed. I'd probably prefer that written as a shell script, which I'll just try to do now, since it'd be good to upgrade everything to Unicode 12. As for the data, it looks like the UCD data dir is given as an argument to that script:

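import argparse

# The UCD data directory is a required positional argument.
p = argparse.ArgumentParser()
p.add_argument('ucd', metavar='DIR', nargs=1)
args = p.parse_args()
ucd = args.ucd[0]  # nargs=1 yields a one-element list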

Finally, the only data file generated by ucd-generate that is used by the ucd-* crates is the tiny JAMO_SHORT_NAME table, which is used by ucd-util to implement name generation of Hangul codepoints. There are more tables in the ucd-* crate source code, but all of them are for either benchmarks or tests.

@BurntSushi
Owner

If the JAMO_SHORT_NAME table is a serious problem, then I could issue a breaking change release to ucd-util (not a big deal) that changes the API to require the caller to provide the table instead of baking it in.

However, if the tables generated for tests/benchmarks are a problem, then that's a bit more thorny to resolve...

@BurntSushi
Owner

Additionally, ucd-generate depends on regex, which depends on regex-syntax, which in turn depends on output from ucd-generate.
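
The conceptual cycle described in this thread can be checked mechanically. The following is a small sketch, assuming the dependency edges are written out by hand (including the "generated by" edges that Cargo never sees). `find_cycle` and `example_graph` are hypothetical names for this illustration, and the search is a plain depth-first walk without memoisation, which is fine for a handful of crates.

```rust
use std::collections::HashMap;

type Graph<'a> = HashMap<&'a str, Vec<&'a str>>;

/// Depth-first walk that tracks the current path and reports the first
/// loop it closes.
fn visit<'a>(graph: &Graph<'a>, node: &'a str, path: &mut Vec<&'a str>) -> Option<Vec<&'a str>> {
    if let Some(pos) = path.iter().position(|&seen| seen == node) {
        // `node` is already on the current path: close the loop.
        let mut cycle = path[pos..].to_vec();
        cycle.push(node);
        return Some(cycle);
    }
    path.push(node);
    for &dep in graph.get(node).into_iter().flatten() {
        if let Some(cycle) = visit(graph, dep, path) {
            return Some(cycle);
        }
    }
    path.pop();
    None
}

/// Returns a cycle reachable from `start` as a path that begins and ends
/// with the same crate name, or `None` if there is none.
fn find_cycle<'a>(graph: &Graph<'a>, start: &'a str) -> Option<Vec<&'a str>> {
    visit(graph, start, &mut Vec::new())
}

/// Edges as described in the discussion; "depends on" here includes
/// "contains output generated by".
fn example_graph() -> Graph<'static> {
    HashMap::from([
        ("ucd-generate", vec!["regex", "ucd-util"]),
        ("regex", vec!["regex-syntax"]),
        ("regex-syntax", vec!["ucd-generate"]), // committed generated tables
        ("ucd-util", vec!["ucd-generate"]),     // JAMO_SHORT_NAME table
    ])
}
```

Starting from `ucd-generate`, the walk reports the loop through `regex` and `regex-syntax` first because that edge is listed first.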

BurntSushi added a commit that referenced this issue Jul 21, 2019
I personally prefer using shell scripts for this sort of thing. Python
will work in more environments, but it's not clear how useful that is
for a rare code generation script.

See also #11 for a bit more discussion.
BurntSushi added a commit that referenced this issue Jul 21, 2019
This brings in the Unicode 12.1.0 update from regex. ucd-generate relies
on this for its `dfa` and `regex` sub-commands.

See also #11.
@BurntSushi
Owner

Also, another potentially confusing thing here: the script that bstr uses doesn't actually write any Unicode tables. Instead, it writes Unicode finite state machines, which are generated by regexes. The simple regexes are embedded in that script, and those regexes in turn depend on the Unicode data tables from regex-syntax. The more complicated regexes are in separate files and derived from the corresponding Unicode Annex (UAX #29 for grapheme/word/sentence segmenters).

@infinity0
Author

Hm thanks for the info, it will take me some time to review it and see how it affects us.

In general though, it may be wise to work towards getting rid of the pattern of "running ucd-generate and then embedding the output as source code" in some other crate (e.g. bstr that we were talking about earlier), because you lose the information about which version of ucd-generate generated that output and hence which version of Unicode you were using. So you have random crate A assuming Unicode 11 whereas random crate B assumes Unicode 12, which might cause very very strange and hard-to-debug bugs.

@BurntSushi
Owner

Can you suggest an alternative? Pretty much everything I've seen that uses Unicode tables does stuff like this. In many cases, it's just a Python script, but I wrote ucd-generate to standardize and centralize this stuff for the Rust ecosystem. If I can't do codegen then I'm not sure what else can be done.

I also don't understand why this is suddenly an issue now. Is there a new policy? Nothing substantive has changed on my end for at least a year.

@infinity0
Author

An alternative could be to bundle all possible outputs of ucd-generate as a big blob of data into a crate that also has an API to access or query this data, or to perform common operations on it. The crate's version could then mirror the version of Unicode it is following. ucd-generate could remain as a private tool invoked to build this data, but all public access via other consumers would be through that API. bstr and other crates would then depend on this crate via a normal cargo dependency rather than an implicit developer-time dependency. (Though I'm not sure how feasible this is, perhaps the data would be too big.)

Nothing has changed on our end, it's just that whenever a new crate gets uploaded to Debian it has to be reviewed, and this time around the reviewer caught this issue. If the issue already existed in previous versions, it's probably just the case that previous reviewers overlooked it. There are 500+ Rust crates in Debian now, so it's understandable that those volunteer reviewers overlooked some issues. We were also somewhat in a rush to bundle cargo for the previous Debian stable release, and this might have contributed to overlooking things like this.

@BurntSushi
Owner

BurntSushi commented Jul 21, 2019

Yeah... That is not going to work. At all. Not only would all possible combinations of the data be waaaaaay too big, but much of the data is purpose built for specific use cases.

I also don't quite understand how your suggestion is materially different from the status quo. I mean, there is no way that I can see to break the cycle between regex/regex-syntax and ucd-generate. They feed into each other.

Let's try a different tack. Can you explain why this setup is problematic for Debian? I frankly still don't get it. I know you've sent me links to stuff before, but they don't contain any context and are generally opaque to me. Like all that's happening here is translating data from a structured format into code or finite state machines. The result is committed so that builds do not need to regenerate it. There's nothing particularly exotic happening here, and it's pretty darn standard when working with Unicode. ICU does the same stuff, for example. So does the Rust standard library, Python and numerous other things.

@infinity0
Author

A softer alternative would be to:

  1. tie a particular version of ucd-generate to a particular unicode version (or version of source data)
  2. recommend that consumers run ucd-generate in build.rs and declare the specific version as a [build-dependency]

Then it would be easy to see if your dependency tree was using different versions of Unicode. On top of that, Cargo would enforce acyclicity.

I mean, there is no way that I can see to break the cycle between regex/regex-syntax and ucd-generate. They feed into each other.

Well, how did you originally set up the cycle? It must have been not-a-cycle at some point in the past. [*]

Can you explain why this setup is problematic for Debian? [..] The result is committed so that builds do not need to regenerate it. [..] ICU does the same stuff, [..]

The problem is this spaghetti relationship of implicit build-dependencies on unknown and potentially different versions of the Unicode spec. We need to check where the real source of data/code is coming from for copyright reasons, and this is hard when dependencies are not declared explicitly. The cyclicity is a secondary and less important issue. We can hand wave it away as "oh but it's such a small amount of data" but if it's such a small amount, it should also be possible to fix the issue easily due to my [*] point above.

@BurntSushi
Owner

tie a particular version of ucd-generate to a particular unicode version (or version of source data)

That seems a bit weird to me. With the exception of the dfa and regex sub-commands, ucd-generate is generally agnostic about which Unicode version it uses. You can still feed it the UCD data directory from Unicode 10.0.0 for example, and it will generate tables for that version of Unicode. All ucd-generate does is take the Unicode data directory as input, parse it, transform it and print it as Rust code.
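
As a rough illustration of the "parse it, transform it and print it as Rust code" pipeline, here is a sketch that reads one range line in the `Scripts.txt` style and renders ranges as a Rust table. Both function names are hypothetical and the emitted layout is made up for the example; it is not ucd-generate's actual output format.

```rust
/// Parses a UCD range line such as
/// `0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..`
/// into `(start, end, property_value)`. Comment-only lines yield `None`.
fn parse_range_line(line: &str) -> Option<(u32, u32, &str)> {
    // Everything after `#` is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, value) = data.split_once(';')?;
    let range = range.trim();
    // A single codepoint is treated as a one-element range.
    let (start, end) = match range.split_once("..") {
        Some((a, b)) => (a, b),
        None => (range, range),
    };
    Some((
        u32::from_str_radix(start, 16).ok()?,
        u32::from_str_radix(end, 16).ok()?,
        value.trim(),
    ))
}

/// Renders inclusive codepoint ranges as Rust source text.
/// The layout is illustrative, not the real generator's format.
fn emit_range_table(name: &str, ranges: &[(u32, u32)]) -> String {
    let mut out = format!("pub const {}: &'static [(u32, u32)] = &[\n", name);
    for &(start, end) in ranges {
        out.push_str(&format!("    (0x{:04X}, 0x{:04X}),\n", start, end));
    }
    out.push_str("];\n");
    out
}
```

Nothing in this sketch knows which Unicode version it is handling; the version is determined entirely by the directory the input lines came from, which is the point being made above.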

Now, the dfa and regex sub-commands are slightly different because they are the only sub-commands that don't take a UCD data directory. Instead, they rely on whatever Unicode version is being used by the regex-automata crate, which is in turn whatever Unicode version is used by the regex-syntax crate. I added these sub-commands to ucd-generate because it was the most natural place for them to go, but if the presence of these sub-commands makes auditing provenance harder, then I can move them to their own dedicated tool in the regex-automata repository.

recommend that consumers run ucd-generate in build.rs and declare the specific version as a [build-dependency]

This would imply that every compilation of regex would also need ucd-generate installed (or build it itself) and it would need to re-generate the tables every time it was compiled. That doesn't seem much (if any) softer to me. It's a non-starter. :-/

Then it would be easy to see if your dependency tree was using different versions of Unicode.

How do you deal with this in other ecosystems? For example, RE2 just updated their tables to Unicode 12.1, and it effectively uses the same approach as regex does. The only difference is that it uses its own homegrown special purpose code generator written in Python, and it will download the Unicode data for you.

Well, how did you originally set up the cycle? It must have been not-a-cycle at some point in the past.

The regex-syntax crate used to have its own custom Python script that built its Unicode tables, not unlike how RE2 does it in the aforementioned link. So that's how the cycle was set up, since regex-syntax originally did not rely on ucd-generate. ucd-generate came later.

Note that other than the dfa and regex commands, ucd-generate does not, AFAIK, rely on any particular Unicode version established by the regex-syntax crate. It just uses it for simple parsing of UCD data.

The problem is this spaghetti relationship of implicit build-dependencies on unknown and potentially different versions of the Unicode spec. We need to check where the real source of data/code is coming from for copyright reasons, and this is hard when dependencies are not declared explicitly. The cyclicity is a secondary and less important issue. We can hand wave it away as "oh but it's such a small amount of data" but if it's such a small amount, it should also be possible to fix the issue easily due to my [*] point above.

But they aren't build dependencies though. You don't need to generate the Unicode tables to build this stuff, because the Unicode tables are meant to be committed to source control. Other than that quibble, I'm not sure I see the spaghetti here. If what I'm doing is spaghetti, then what everyone else is doing is even worse, because they are just hacked together Python scripts:

If you look carefully at the Python scripts above, you can see that a lot of them clearly share some ancestry. What effectively happened is that the same Python script was copied around to the various Unicode crates, tweaked for a new purpose, and then used to generate code based on Unicode tables. This is exactly how ucd-generate was intended to work, because I saw this unmaintainable pattern of a hodge podge of Python scripts and decided to replace it with something a bit more principled.

Things got a bit muddled with the addition of the dfa and regex sub-commands, since as I mentioned above, those are the only commands that don't use the UCD data directory since they rely on the regex-syntax machinery to expand Unicode properties. But other than that, ucd-generate is itself mostly agnostic of the Unicode version. There are some parts that are tied to a Unicode version that don't have anything to do with Unicode data tables. For example, the various routines in ucd-util implement algorithms, and those algorithms are described by the Unicode standard (or its supplementary materials).

So I'm not sure what to do here. It sounds like what you want is clearer provenance, and it seems to me like there should be much simpler solutions than what you've proposed thus far. Brainstorming:

  • I could start a new convention with a document in each repository that specifies the Unicode version used for the tables it embeds.
  • I could cause the various generate-* scripts to download a specific version of the UCD database for you, instead of requiring you to do it yourself.

@BurntSushi
Owner

BurntSushi commented Jul 21, 2019

Also, I'd like to repeat that I'm still flying blind here. Is there a document that lays out the policy that the FTP masters are enforcing? (The extent of my ignorance is deep. I don't even know what an "FTP master" is.)

@infinity0
Author

infinity0 commented Jul 21, 2019

Why

But they aren't build dependencies though. You don't need to generate the Unicode tables to build this stuff, because the Unicode tables are meant to be committed to source control. Other than that quibble, I'm not sure I see the spaghetti here.

"committed to source control" is an undeclared dependency, on:

A. an unknown version of ucd-generate
B. an unknown version of Unicode data

What you're doing is like building some minified JavaScript, committing it to source control, and using it as an opaque piece of data during the build. That will work around technical restrictions around dependencies and ordering, but the FOSS requirements of Debian policy (which are really just moderately-strong auditing requirements derived from legal requirements) still require us to track it, and the solution here in Debian is usually to (1) delete the generated files and (2) re-run the generation command. However, this is not possible here due to the cycles.

Your example about RE2 is different. We can do both (1) and (2) for RE2 and the output of RE2 is (to my knowledge) not used to build generated "source" code for other packages. Or if it is, I'd guess it's done in a way that could be replaced by any other regex implementation and is not specific to that version of Unicode. That's not the case with ucd-generate/bstr where the FSMs are very Unicode-specific and clearly depend on particular versions of (A) and (B).

If what I'm doing is spaghetti, then what everyone else is doing is even worse, because they are just hacked together Python scripts [..] I saw this unmaintainable pattern of a hodge podge of Python scripts and decided to replace it with something a bit more principled.

The hodge podge is bad but different from "spaghetti" which refers to cycles of dependencies that need to be manually untangled. I applaud your effort to factor away the hodge podge but it's important not to introduce other issues at the same time...

Is there a document that lays out the policy that the FTP masters are enforcing?

There is no document that goes into the situation in as much detail as I am doing, so I don't think you'll be convinced by any of the existing documents. "clearer provenance" basically sums it up however. For the purposes of this discussion you can read "FTP master" as "copyright auditor" that you might find in the legal department of a company.


How

Cycles

Your suggestion to move the dfa and regex subcommands into regex-automata makes sense and IMO should be done, since then ucd-generate can indeed truly become independent of the Unicode version (modulo the below minor issue about jamo-short-name) and break the cycle here. If I understand correctly, this would also result in bstr using these regex-automata tools rather than ucd-generate directly.

the only data file generated by ucd-generate that is used by the ucd-* crates is the tiny JAMO_SHORT_NAME table

After looking at this code in a bit more detail, it would be more correct to read these names from <ucd-dir>/Jamo.txt i.e. the same directory as the rest of the unicode data that ucd-generate is reading, instead of from a pre-generated table. That would get rid of this cycle.
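
For reference, reading the short names straight from `Jamo.txt` is a small job. The sketch below assumes the usual UCD layout of `codepoint; short name # comment` per line; `parse_jamo_line` and `parse_jamo_short_names` are hypothetical helpers and do not reflect how ucd-parse is implemented.

```rust
/// Parses one `Jamo.txt` line into `(codepoint, short_name)`.
/// Comment-only and blank lines yield `None`. Note that the short name
/// may legitimately be empty (as for U+110B).
fn parse_jamo_line(line: &str) -> Option<(u32, &str)> {
    // Drop the trailing `# ...` comment, if any.
    let data = line.split('#').next()?;
    let (codepoint, short_name) = data.split_once(';')?;
    let codepoint = u32::from_str_radix(codepoint.trim(), 16).ok()?;
    Some((codepoint, short_name.trim()))
}

/// Builds a table sorted by codepoint, suitable for binary search.
fn parse_jamo_short_names(text: &str) -> Vec<(u32, &str)> {
    let mut table: Vec<(u32, &str)> = text.lines().filter_map(parse_jamo_line).collect();
    table.sort_by_key(|&(cp, _)| cp);
    table
}
```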

Are there any more cycles? Wouldn't these two fixes break all of them?

Provenance

It sounds like what you want is clearer provenance, and it seems to me like there should be much simpler solutions than what you've proposed thus far. Brainstorming:

  • I could start a new convention with a document in each repository that specifies the Unicode version used for the tables it embeds.
  • I could cause the various generate-* scripts to download a specific version of the UCD database for you, instead of requiring you to do it yourself. [..]

With the caveat that the default download version simply equals the "new convention" version, this would be a more-or-less equivalent but manual version of the build.rs stuff I suggested previously, so why not just reuse Cargo.toml's dependency syntax and automate the process? To save effort, build.rs could simply call python generate.py or whatever manual command you would have run.

Re-generating in build.rs doesn't take that long. You could feature-gate it so that it's optional, just like how e.g. C-library rust crates often have a "vendor" feature to let the user choose between the system version or the vendored version. In this case the feature could select between regenerating-from-scratch or reusing the pregenerated version.
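
A feature-gated build script along these lines might look roughly like the sketch below. The feature name `regenerate-tables`, the `TableSource` enum and `table_source` are all hypothetical; the only Cargo behaviour relied on is that an enabled feature is exposed to build scripts as a `CARGO_FEATURE_<NAME>` environment variable. The environment lookup is passed in as a closure so the decision can be exercised outside of Cargo.

```rust
use std::ffi::OsString;

#[derive(Debug, PartialEq)]
enum TableSource {
    /// Use the tables already committed to the repository.
    Pregenerated,
    /// Re-run the generator and write fresh tables into `OUT_DIR`.
    Regenerate,
}

/// Decides what the build script should do. `lookup` stands in for
/// `std::env::var_os`.
fn table_source(lookup: impl Fn(&str) -> Option<OsString>) -> TableSource {
    // Hypothetical feature `regenerate-tables`; Cargo upper-cases the
    // name and replaces `-` with `_` when exporting it.
    if lookup("CARGO_FEATURE_REGENERATE_TABLES").is_some() {
        TableSource::Regenerate
    } else {
        TableSource::Pregenerated
    }
}
```

In an actual `build.rs`, `main` would call `table_source(|key| std::env::var_os(key))` and only invoke the generator in the `Regenerate` case, so default developer builds keep using the committed files.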

Provenance doesn't just mean "it is potentially theoretically possible for a human to spend a long time to figure out how everything was built", it means being able to do this cheaply and quickly, and that means being able to run it in CI automatically.

edit (addendum): Since we're making ucd-generate fully independent of the Unicode version, a build.rs-based approach would have to select the version of the Unicode data. This could be done by e.g. build-depending on a specific version of a unicode-data-sys crate that tracks this data.

@BurntSushi
Owner

BurntSushi commented Jul 21, 2019

Your suggestion to move the dfa and regex subcommands into regex-automata makes sense and IMO should be done, since then ucd-generate can indeed truly become independent of the Unicode version (modulo the below minor issue about jamo-short-name) and break the cycle here. If I understand correctly, this would also result in bstr using these regex-automata tools rather than ucd-generate directly.

That's correct. It will be mildly annoying to do this, but if it solves your problem I'm happy to do it. It also probably makes more sense in the grand scheme of things. The current situation was a marriage of convenience, since ucd-generate already has a little infrastructure built up for writing out the codegen.

Note though that ucd-generate would still have a dependency on regex. It just wouldn't rely on the specific Unicode version embedded into regex for its output.

After looking at this code in a bit more detail, it would be more correct to read these names from <ucd-dir>/Jamo.txt i.e. the same directory as the rest of the unicode data that ucd-generate is reading, instead of from a pre-generated table. That would get rid of this cycle.

Reading from Jamo.txt is done by ucd-parse, and I do not want to add a dependency on ucd-parse from ucd-util. I could just re-roll the parser of Jamo.txt inside of ucd-util, but that just seems crazy to me when ucd-generate can already do it.

I'll instead probably just change ucd_util::hangul_name to require the caller to pass in the JAMO_SHORT_NAME table, such that the caller will be responsible for using ucd-generate to put that table into their crate. And then put out a breaking change release. I don't think anyone uses the crate except for me, so it's not a big deal. (ucd-generate uses ucd_util::hangul_name, so ucd-generate will just need to generate the table in memory in order to call this routine. Which I guess is fine and easy to do.)
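
To show what a caller-supplied table could look like in practice, here is a sketch of Hangul syllable name generation following the arithmetic decomposition in the Unicode Standard. The signature is an assumption made for illustration and is not the actual `ucd_util::hangul_name` API; `JAMO_EXCERPT` holds only the five entries the examples need, whereas a real caller would pass the full generated table.

```rust
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = 19 * N_COUNT; // 11172 precomposed syllables

/// Looks up a jamo's short name in a table sorted by codepoint.
fn short_name<'a>(table: &[(u32, &'a str)], jamo: u32) -> Option<&'a str> {
    table
        .binary_search_by_key(&jamo, |&(cp, _)| cp)
        .ok()
        .map(|i| table[i].1)
}

/// Hypothetical variant of a Hangul name routine in which the caller
/// provides the jamo short-name table instead of it being baked in.
/// Returns `None` for non-syllables or if the table lacks an entry.
fn hangul_name_with_table(cp: u32, table: &[(u32, &str)]) -> Option<String> {
    let s_index = cp.checked_sub(S_BASE).filter(|&i| i < S_COUNT)?;
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;

    let mut name = String::from("HANGUL SYLLABLE ");
    name.push_str(short_name(table, l)?);
    name.push_str(short_name(table, v)?);
    // An index of zero means there is no trailing consonant.
    if t != T_BASE {
        name.push_str(short_name(table, t)?);
    }
    Some(name)
}

/// Excerpt only: just the entries needed for U+AC00 and U+D4DB.
const JAMO_EXCERPT: &[(u32, &str)] = &[
    (0x1100, "G"),
    (0x1111, "P"),
    (0x1161, "A"),
    (0x1171, "WI"),
    (0x11B6, "LH"),
];
```

With this shape the library itself carries no generated data, and the table's provenance becomes the responsibility of whichever crate embeds it.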

Are there any more cycles? Wouldn't these two fixes break all of them?

I believe so, yes. So long as "cycle" here refers to a conceptual cycle. There will still be the cycle that ucd-generate depends on regex which depends on regex-syntax which has its tables generated by ucd-generate. But as I mentioned above, this isn't for regex's Unicode support. It's just for parsing the UCD data files.

With the caveat that the default download version simply equals the "new convention" version, this would be a more-or-less equivalent but manual version of the build.rs stuff I suggested previously, so why not just reuse Cargo.toml's dependency syntax and automate the process?

I don't see much reason to put this stuff into build.rs when a shell script will do just fine. None of the other crates in the ecosystem seem to have this requirement (like shelling out to a Python script), so I don't see the reason to do it here. There's just absolutely no way I'm going to attach a dependency (even if it's optional) on ucd-generate from something like the regex crate. Let's please take this off the table. Even if other approaches seem equivalent to this one, they aren't, because this one ties the dependencies into the Cargo build system for zero gain (unless you're a lawyer) when compared to other approaches, as far as I can see. EDIT: I am trying hard to keep dependencies to a minimum, which is super difficult, and this would just exacerbate it. Moreover, I'm not even sure it's possible: can Cargo have a build-dependency on something that transitively depends on itself?

Provenance doesn't just mean "it is potentially theoretically possible for a human to spend a long time to figure out how everything was built", it means being able to do this cheaply and quickly, and that means being able to run it in CI automatically.

Right, that makes sense. Why isn't a combination of shell scripts and better documentation enough for this? Running it in CI is also something I could potentially get on board with, but sometimes the codegen takes a long time, and I'm not sure what I'd do with the result. I suppose I could just run the codegen scripts and assert that there is no error. Maybe something more elaborate would diff the output with the pre-committed files.
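
The "diff the output with the pre-committed files" idea could be as small as the following sketch, which a CI job might run after invoking the generation script. `first_mismatch` is a hypothetical helper; it compares line by line, so a difference that is only a trailing newline is ignored.

```rust
/// Returns the 1-based number of the first line where the two texts
/// differ, or `None` when they match line for line.
fn first_mismatch(committed: &str, regenerated: &str) -> Option<usize> {
    let mut a = committed.lines();
    let mut b = regenerated.lines();
    let mut line_no = 1;
    loop {
        match (a.next(), b.next()) {
            // Both inputs ended together: no difference found.
            (None, None) => return None,
            // Different content, or one input ended early.
            (x, y) if x != y => return Some(line_no),
            _ => line_no += 1,
        }
    }
}
```

A CI step would fail the build when this returns `Some(_)` for any committed table, which catches both stale tables and an unexpected Unicode version in the input directory.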

Re-generating in build.rs doesn't take that long.

Yes it does. We're talking minutes here. From the bstr repository root:

$ time ./scripts/generate-unicode-data
generating forward grapheme DFA
generating reverse grapheme DFA
generating forward sentence DFA (this can take a while)
generating forward word DFA (this can take a while)
generating regional indicator DFA
generating forward simple word DFA
generating forward whitespace DFA
generating reverse whitespace DFA

real    1:52.11
user    1:51.90
sys     0.132
maxmem  116 MB
faults  0

Not everything takes this long. Simple Unicode tables are generally pretty quick, but it's still on the order of a few seconds. In addition to needing to compile ucd-generate, the build time for a lot of crates would increase substantially when this feature was enabled.

edit (addendum): Since we're making ucd-generate fully independent of the Unicode version, a build.rs-based approach would have to select the version of the Unicode data. This could be done by e.g. build-depending on a specific version of a unicode-data-sys crate that tracks this data.

I grant that this sounds nice, and if I were otherwise okay with the build.rs approach, then yeah, this probably makes sense. Although, setting up the unicode-data-sys crate would be a tedious pain and/or require more tooling, especially when updating it to a new Unicode version.

@infinity0
Author

infinity0 commented Jul 21, 2019

I am trying hard to keep dependencies to a minimum, which is super difficult, and this would just exacerbate it. Moreover, I'm not even sure it's possible: can Cargo have a build-dependency on something that transitively depends on itself?

The conceptual dependency already exists whether it's undeclared or declared, all I'm saying is there should at least exist an option (via features) to build it in the explicitly declared form. Cargo can support dependency chains like { ucd-generate -> regex -> regex-syntax/no-default-features, regex-syntax/unicode_tables -> ucd-generate } which on a first glance into the regex-syntax crate, seems like it might be possible to do.

Why isn't a combination of shell scripts and better documentation enough for this?

Debian runs all builds in CI, and for generated code like regex-syntax/src/unicode_tables/* we really prefer to generate all of that from scratch on every build, from the actual source code (i.e. ucd-generate and Unicode data). If this is done using Cargo's build.rs mechanism we don't have to hook up anything special, the standard Debian cargo integration stuff will Just Work. If it is done via shell scripts we have to set all of that up manually for each crate. On top of that, if you are suggesting that everyone in the Rust ecosystem should use these sorts of ad-hoc shell scripts to run ucd-generate, this will exacerbate this issue since we will have to call these scripts manually everywhere. (I think right now we are already neglecting to call all of those Python scripts you linked, which is not where we want to be as a FOSS project.)

We're talking minutes here.

This is why I suggested using features to gate what build.rs does. I do appreciate the gain in having this stuff be pre-generated for developer builds. When running in CI on a build farm however, this is not such a big deal, and the other factors I mentioned are more important.

@BurntSushi
Owner

What I'm hearing is that hooking ucd-generate up to build dependencies is a nice-to-have and not a requirement, so I'm not going to pursue that. There are tons of crates in the ecosystem that use code generation like this, even beyond Unicode stuff. The encoding and encoding_rs crates are examples. It seems to me like if you really need to be able to regenerate code like this, then you're going to need a bespoke system to do it anyway, unless you successfully convince everyone in the entire Rust ecosystem to do code generation through build.rs.

@BurntSushi
Owner

And then there are probably oodles more examples that use bindgen to generate Rust FFI bindings for C code. bindgen can be hooked up to run in build.rs, but plenty of people don't do that, including myself. For example, the pcre2 crate has code generation from bindgen committed to the repo.

@infinity0
Author

These are all anti-patterns that make life harder for packagers that are trying to follow provenance best practises, and effectively "cheat" the crate into passing reproducible builds. Yes, we are trying to convince everyone in the entire Rust ecosystem to do code generation and everything-else-generation through build.rs. I would've thought everyone was already doing it, since it is the principled thing to do.

It is a requirement for Debian policy and if the crate does not do it, it makes our lives harder and the crate slower to package for Debian. Due to limited volunteer resources, sometimes stuff in breach of this makes its way into the Debian archive, but this should not be interpreted as "Debian policy allows it".

@BurntSushi
Owner

This whole ordeal has just been super frustrating. We obviously started with different definitions of what is "principled." Moreover, the entire Debian policy here has been terribly opaque to me, with requirements and nice-to-haves and processes being completely unclear. A lot of this stuff seems like it is just coming from out of thin air. For my part, other than the cycle issues, I generally think I am being principled about this. Plenty of companies have used my Rust code, so it has presumably passed through their legal departments without issue since this is the first I'm hearing of this provenance issue. So from my perspective, this is looking to me like Debian has a terribly inconvenient policy. I don't think you get to hide behind "that's what a good FOSS project does." There are plenty of good FOSS projects that don't do this.

What you're asking for is both a lot of work and a complication in how things get built, in addition to long standing bugs in Cargo that would likely impact your suggested solution.

I understand why doing this in build.rs would be nicer for you. I can appreciate why it is more principled, and even agree with that perspective. But principles are not the only thing that matters.

@infinity0
Author

Everything I've said so far is a very natural derivation from (1) the definition that source code == preferred form of modification, and (2) "FOSS" refers to source code, not generated code. This is hardly "opaque" or "hiding". This for example is why the GPL explicitly refers to build scripts, because it would be a loophole if a user can look at the purported source code but can't easily build it into something executable.

The Debian project is indeed more strict than other projects about it, including companies with legal departments. I am at a loss as to why you think the obvious and natural suggestion that "generated code be actually-generated" by the end user for FOSS verification purposes "comes out of thin air".

I don't know exactly what you mean by "plenty of good FOSS projects that don't do this". A project can have world-class software engineering and design in its code and core functionality, and still make mistakes in how it designs its build process, especially when it comes to adhering to FOSS principles. It was (e.g.) a common mistake to check autogenerated ./configure files into source control, but FOSS best practices established across several decades say one shouldn't do it. For the longest time Debian would delete and regenerate these files, and these days we still do it but it's less necessary - most upstream projects are generally fixed. I can appreciate a lot of people in the Rust ecosystem may not have this sort of background, and now I'm trying to explain why these are good things to do. I am not implying the software engineering in the main part of the code is bad.

What I am suggesting is not that much work. I can submit a patch once the other stuff you mentioned previously is done, and you can see how much it is.

This is a Debian "advice" document that explicitly mentions what I am saying, but the same ideas are repeated in slightly different forms in many other places.

@BurntSushi
Owner

Thanks for linking to something. That helps.

Everything I've said so far follows naturally from (1) the definition that source code == preferred form of modification, and (2) "FOSS" refers to source code, not generated code. This is hardly "opaque" or "hiding". This is why, for example, the GPL explicitly refers to build scripts: it would be a loophole if a user could look at the purported source code but couldn't easily build it into something executable.

As I've stated a few times, I am not opposed to better provenance tracking. I am simply strongly skeptical that adding build dependencies for core crates in the ecosystem on something that was built as a tool for maintainers (that is, ucd-generate) is a good idea. I do not want the maintenance burden that is implied by having ucd-generate be a proper dependency of several core crates.

On top of all of that, aside from the cycle problem, I don't see how the status quo in any way contradicts the policies you've described. It sounds to me like your build-dependencies idea is strictly about making packaging easier, as opposed to being a necessary implementation of Debian's policy. It sounds like these things are being conflated. Maybe I'm to blame here, but as I've said, I've been confused for a large part of this conversation. With that said, I do not want to do stuff that makes packaging harder than it needs to be, since I can appreciate the considerable burden y'all have with the number of crates in our ecosystem. But there's another side to this where, ya know, Debian is choosing to do this additional work in accordance with their policies. There obviously has to be some give and take here, and y'all have done a great job with this, I must admit.

I do appreciate your offer to submit a patch, but I not only think you're under-estimating the amount of work involved here (building the unicode-data-sys crate and making it so ucd-generate can interface with it is a chore), but I think you're also under-estimating the maintenance burden moving forward. As I mentioned above, pushing something like ucd-generate into core crates will likely increase the number of bug reports against it. There's a big difference, speaking from experience, between a project that works well in Unix environments for maintainers of Unicode crates, and a project that works well in a variety of environments for anyone who decides to use that feature. Moreover, the manifest interaction between build dependencies, dev dependencies and normal dependencies will become more complex. As I linked in my previous issue, there are bugs here w.r.t. build dependencies that would be critical blockers for crates like bstr, which explicitly support a no_std mode. (Its build dependency would need to enable additional features on regex-automata to do code generation, which would in turn break its use in no_std mode.) On top of all of that is the MSRV situation, which I try to be conservative with in core crates like regex, but there is intentionally no stringent MSRV policy for things like ucd-generate because it's not worth it. This isn't insurmountable, but it's just another layer of complexity to add to all of this.
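To make the no_std concern concrete, here is a minimal Python sketch of feature unification. It assumes the Cargo bugs referred to above are the ones where features requested through a build dependency get merged with those requested through a normal dependency on the same crate. The `resolve_features` function, the request tuples, and the `std` feature name are hypothetical and only model that behavior; this is not how Cargo's resolver is actually implemented.

```python
from typing import Dict, Hashable, List, Set, Tuple

# (dependency kind, crate name, requested features). Hypothetical data that
# models a crate such as bstr needing regex-automata twice: once at run time
# without `std`, and once in build.rs with `std` to do code generation.
Request = Tuple[str, str, Set[str]]


def resolve_features(
    requests: List[Request], unify_across_kinds: bool
) -> Dict[Hashable, Set[str]]:
    """Return the feature set each crate ends up compiled with.

    With unify_across_kinds=True the key is the crate name alone, so build
    and normal requests are merged. With False, each kind keeps its own set.
    """
    resolved: Dict[Hashable, Set[str]] = {}
    for kind, crate, features in requests:
        key = crate if unify_across_kinds else (kind, crate)
        resolved.setdefault(key, set()).update(features)
    return resolved


REQUESTS: List[Request] = [
    ("normal", "regex-automata", set()),    # no_std use at run time
    ("build", "regex-automata", {"std"}),   # code generation in build.rs
]
```

Under unification, the single `regex-automata` entry ends up with `std` enabled, which is what would break a no_std build. With separate resolution per kind, the normal dependency keeps its empty feature set.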

So basically, when I take a step back from all of this and look at it in the grander scheme of things, then we have on the one hand this (IMO, complex/bloated, somewhat of an impedance mismatch) choice to add the code generation step to build.rs, and on the other, we have, "maintainers just need to run this shell script and commit the results." There is a marked difference between these from my perspective, and from my perspective, both avenues appear to satisfy Debian's stringent policy. Namely, at no point would I advocate for a state of affairs that makes code generation impossible. (Again, aside from the cycle issue, which I agree should be fixed. While the cycle issue doesn't make it impossible to re-generate all files, it does make it quite annoying and counter-intuitive.)

I think the other part of the problem I've been having here is probably somewhat emotionally irrational, but it's of the form, "why are you picking on me now, when all these other examples of software packaged by Debian didn't have to go through this." I understand why, as you explained, those other pieces of software just slipped through the review process. But now I'm at a point where it seems like this policy isn't a strict requirement, and it feels to me like I'm having significant extra work foisted on to me because of a Debian policy that I largely think is unnecessary. Namely, for my projects, I do try to make sure that all auto-generated files can be re-generated. And in cases where I've failed there, I'm happy to take some steps to improve that. So there's no essential conflict in the values we share here; we're just differing on the extent to which this process is baked into tools.

@BurntSushi
Owner

Also, AIUI, Cargo does not support binary dependencies, so using build-dependencies seems to imply doing #3.

@infinity0
Author

Thanks for looking into all of those details. They are indeed heavier than what I originally imagined with a build.rs based approach. I might still try making a patch at some point, but I can understand if you don't want to pursue it right now.

It sounds to me like your build-dependencies idea is strictly about making packaging easier, as opposed to being a necessary implementation of Debian's policy.

Not just easier but more transparent: when you're looking after 500 Rust crates, it is hard to manually review every one to see whether it uses ad-hoc shell scripts as part of the build. If ucd-generate is intended to replace all those existing Python scripts, it would be appreciated if you could establish a convention for how it is called, which could potentially one day be automated in the Debian Rust integration scripts.

we have on the one hand this (IMO, complex/bloated, somewhat of an impedance mismatch) choice to add the code generation step to build.rs, and on the other, we have, "maintainers just need to run this shell script and commit the results." There is a marked difference between these from my perspective, and from my perspective, both avenues appear to satisfy Debian's stringent policy.

To be clear, the second option does not directly satisfy the policy as-is: the Debian packager will have to add a crate-specific tweak to the Debian build that deletes the generated code and then calls the shell script to regenerate it. If you establish a convention for doing so, then this tweak could be automated on the Debian side.
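As a rough illustration of such a tweak, the following Python sketch deletes a set of generated files, reruns a generator command, and reports whether the regenerated bytes match what was checked in. The function name `regenerate_and_compare`, the file list, and the command are all hypothetical; this is not an existing Debian tool, and a real packaging hook would more likely live in the package's build rules.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Dict, List


def digest(path: Path) -> str:
    """SHA-256 of a file's bytes, used to compare old and new output."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def regenerate_and_compare(
    generated: List[Path], regen_cmd: List[str]
) -> Dict[str, bool]:
    """Delete each generated file, rerun the generator, and report per file
    whether the regenerated content equals the previously checked-in content.

    `regen_cmd` is a hypothetical command line (for example, the project's
    own regeneration script); it must recreate every path in `generated`.
    """
    before = {p: digest(p) for p in generated}
    for p in generated:
        p.unlink()
    # check=True makes a failing generator raise instead of passing silently.
    subprocess.run(regen_cmd, check=True)
    return {str(p): p.exists() and digest(p) == before[p] for p in generated}
```

A `False` entry in the result means the checked-in file could not be reproduced from the generator, which is the situation the policy is meant to catch.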

As for the remaining ucd-generate - ucd-parse - regex - regex-syntax cycle (which exists even after the dfa and regex subcommands are moved out), I'll have to think some more about how best to deal with that in Debian.
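The cycle can be made explicit with a small depth-first search. This is only a sketch: the `NEEDS` graph is a hand-written, hypothetical model in which an edge means "building or regenerating A needs B", and the last edge stands for the checked-in regex-syntax tables being produced by ucd-generate rather than for a dependency declared in a Cargo manifest.

```python
from typing import Dict, List, Optional

# Hypothetical edge list: "A": ["B"] means building or regenerating A needs B.
NEEDS: Dict[str, List[str]] = {
    "ucd-generate": ["ucd-parse"],
    "ucd-parse": ["regex"],
    "regex": ["regex-syntax"],
    "regex-syntax": ["ucd-generate"],  # generated tables, not a manifest entry
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a node list (first node repeated at the end),
    or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for start in graph:
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None
```

Dropping the regex-syntax edge back to ucd-generate from the model makes `find_cycle` return `None`, which matches the intuition that the loop only closes through the generated tables.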

dingelish pushed a commit to mesalock-linux/ucd-generate-sgx that referenced this issue Aug 7, 2019
I personally prefer using shell scripts for this sort of thing. Python
will work in more environments, but it's not clear how useful that is
for a rare code generation script.

See also BurntSushi#11 for a bit more discussion.
dingelish pushed a commit to mesalock-linux/ucd-generate-sgx that referenced this issue Aug 7, 2019
This brings in the Unicode 12.1.0 update from regex. ucd-generate relies
on this for its `dfa` and `regex` sub-commands.

See also BurntSushi#11.
BurntSushi added a commit that referenced this issue Jul 6, 2023
These have been moved to regex-cli and now use regex-automata 0.3:
https://github.com/rust-lang/regex/blob/master/regex-cli/README.md#example-serialize-a-dfa

This also breaks the cyclic dependency where updating to a new Unicode
version for bstr required the following:

* Run ucd-generate to update regex-syntax tables.
* Publish new regex-syntax.
* Update ucd-generate lockfile to bring in new regex-syntax.
* Build new ucd-generate binary.
* Run ucd-generate to update bstr regexes.

Namely, that last step requires updating regex-syntax in order to
propagate the Unicode updates into the regex engine.

The new process is:

* Run ucd-generate to update regex-syntax tables.
* Build regex-cli (also in the regex crate repo).
* Run regex-cli to update bstr regexes.

So now we don't have to do this weird dance where we loop back around to
build a new version of ucd-generate.

ucd-generate does still depend on `regex` at the moment via
`ucd-parse`, but this doesn't need updating when a new version of
Unicode comes out. Still, I'm going to explore breaking that dependency
as well via `regex-lite`.

ucd-generate also still depends on `ucd-util` which also has Unicode
data embedded into it. I'm going to look into fixing that by requiring
the caller to pass in the data tables.

Fixes #11
BurntSushi added a commit that referenced this issue Jul 6, 2023
This breaks a dependency where `ucd-util` depended on running
ucd-generate to produce the JAMO_SHORT_NAME table. Instead, we now
require the caller to provide the table.

Fixes #11
BurntSushi added a commit that referenced this issue Jul 7, 2023
BurntSushi added a commit that referenced this issue Jul 7, 2023