-
-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion: Emitting code to search tables with complex layouts #39
Comments
I think our sensibilities are pretty well aligned here. (2) is also kind of where I land as well. I also do not like forcing folks to bring in tiny micro-crates. Because of that, it leads me to not want to make extremely complex formats, but that in turn obviously biases against optimizing the heck out of these tables. And I'm generally pretty amenable to doing what we can to optimize them since they tend to be so widely used and foundational. So yeah, I agree. If you wanted to go down this route, I'd support (2). |
Wow, it's great that you're active in this repo again. The timing almost couldn't be better, as I've been playing around with unicode tables very recently (for libcore). Most of these ideas would absolutely require custom code query the tables, though, as is the case for pretty much anything more complex than "binary search". Many of my ideas ended up probably being non-ideal for libcore, where saving a few kilobytes on table size isn't worth a performance regression, as most of the tables it has are used by methods people actually use. But for a case like |
Indeed, |
Right now, some tables are too complex for it to be reasonable to ask users to write the table search code themselves.
Currently, the model we follow for this is to have the table depend on an external crate, for example:
ucd-trie
, orfst
. Lets ignorefst
, because I have only surface level understanding of it (I really should get around to reading the blog post on it...).Anyway the cases I care about are much closer in spirit to TrieSet:
Accessing the data in the trie requires a very specific set of undocumented operations. Even if it were documented, it's fiddly enough that we don't want users to have to write these lines themselves (even asking them to copy/paste them in would imply that they should understand it and are taking over maintenance of it).
We'd also like to think of the trie format as subject to change... (even though adding the statics essentially freezes it, as I'll get into) So we just ask them to take a dep, and ensure it can be as close to zero-cost as possible (
#[no_std]
, functions inline, etc) so that nobody really has an excuse for a reason to not use it.That said, this approach has downsides:
Problems with the separate-crate approach
The main problem is that it "freezes" the format of the trie -- any update that doesn't maintain 100% compatibility with existing tries must be a breaking change, or come in a
TrieSet2
type or something. This might be acceptable in the case of the trie (IMO it's not -- it's not like there the trie is impossible to improve upon), but I don't think it's a good solution overall.Users have to include
ucd-trie
as a dependency, and the dependency is bringing in a single 18 line function (well, and a type definition).This is somewhat petty of me, but there are fairly few cases when I'm comfortable bringing in an external dependency for an 18 line function. Given that
bstr
has a line about "pushing back against the micro-crate ecosystem" in the readme, I suspect you can at least see my reasoning there.That said, the downsides to this (aside from introducing dependency) are subtle, and it's not exactly clear what else to do -- using the language-provided method of code sharing for this is clean and obvious, if nothing else.
Anyway, the first point is what I care more about (or at least, I'd just suck up the second point if that were it).
In particular, I'd like to introduce new formats which are not frozen, and can change without worrying about compatibility with an external crate. Specifically (With most of these, I'd want the data structure exposed to be totally opaque, or potentially just expose a single function):
Even some of the current formats are getting complex (
--split-ranges --separated-values
isn't that hard to write code for, but it's complex enough that you probably need documentation to do so).But even for simple formats, like the slice of (u32, u32), I've gotten reading it it wrong more times than I want to admit![0] This may say more about me as a programmer than anything else, but I would have definitely used an option to spit out the table searching code.
Potential solutions
There's basically a few options here.
1. Stick with the external crate approach.
2. Emit code for searching in the same file.
--minimize
flag that picks whichever format has the smallest output size.Copy
impl in profiler output and similar.3. Have a dedicated option for emitting the shared code in its own module...
3a: ...and have the user be responsible for passing the table and search term into the functions in that module. (This requires the table structure be defined as just tuples, slices, etc. And not require shared type definitions)
3b: ...and have the user pass the name/path of the shared code as an argument when generating the table (e.g. emitting to
unicode_common.rs
and then passing--module-path 'super::unicode_common'
to everything else).The pros/cons for these two are similar:
Pro: No duplicated code.
Con: All the stuff about users having "code in their project that they don't maintain" from above.
Con: It feels a little weird since we're explicitly not using the the language-provided method of sharing code.
Con: Need some way of saying which table layout you're using, and thus need the code for. E.g. interacts poorly with how ad-hoc we are about ouput kinds is.
Con: For 3a, this doesn't fully addresses the compatibility issue.
Because the table structure has to be "defined as just tuples, slices, etc", the user can (and likely will) depend on on the concrete layout here, meaning some types of updates could cause compile errors.
Even if the layout remained constant, and just the interpretation of it changed, the user could forget to regenerate the shared code to match. This isn't a case I care as much about -- And it could be addressed in part by emitting exhaustive tests with the table data.
unsafe
in the table access code, which isn't a thing we do now, but in some cases it could be worth it, potentially.In 3b, neither of these issues exist -- we can provide structure definitions, and we could emit a const indicating a version number in the shared code, and a static assert that they match in the table module. This is possible because the table module has the path to the shared code, which is not true for the 3a. (Something like
const _: [(); shared::BITSET_VERSION] = [(); 1];
)4. many-table output
Change ucd-generate so that it can output as many tables as are needed in a single run of the command. Then, it would generate the shared code for exactly the set of tables types which are needed as well as all the tables.
This is of course not complete, there are also hybrids you could imagine... But it's getting long.
Opinions
In reverse order -- popping each of these off my mental stack:
Choice 4: multi-table output
I like the idea of this choice: Generate only what's needed, and have a full understanding of desired dataset, potentially opening further optimizations.
And really, I wish we lived in a world where we already had this information... But that's not the world we live in, and I don't even know what this would look like in the CLI. How much code would have to change? Honestly It doesn't seem worth it. I suspect cross-table optimizations would not be worth a major change like this anyway, and things that do prove useful can just be handled on a case-by-case basis with a new subcommand or flag or whatever.
Choice 3: New subcommand to emit code in separate module
This seems very fiddly. In particular I think describing the sets of all the things you want to search means it has similar problems to choice 4 (need to describe the set of tables you're generating).
If we made the code generation bits more consistent and used that to be more principaled about the CLI options it wouldn't be that bad to add, but I don't think it has enough benefits to outweigh the complexity and downsides at the moment.
Choice 2: Emit code in same module.
This doesn't have many major downsides, but looks a little goofy. I think the concerns about code style are fine so long as what we emit passes default clippy lints and such.
If we do this I suspect it's only a matter of time before someone ask for choice 3, but maybe things will be cleaner then and it will be easier to add. This wouldn't prevent that, or anything.
Choice 1: Status quo
If I liked this, wouldn't have filed this issue.
Wrapping up
So yeah, number two is my opinion. In practice the changes here would be:
Thoughts?
(Note that a large part of the reason behind this is because I feel like just submitting a PR with one of those weird data structures, and the code generated next it would at a minimum raise, eyebrows. I also feel like the current status is not ideal).
The text was updated successfully, but these errors were encountered: