Dynamically load parsers #44
Comments
I would appreciate help on where to put dynamically-loaded files on other OSes. For Linux I'm pretty sure it would be

To add my two cents: IMO, the DSL would be the best pick because, if done correctly, it shouldn't have a major impact on the speed of the whole program. Maybe the DSL bytecode could be cached?

As for the WASM interpreter: it could work, but the binary size would grow quite a lot, and if it's not JIT-compiled or well optimized, I don't think it would be as fast as a custom DSL.

PS: Shared libraries as parsers would be very fast, but making them would be a hassle. Everything would need to go through the C ABI, and if you had to change the ABI, all parsers would need to be recompiled.

2nd PS: Maybe you could store parsers on Windows in a subdirectory inside
I have to admit I lean towards the DSL. It does seem like the best option. Right now my main concern is whether to make it an imperative language (i.e. you write code that describes how to parse a file format, similar to the current approach) or a descriptive one (i.e. defining the structure of the file format, and letting hevi parse it following it as a "guide"). About the caching thing: caching the bytecode would be pretty important in case the imperative route is taken. And, if it ends up being a descriptive one, it's not really an issue, as there would be no need for a bytecode IR.
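To make the imperative/descriptive distinction concrete, here is a minimal Python sketch. It is not hevi's code or any planned DSL syntax: the toy format, the field names and both helper functions are made up for illustration. The imperative version spells out how to walk the bytes, while the descriptive version is plain data that a generic walker follows as a "guide".

```python
# Hypothetical toy format: 4-byte magic, 2-byte version, 1-byte flags.
# Both parsers return (name, start, end) spans that a viewer could color.

def parse_imperative(data: bytes):
    """Imperative style: the parser author writes every step by hand."""
    spans = []
    pos = 0
    spans.append(("magic", pos, pos + 4))
    pos += 4
    spans.append(("version", pos, pos + 2))
    pos += 2
    spans.append(("flags", pos, pos + 1))
    return spans

# Descriptive style: the format itself is just data...
DESCRIPTION = [("magic", 4), ("version", 2), ("flags", 1)]

def parse_descriptive(description, data: bytes):
    """...and one generic walker handles every description."""
    spans = []
    pos = 0
    for name, size in description:
        if pos + size > len(data):
            raise ValueError(f"file too short for field {name!r}")
        spans.append((name, pos, pos + size))
        pos += size
    return spans
```

In the descriptive case there is nothing to compile, which is why a bytecode IR (and its cache) would not be needed there.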
100% right. Plus, WASM APIs are usually pretty cumbersome. Basically, it has the disadvantages of all other options, while making the executable significantly bigger.
I also don't like the fact that they cannot be easily sandboxed...
That seems pretty reasonable. Thanks!
A descriptive DSL would be the perfect solution, or at least that is what I want to say. However, it depends greatly on how much of the file you want to highlight. Should it be simple, where hevi only parses the headers and some checksums (and even something that simple could still be a challenge), or very specific (like highlighting the chunks of a

Unfortunately, file formats are very different and complex on their own, and developing a description language that works for all of them would be almost impossible. If just simple highlighting is needed, then maybe it could work.

I think even caching the tokenized DSL could be useful if speed is important.
I was thinking of something similar to the following:
where you assign to an identifier to create a construct (with the special identifier

My main concerns are:

Non-tree formats
Most file formats form a tree-like structure of slices of memory, with every child slice contained in its parent slice. Things like ELF don't. There's a header that is of tree form, but the actual contents are found with a given offset and size in the header. I don't know how to express this properly. One idea is to, when

EOF assertion
If using the pointer approach, we can no longer assert for EOF (at least not as easily). When you do

Variable-size constructs
What about a file with a

Data interpretation
When using variable-sized constructs that depend on data of the file, or when conditionally parsing depending on some data, data interpretation becomes relevant. Maybe we should use
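To illustrate the non-tree concern, here is a small Python sketch of a made-up format (not ELF, and not a proposal for the DSL): an 8-byte header holds a little-endian offset and size, and the payload lives wherever they point. The function name and span names are hypothetical.

```python
import struct

def parse_pointer_format(data: bytes):
    """Toy format: header = (u32 offset, u32 size), payload found via the pointer."""
    if len(data) < 8:
        raise ValueError("file too short for header")
    # Interpreting the header bytes is required before the payload can be located.
    offset, size = struct.unpack_from("<II", data, 0)
    if offset + size > len(data):
        raise ValueError("payload runs past end of file")
    return [
        ("header.offset", 0, 4),
        ("header.size", 4, 8),
        ("payload", offset, offset + size),  # not nested inside the header span
    ]
```

With a header pointing at offset 12, bytes 8 to 11 belong to no span at all, so a simple "did we consume everything?" check no longer works as an EOF assertion. The payload size also only becomes known after the header has been interpreted as integers, which is where variable-size constructs and data interpretation come in.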
Looks great! The things I was most concerned about were variable-sized constructs and non-tree formats. For example, numerous image formats like
With the pointer approach, EOF assertions could still be enabled for descriptions that don't contain any pointers.
This is a must. What you described is what it should do, and it would be even better if simple expression evaluation were implemented (i.e.
IMHO, it's better to have a prefix on elementary constructs that conveys their meaning like

About endianness, having suffixes for them like

So, for example, based on what you've said and what I mentioned, this could be a description for a QOI image parser (seeing that one is already implemented in Zig):

@root = struct {
    magic: b32 = "qoif",
    width: u32b, // Big endian
    height: u32b, // Big endian
    channels: u8,
    colorspace: u8,
    encoded_image: ... // Anything else to say "spans until EOF"?
}

Am I correct?
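For reference, this is roughly what an interpreter would have to do with that description, sketched in Python. The type-name table and the function are hypothetical (only the names `u32b` and `u8` come from the description above; `u16b`, `u16l` and `u32l` are extrapolated), and the field layout follows the 14-byte QOI header.

```python
import struct

# Hypothetical type names: trailing "b" = big endian, "l" = little endian.
# Single-byte types need no suffix.
TYPE_FORMATS = {"u8": "B", "u16b": ">H", "u16l": "<H", "u32b": ">I", "u32l": "<I"}

# The fields after the magic, in file order.
QOI_HEADER = [
    ("width", "u32b"),
    ("height", "u32b"),
    ("channels", "u8"),
    ("colorspace", "u8"),
]

def parse_qoi_header(data: bytes):
    """Check the magic, decode each field, and report the rest as a span."""
    if data[:4] != b"qoif":
        raise ValueError("not a QOI file: bad magic")
    values = {}
    pos = 4
    for name, type_name in QOI_HEADER:
        fmt = TYPE_FORMATS[type_name]
        (values[name],) = struct.unpack_from(fmt, data, pos)
        pos += struct.calcsize(fmt)
    # "Spans until EOF": everything after the header.
    values["encoded_image"] = (pos, len(data))
    return values
```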
I really like your suggestions, especially the
Instead of assuming host endianness (which is not something that really appears in file formats AFAIK, and would just make parsers misbehave on some architectures), there could be a global "property" (e.g.
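A tiny Python sketch of how such a default could interact with per-field suffixes. The names are hypothetical: `default_endian` stands in for the global property, and an explicit `b` or `l` suffix overrides it.

```python
def read_uint(data: bytes, pos: int, type_name: str, default_endian: str = "big"):
    """Read an unsigned integer named like 'u8', 'u32', 'u32b' or 'u32l'."""
    endian = default_endian
    if type_name.endswith("b"):
        endian, type_name = "big", type_name[:-1]
    elif type_name.endswith("l"):
        endian, type_name = "little", type_name[:-1]
    size = int(type_name[1:]) // 8  # "u32" -> 4 bytes
    return int.from_bytes(data[pos:pos + size], endian)
```

Host endianness never enters the picture: the result depends only on the file bytes, the description's default and the field's suffix.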
This is a long-term idea that won't be implemented until version v2.0.0. The idea is to have the parsers be a separate file that the user can download and manage. So the parsers wouldn't be built-in. It's a very nice way to offer a large variety of parsers without making the binary huge. There are a few ways to do it:
Comparison
Back-compatibility
Either way, this would break how parsers work. That could be an argument to put it in the v1.0.0 release. However, this would take a lot of time and would postpone the release even more (I've wanted to release it for a while now).