Non-stream API #65
It's still some API, although a different implementation. You can take a look at Java, for example: it used one set of underlying classes first, then we switched to the current set (which uses memory-mapped files internally). If you want to implement any such API in another input paradigm / tooling — you're welcome to go ahead and contribute :)

If you want to avoid method calls to such an API and instead embed API calls into the target code — that's possible too. We've already discussed a possibility to do such "static linking", i.e. instead of:

```java
foo = _io.readS1();
bar = _io.readS4be();
baz = _io.readStrByteLimit(16, "UTF-8");
```

we'll just inline runtime code right inside the generated code:

```java
foo = _bb.get();
_bb.order(ByteOrder.BIG_ENDIAN);
bar = _bb.getInt();
{
    byte[] _buf = new byte[16];
    _bb.get(_buf);
    baz = new String(_buf, Charset.forName("UTF-8"));
}
```

I doubt that it will be any faster (C++/C#/Java can do proper inlining automatically), but it could probably save some hassle with proper linking of the runtime library, etc.
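For comparison, here is a hedged Python analogue of the same "static linking" idea: instead of calling runtime helpers (`_io.read_s1()`, `_io.read_s4be()`, ...), the compiler would inline equivalent `struct`-module calls directly into the generated parser. This is an illustrative sketch, not actual compiler output; the function name and offsets are made up.

```python
import struct

def parse_inlined(buf):
    # Inlined equivalents of the runtime calls from the Java example above:
    foo = struct.unpack_from("b", buf, 0)[0]    # read_s1: signed 1-byte int
    bar = struct.unpack_from(">i", buf, 1)[0]   # read_s4be: signed 4-byte big-endian int
    baz = buf[5:21].decode("utf-8")             # read_str_byte_limit(16, "UTF-8")
    return foo, bar, baz
```

As with the Java version, the win (if any) is in avoiding a runtime-library dependency, not in speed.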
Not quite this, but a very related issue. For now KS parses the whole stream and stores it in memory. Imagine what will happen if you try to parse a multi-GiB file with it. The solution is memory-mapped files, but KS is not memory-friendly, so we need another stream API model. For every parsed field, KSc would generate code creating a memory-map object. This object is stored in the structure currently being parsed and contains the offset of the field and its length. Then KSc creates properties (an object behaving like a value but instead calling accessors — this is possible even in C++) which take the offset from the memory map and get the data by offset. The OS will take care of the rest.
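The scheme described above could be sketched in Python with descriptors: the parser records only (offset, length) per field, and each access decodes straight out of the mapped buffer, letting the OS page data in on demand. All names here are hypothetical; a real `mmap.mmap` object would be passed where the test uses plain bytes.

```python
import struct

class LazyU4:
    """Descriptor that decodes a u4le field from the mapped buffer on access.

    Sketch of the proposed scheme: only the offset is stored at parse time;
    the actual read happens lazily against obj._mm (an mmap or any buffer).
    """
    def __init__(self, offset):
        self.offset = offset

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return struct.unpack_from("<I", obj._mm, self.offset)[0]

class Header:
    # Hypothetical fixed layout: two little-endian u4 fields.
    magic = LazyU4(0)
    version = LazyU4(4)

    def __init__(self, mm):
        self._mm = mm  # mmap.mmap over the file, or any bytes-like buffer
```

Nothing is copied out of the buffer until a field is actually touched.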
There is some support for lazy parsing (take a look at any filesystem implementation) — it allows you not to "parse the whole stream and store it in memory". Generally, the problem is solved easily if your file has some sort of index, so you can create an API that uses this index to parse file parts on demand. If you just have a long stream of structures of different sizes, you usually work with them differently, i.e. you can blindly seek to the middle of the file and try to find the next "frame" or whatever by some sync bytes. For example, if you want individual frames in a certain media codec, you can't afford to parse them all just to build an index of offsets/sizes of each frame.
The Java runtime uses memory-mapped files. I don't quite understand how exactly it would help — if I got you right, memorizing offset+size and re-reading the stuff can be done with regular file APIs as well.
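The "regular file APIs" point can be made concrete with a small sketch: remember the offset at parse time, skip over the body, and re-read it on demand with plain `seek`/`read`. Names and the fixed 8-byte body size are hypothetical.

```python
import io
import struct

def read_u4le(f):
    # Minimal stand-in for a runtime's read_u4le helper.
    return struct.unpack("<I", f.read(4))[0]

class Record:
    """Lazy body access using only an ordinary seekable file object."""
    def __init__(self, f):
        self._f = f
        self.kind = read_u4le(f)
        self._ofs_body = f.tell()            # remember where the body starts
        self._len_body = 8                   # assumed known a priori
        f.seek(self._len_body, io.SEEK_CUR)  # skip the body for now

    @property
    def body(self):
        if not hasattr(self, "_m_body"):
            pos = self._f.tell()
            self._f.seek(self._ofs_body)
            self._m_body = self._f.read(self._len_body)
            self._f.seek(pos)                # restore stream position
        return self._m_body
```

No memory mapping is involved; the OS page cache still makes the re-read cheap.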
Some formats don't have an index table, but have a header (without any offsets) and a continuous stream of structures without sync bytes (though often these structures have a fixed size, which can be used for optimization if the compiler can derive it).
Yes, it can. But at least for Python (I haven't looked into other runtimes) this cannot be done without serious contortions. The problem is that:
Ok, fair enough. That's a separate issue with a separate solution — i.e. "lazy parsing of arrays with fixed element sizes", which can be achieved by some sparse array data structure + arithmetic. I'll create an issue now.
Sorry, but I don't really understand what problem you're trying to solve here in the first place. It looks like you're trying to "do some lazy parsing by changing the runtime without touching the compiler and generated code". I'm proposing the exact opposite: "change the compiler (and generated code) to reuse the existing runtime".

For example, let's take the problem "do lazy parsing of a structure with an a priori known size". Something like this:

```yaml
seq:
  - id: foo
    type: u4
  - id: bar
    size: 100500
    type: something_else
    lazy: true
  - id: baz
    type: u4
```

Right now the following constructor body is generated (if we remove `lazy: true`):

```python
self.foo = self._io.read_u4le()
self._raw_bar = self._io.read_bytes(100500)
io = KaitaiStream(BytesIO(self._raw_bar))
self.bar = self._root.SomethingElse(io, self, self._root)
self.baz = self._io.read_u4le()
```

I'm implying that we can generate something like this in the constructor:

```python
self.foo = self._io.read_u4le()
self._ofs_bar = self._io.pos()  # remember position
self._io.seek(self._io.pos() + 100500)  # just skip over these many bytes
self.baz = self._io.read_u4le()
```

and, in addition, add a special lazy getter which is very similar to what's generated now for instances:

```python
@property
def bar(self):
    if hasattr(self, '_m_bar'):
        return self._m_bar if hasattr(self, '_m_bar') else None
    _pos = self._io.pos()
    self._io.seek(self._ofs_bar)  # that position we remembered previously
    self._raw__m_bar = self._io.read_bytes(100500)
    io = KaitaiStream(BytesIO(self._raw__m_bar))
    self._m_bar = self._root.SomethingElse(io, self, self._root)
    self._io.seek(_pos)
    return self._m_bar if hasattr(self, '_m_bar') else None
```

The only line which is special here is the commented one. No need to change any runtimes; it works with every target.
> It looks like you're trying to do "do some lazy parsing by changing runtime without touching the compiler and generated code".

I tried to do this first, but failed because I strongly dislike contortions. So you see this issue raised.
Your code is OK, but I want to note a couple of things:

1. We can make all the fields properties and not use `read_*` directly in the constructor, but rely on properties (their offsets and sizes should already be known prior to getter invocation).
2. We can introduce setters for primitive types right now. Since we don't have a symbolic engine for now, using setters can be dangerous (it can violate constraints), but it could be made available right now (if we had writing functions) for use at one's own risk.
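Point 2 above could be sketched as a read/write property pair over a mutable buffer. This is a hypothetical illustration (field layout and class name are invented), and the comment marks exactly the danger mentioned: nothing validates the written value against the rest of the structure.

```python
import struct

class Packet:
    """Read/write property over a raw bytearray, for a hypothetical layout
    with a little-endian u2 `length` field at offset 0."""
    def __init__(self, buf):
        self._buf = buf  # bytearray holding the raw packet bytes

    @property
    def length(self):
        return struct.unpack_from("<H", self._buf, 0)[0]

    @length.setter
    def length(self, value):
        # Danger: with no symbolic engine, nothing checks that `value` stays
        # consistent with the rest of the structure (e.g. the actual payload
        # size) — this is the "use at one's own risk" part.
        struct.pack_into("<H", self._buf, 0, value)
```

Writing through the setter mutates the underlying bytes in place, which is also a first step toward serialization.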
> You're implying that we should make all

Sure, feel free to go ahead and plan / implement
Yes. This will also give us some kind of serialization.

Yes, this will be slower. I see no way to solve this problem in Python: numbers are value types (immutable) there, and one cannot get a reference to them even using bytecode. So I guess this trade-off should be tweakable by compilation flags.
I agree with @GreyCat. Lazy parsing is a separate issue, and memory-mapped IO is unrelated.
In some languages the goals can be achieved without a stream API — namely, any language that allows accessing memory as it is stored in the machine. JavaScript, where we have typed arrays and ArrayBuffers (any array buffer can be mapped to a typed array; for example, you can determine the endianness of a machine this way), is one such language. You can access non-machine-sized members using setters and getters, and iterate over collections using iterators.

Others are C++, C (with limitations), PHP (pack/unpack + a wrapper) and Python (mmap). In C++ you can also iterate over sequences using an overloaded `[]` operator.

The advantage of all of these is more convenient and fast access to data represented as memory, such as memory-mapped files, network packets and various structs in memory.