Non-stream API #65

KOLANICH · 2016-12-13T21:39:51Z

In some languages the goals can be achieved without a stream API. This applies to any language that allows accessing memory as it is laid out in the machine. JavaScript is one such language: it has typed arrays and ArrayBuffers, and any ArrayBuffer can be viewed through a typed array. For example, you can determine the endianness of a machine:

let a=Uint8Array.from((function*(){for(let i=0;i<16;i+=2){yield (i<<4)|(i+1);}})())
console.log(Array.from(a).map(n=>n.toString(16)));
console.log(Array.from(new Uint16Array(a.buffer)).map(n=>n.toString(16)));
console.log(Array.from(new Uint32Array(a.buffer)).map(n=>n.toString(16)));
//console.log(Array.from(new Uint64Array(a.buffer)).map(n=>n.toString(16)));

You can access non-machine-sized members using setters and getters, and iterate over collections using iterators.
Others are C++, C (with limitations), PHP (pack/unpack + wrapper) and Python (mmap). In C++ you can also iterate over sequences using an overloaded [] operator.
The advantage of all of these is more convenient and faster access to data represented as memory, such as memory-mapped files, network packets and various in-memory structs.
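
As a rough Python illustration of the getter/setter idea (a sketch only, not part of any Kaitai Struct runtime): the hypothetical `RecordView` class below wraps a writable buffer and exposes a 24-bit field, which has no machine-sized type, through a property. The record layout (a 2-byte little-endian id followed by a 3-byte little-endian length) is made up for the example.

```python
import struct


class RecordView:
    """Hypothetical view over a 5-byte record: u2le id, then a 24-bit LE length."""

    def __init__(self, buf, offset=0):
        # memoryview does not copy: reads and writes go to the caller's buffer
        self._buf = memoryview(buf)
        self._off = offset

    @property
    def id(self):
        # Machine-sized field: struct can decode it in place
        return struct.unpack_from("<H", self._buf, self._off)[0]

    @property
    def length(self):
        # 24-bit field: struct has no format code for it, so assemble 3 bytes
        start = self._off + 2
        return int.from_bytes(self._buf[start:start + 3], "little")

    @length.setter
    def length(self, value):
        start = self._off + 2
        self._buf[start:start + 3] = value.to_bytes(3, "little")


data = bytearray(b"\x34\x12\x56\x34\x12")
rec = RecordView(data)
print(hex(rec.id), hex(rec.length))  # fields decoded straight from the buffer
rec.length = 0xABCDEF                # the setter writes through to `data`
print(data.hex())
```

The same class would work over an `mmap` object or a received network packet, since it only relies on the buffer protocol.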

GreyCat · 2016-12-13T23:21:36Z

It's still an API, just a different implementation. Take a look at Java, for example: it used one set of underlying classes at first, then we switched to the current set (which uses memory-mapped files internally).

If you want to implement any such API in another input paradigm / tooling — you're welcome to go ahead and contribute :)

If you want to avoid method calls to such API and instead embed API calls into the target code — that's possible too, we've already discussed a possibility to do such "static linking", i.e. instead of:

foo = _io.readS1();
bar = _io.readS4be();
baz = _io.readStrByteLimit(16, "UTF-8");

we'll just inline runtime code right inside the generated code:

foo = _bb.get();
_bb.order(ByteOrder.BIG_ENDIAN);
bar = _bb.getInt();
{
    byte[] _buf = new byte[16];
    _bb.get(_buf);
    baz = new String(_buf, Charset.forName("UTF-8"));
}

I doubt that it will be any faster (C++/C#/Java can do proper inlining automatically), but probably it could save some hassle to do proper linking of runtime library, etc.

KOLANICH · 2017-03-16T16:57:12Z

Not quite this, but a closely related issue.

For now KS parses the whole stream and stores it in memory. Imagine what happens if you try to parse a multi-GiB file with it. The solution is memory-mapped files. But KS is not memory-friendly, so we need another stream API model. For every parsed structure, KSC generates code that creates a memory map object. This object is stored in the structure currently being parsed and contains the offset and length of each field. Then KSC generates properties (objects that behave like values but call accessors instead, which is possible even in C++) that take the offset from the memory map and fetch the data at that offset. The OS takes care of the rest.
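
A minimal Python sketch of this proposal, under stated assumptions: the class name `MappedRecord`, the field names and the layout (u4le magic, u1 name length, name bytes, u2le version) are all hypothetical, and this is not what the compiler generates today. The constructor only builds an offset/length table; the properties decode a field when it is accessed, and the OS pages the mapped file in as needed.

```python
import mmap
import os
import struct
import tempfile


class MappedRecord:
    """Hypothetical structure: u4le magic, u1 name_len, name bytes, u2le version."""

    def __init__(self, mm, base=0):
        self._mm = mm
        name_len = mm[base + 4]  # the only byte read eagerly; needed to place later fields
        # field name -> (absolute offset, length)
        self._table = {
            "magic": (base, 4),
            "name": (base + 5, name_len),
            "version": (base + 5 + name_len, 2),
        }

    def _raw(self, field):
        offset, length = self._table[field]
        return self._mm[offset:offset + length]  # copies only this field's bytes

    @property
    def magic(self):
        return int.from_bytes(self._raw("magic"), "little")

    @property
    def name(self):
        return self._raw("name").decode("ascii")

    @property
    def version(self):
        return int.from_bytes(self._raw("version"), "little")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("<IB6sH", 0xCAFEBABE, 6, b"kaitai", 3))
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        rec = MappedRecord(mm)
        result = (rec.magic, rec.name, rec.version)

print(result)
```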

GreyCat · 2017-03-16T17:06:25Z

There is some support for lazy parsing (take a look at any filesystem implementation) — it allows you to not "parse whole stream and store it in memory". Generally, the question is solved easily if your file has some sort of index, thus you can create an API that will use this index to parse file parts on demand. If you just have a long stream of structures of different sizes, you usually work with them differently, i.e. you can blindly seek to a middle of the file and try to find next "frame" or whatever by some sync bytes. For example, if you want individual frames in a certain media codec, you can't afford to parse them all just to build an index of offsets/sizes of each frame.

> The solution is memory-mapped files.

Java runtime uses memory-mapped files. I don't quite understand how exactly it would help — if I got you right, memorizing offset+size and re-reading the stuff can be done with regular file APIs as well.

KOLANICH · 2017-03-17T07:55:18Z

> Generally, the question is solved easily if your file has some sort of index, thus you can create an API that will use this index to parse file parts on demand.

Some formats don't have an index table, but have a header (without any offsets) and a continuous stream of structures without sync bytes (but these structures often have a fixed size, which can be used for optimization if the compiler can derive it).

> if I got you right, memorizing offset+size and re-reading the stuff can be done with regular file APIs as well.

Yes, it can. But at least for python (I haven't looked into other runtimes) this cannot be done without much perversions. The problem is that:

- The compiler generates reads using the read_* methods of KaitaiStream in the constructor. At the time the constructor is called we cannot create properties (but we can create tables); all properties have to be created ahead of time in the class definition.
- It doesn't pass a reference to the current object to the read_* methods, so they cannot be used to populate the table.
- A workaround is to redefine the functions to return a special structure, then walk the class dict, find all properties of the special type and move them to a table, but this also needs either compiler support or manual finishing.
- A workaround for the manual step is to insert a check into __getattr__ and do this lazily, but ...

I consider all of this perversions. I think it would be better if the compiler generated:
1. Table-generation code in constructors.
2. @property-decorated methods.

GreyCat · 2017-03-17T12:03:48Z

> Some formats don't have an index table, but have a header (without any offsets) and a continuous stream of structures without sync bytes (but these structures often have a fixed size, which can be used for optimization if the compiler can derive it).

Ok, fair enough. That's a separate issue with a separate solution, i.e. "lazy parsing of arrays with fixed element sizes", which can be achieved with some sparse array data structure plus arithmetic. I'll create an issue now.
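
To make the "sparse array + arithmetic" idea concrete, here is a small Python sketch. The class `LazyFixedArray` and its parameters are hypothetical and not part of any runtime: element `i` lives at `base + i * elem_size`, so it can be parsed on first access and remembered in a dict that holds only the elements actually touched.

```python
import io
import struct
from collections.abc import Sequence


class LazyFixedArray(Sequence):
    """Hypothetical lazy view of `count` fixed-size elements in a seekable stream."""

    def __init__(self, stream, base, elem_size, count, parse):
        self._stream = stream
        self._base = base
        self._elem_size = elem_size
        self._count = count
        self._parse = parse  # callable: raw bytes -> parsed element
        self._cache = {}     # sparse storage: only elements that were accessed

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(self._count))]
        if index < 0:
            index += self._count
        if not 0 <= index < self._count:
            raise IndexError(index)
        if index not in self._cache:
            pos = self._stream.tell()
            # The arithmetic part: no index table needed for fixed-size elements
            self._stream.seek(self._base + index * self._elem_size)
            self._cache[index] = self._parse(self._stream.read(self._elem_size))
            self._stream.seek(pos)  # leave the stream where the caller had it
        return self._cache[index]


# A 4-byte header followed by 1000 u4le elements, each holding its index squared
payload = b"HEAD" + b"".join(struct.pack("<I", i * i) for i in range(1000))
stream = io.BytesIO(payload)
arr = LazyFixedArray(stream, base=4, elem_size=4, count=1000,
                     parse=lambda raw: struct.unpack("<I", raw)[0])
print(arr[10], arr[-1], len(arr._cache))  # only two elements were ever parsed
```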

> this cannot be done without much perversions.

Sorry, but I don't really understand what problem you're trying to solve here in the first place. It looks like you're trying to "do some lazy parsing by changing the runtime without touching the compiler and generated code". I'm proposing the exact opposite: "change the compiler (and generated code) to reuse the existing runtime". For example, let's take the problem "do lazy parsing of a structure with an a priori known size". Something like this:

seq:
  - id: foo
    type: u4
  - id: bar
    size: 100500
    type: something_else
    lazy: true
  - id: baz
    type: u4

Right now the following constructor body is generated (if we remove lazy):

        self.foo = self._io.read_u4le()
        self._raw_bar = self._io.read_bytes(100500)
        io = KaitaiStream(BytesIO(self._raw_bar))
        self.bar = self._root.SomethingElse(io, self, self._root)
        self.baz = self._io.read_u4le()

I'm suggesting that we could generate something like this in the constructor:

        self.foo = self._io.read_u4le()
        self._ofs_bar = self._io.pos() # remember position
        self._io.seek(self._io.pos() + 100500) # just skip over these many bytes
        self.baz = self._io.read_u4le()

and, in addition, add a special lazy getter which is very similar to what's generated now for instances:

    @property
    def bar(self):
        if hasattr(self, '_m_bar'):
            return self._m_bar if hasattr(self, '_m_bar') else None

        _pos = self._io.pos()
        self._io.seek(self._ofs_bar) # that position we remembered previously
        self._raw__m_bar = self._io.read_bytes(100500)
        io = KaitaiStream(BytesIO(self._raw__m_bar))
        self._m_bar = self._root.SomethingElse(io, self, self._root)
        self._io.seek(_pos)
        return self._m_bar if hasattr(self, '_m_bar') else None

The only line which is special here is the commented one. No need to change any runtimes, works with every target.

KOLANICH · 2017-03-17T12:37:19Z

> It looks like you're trying to "do some lazy parsing by changing runtime without touching the compiler and generated code".

I tried to do this first, but failed because I strongly dislike perversions. Hence this issue. Your code is OK, but I want to note a couple of things. 1 We can make all the fields properties and not use read_* directly in the constructor, relying on the properties instead (their offsets and sizes should already be known prior to getter invocation). 2 We can introduce setters for primitive types right now. Since we don't have a symbolic engine yet, using setters can be dangerous (it can violate constraints), but they could be available right now (if we had writing functions) for use at one's own risk.
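
A small Python sketch of why such setters are "use at your own risk" (the `Chunk` class and its layout, a u1 size followed by that many payload bytes, are hypothetical): the setter writes the primitive back into the buffer without knowing that another field depends on it.

```python
import struct


class Chunk:
    """Hypothetical layout: u1 size, then `size` payload bytes."""

    def __init__(self, buf):
        self._buf = buf  # writable buffer, e.g. a bytearray

    @property
    def size(self):
        return self._buf[0]

    @size.setter
    def size(self, value):
        # Writes the primitive back in place; nothing checks the payload length
        struct.pack_into("B", self._buf, 0, value)

    @property
    def payload(self):
        return bytes(self._buf[1:1 + self.size])

    def is_consistent(self):
        # The constraint a symbolic engine would have to enforce
        return 1 + self.size <= len(self._buf)


chunk = Chunk(bytearray(b"\x03abc"))
before = chunk.is_consistent()
chunk.size = 200                     # accepted silently
after = chunk.is_consistent()
print(before, after, chunk.payload)  # the declared size now exceeds the buffer
```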

GreyCat · 2017-03-17T12:48:47Z

> 1 We can make all the fields properties and not use read_* directly in the constructor, relying on the properties instead (their offsets and sizes should already be known prior to getter invocation).

You're implying that we should make all seq attributes lazy? That's not really optimal, in many cases just reading them directly would be much faster than all that fiddling with backing variables, checks, seeks, etc.

> 2 We can introduce setters for primitive types right now. Since we don't have a symbolic engine yet, using setters can be dangerous (it can violate constraints), but they could be available right now (if we had writing functions) for use at one's own risk.

Sure, feel free to go ahead and plan / implement write_* methods — they would be helpful in future ;)

KOLANICH · 2017-03-17T15:38:40Z

> You're implying that we should make all seq attributes lazy?

Yes. This will also give us some kind of serialization.

> That's not really optimal, in many cases just reading them directly would be much faster than all that fiddling with backing variables, checks, seeks, etc.

Yes, this will be slower. I see no way to solve this problem in Python: numbers are immutable value types there, and one cannot get a reference to them even using bytecode. So I guess this trade-off should be tweakable via compilation flags.

arekbulski · 2018-01-18T07:23:21Z

I agree with @GreyCat. Lazy parsing is a separate issue and memory-mapped IO is unrelated.

GreyCat mentioned this issue Mar 17, 2017

Lazy parsing of sequence attributes #133

Open

KOLANICH mentioned this issue May 11, 2017

Document security expectations #170

Open

KOLANICH mentioned this issue Sep 25, 2017

Any port for plain C? #263

Open

KOLANICH mentioned this issue Jan 18, 2018

Cython runtime and compile target #311

Closed

KOLANICH mentioned this issue Aug 23, 2018

Silent failure when a field references a subsequent field #469

Open

KOLANICH mentioned this issue Nov 20, 2018

Add rekordbox file format specs. kaitai-io/kaitai_struct_formats#116

Open

GreyCat added this to the Low priority milestone Mar 28, 2019

KOLANICH mentioned this issue Mar 2, 2020

Kaitai and scientific data #711

Open

This was referenced Mar 27, 2021

Question: Instruct kaitai to read the stream but not to store byte array #867

Closed

Memory consumption problem in large file parsing. #866

Closed

KOLANICH mentioned this issue May 16, 2021

Rendering formats from MMD (MikuMikuDance) kaitai-io/kaitai_struct_formats#446

Open

KOLANICH mentioned this issue Apr 25, 2022

reducing I/O by using buffers and memoryviews kaitai-io/kaitai_struct_python_runtime#67

Open

generalmimon added the enhancement label Jul 20, 2024
