Non-stream API #65

Open
KOLANICH opened this issue Dec 13, 2016 · 9 comments

Comments

@KOLANICH

KOLANICH commented Dec 13, 2016

In some languages the goals can be achieved without a stream API, namely in any language that allows accessing memory as it is stored in the machine. JavaScript is one such language: it has typed arrays and ArrayBuffers, and any ArrayBuffer can be mapped to a typed array. For example, you can determine the endianness of a machine:

let a=Uint8Array.from((function*(){for(let i=0;i<16;i+=2){yield (i<<4)|(i+1);}})())
console.log(Array.from(a).map(n=>n.toString(16)));
console.log(Array.from(new Uint16Array(a.buffer)).map(n=>n.toString(16)));
console.log(Array.from(new Uint32Array(a.buffer)).map(n=>n.toString(16)));
//console.log(Array.from(new Uint64Array(a.buffer)).map(n=>n.toString(16)));

You can access non-machine-sized members using setters and getters, and iterate over collections using iterators.
Others are C++, C (with limitations), PHP (pack/unpack + wrapper) and Python (mmap). In C++ you can also iterate over sequences using an overloaded [] operator.
The advantage of all of these is more convenient and faster access to data represented as memory, such as memory-mapped files, network packets and various structs in memory.
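For comparison, here is a minimal Python sketch of the same idea: the byte pattern from the JavaScript snippet is reinterpreted in place through a memoryview, and fields are read by absolute offset rather than through a stream position. The helper name read_u4be is an assumption made up for this example, not part of any Kaitai Struct runtime.

```python
import struct
import sys

# Same byte pattern as the JavaScript snippet: 0x01, 0x23, 0x45, ..., 0xef.
raw = bytes((i << 4) | (i + 1) for i in range(0, 16, 2))
view = memoryview(raw)

# Reinterpret the same memory as native-endian 16-bit words, without copying.
words = view.cast("H")
native_is_little = words[0] == 0x2301
assert native_is_little == (sys.byteorder == "little")


def read_u4be(buf, offset):
    """Hypothetical helper: read a big-endian u4 at an absolute offset."""
    return struct.unpack_from(">I", buf, offset)[0]


first = read_u4be(view, 0)   # bytes 01 23 45 67
second = read_u4be(view, 4)  # bytes 89 ab cd ef
```

The same read_u4be call should work unchanged on any object exposing the buffer protocol, for instance an mmap of a file or a received network packet.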

@GreyCat
Member

GreyCat commented Dec 13, 2016

It's still some API, just with a different implementation. You can take a look at Java, for example: it used one set of underlying classes at first, then we switched to the current set (which uses memory-mapped files internally).

If you want to implement any such API in another input paradigm / tooling — you're welcome to go ahead and contribute :)

If you want to avoid method calls to such API and instead embed API calls into the target code — that's possible too, we've already discussed a possibility to do such "static linking", i.e. instead of:

foo = _io.readS1();
bar = _io.readS4be();
baz = _io.readStrByteLimit(16, "UTF-8");

we'll just inline runtime code right inside the generated code:

foo = _bb.get();
_bb.order(ByteOrder.BIG_ENDIAN);
bar = _bb.getInt();
{
    byte[] _buf = new byte[16];
    _bb.get(_buf);
    baz = new String(_buf, Charset.forName("UTF-8"));
}

I doubt that it would be any faster (C++/C#/Java can do proper inlining automatically), but it could probably save the hassle of properly linking the runtime library, etc.

@KOLANICH
Author

KOLANICH commented Mar 16, 2017

Not quite this, but a closely related issue.

Currently KS parses the whole stream and stores it in memory. Imagine what happens if you try to parse a multi-GiB file with it. The solution is memory-mapped files, but KS is not memory-friendly, so we need another stream API model. For every parsed field, KSc generates code that creates a memory map object. This object is stored in the structure currently being parsed and contains the offset of the field and its length. Then KSc creates properties (objects that behave like values but call accessors instead, which is possible even in C++) that take the offset from the memory map and fetch the data at that offset. The OS will take care of the rest.
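A rough Python sketch of what such generated code might look like, assuming a made-up two-field layout (a u4le followed by a u2le). The class name MappedRecord, the field names and the table layout are all hypothetical; nothing here is actual KSc output.

```python
import mmap
import os
import struct
import tempfile


class MappedRecord:
    """Hypothetical generated class: stores only field offsets, reads on access."""

    def __init__(self, mm, base):
        self._mm = mm
        # Table built at construction time: field name -> (offset, struct format).
        self._table = {
            "foo": (base, "<I"),
            "baz": (base + 4, "<H"),
        }

    def _read(self, name):
        offset, fmt = self._table[name]
        return struct.unpack_from(fmt, self._mm, offset)[0]

    @property
    def foo(self):
        return self._read("foo")

    @property
    def baz(self):
        return self._read("baz")


def demo():
    # Write a small sample file; a real input could be many GiB.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("<IH", 0xDEADBEEF, 7))
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                rec = MappedRecord(mm, 0)
                return rec.foo, rec.baz
    finally:
        os.remove(path)


result = demo()
```

Only the pages that the accessors actually touch need to be brought into memory, which is the "OS will take care of the rest" part.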

@GreyCat
Member

GreyCat commented Mar 16, 2017

There is some support for lazy parsing (take a look at any filesystem implementation); it allows you not to "parse whole stream and store it in memory". Generally, the question is solved easily if your file has some sort of index: you can create an API that uses this index to parse parts of the file on demand. If you just have a long stream of structures of different sizes, you usually work with them differently, i.e. you can blindly seek to the middle of the file and try to find the next "frame" or whatever by some sync bytes. For example, if you want individual frames in a certain media codec, you can't afford to parse them all just to build an index of offsets/sizes of each frame.
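To illustrate the "seek to the middle and resynchronise" approach, here is a small Python sketch. The two-byte SYNC marker and the frame contents are invented for the example and do not describe any particular codec.

```python
SYNC = b"\xff\xfb"  # hypothetical sync marker, chosen only for this example


def next_frame_offset(data, start):
    """Return the offset of the first sync marker at or after `start`, or -1."""
    return data.find(SYNC, start)


# Three made-up frames of different sizes, each starting with the marker.
frames = [SYNC + b"a" * 5, SYNC + b"b" * 11, SYNC + b"c" * 3]
data = b"".join(frames)

# Jump blindly into the middle of the data and resynchronise from there.
middle = len(data) // 2
offset = next_frame_offset(data, middle)
```

A real parser would also have to validate the candidate frame, since the marker bytes may occur inside payload data.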

The solution is memory-mapped files.

Java runtime uses memory-mapped files. I don't quite understand how exactly it would help — if I got you right, memorizing offset+size and re-reading the stuff can be done with regular file APIs as well.

@KOLANICH
Author

KOLANICH commented Mar 17, 2017

Generally, the question is solved easily if your file has some sort of index, thus you can create an API that will use this index to parse file parts on demand.

Some formats don't have an index table, but have a header (without any offsets) and a continuous stream of structures without sync bytes (but often these structures have a fixed size, which can be used for optimization if the compiler can derive it).

if I got you right, memorizing offset+size and re-reading the stuff can be done with regular file APIs as well.

Yes, it can. But at least for python (I haven't looked into other runtimes) this cannot be done without much perversions. The problem is that:

  • the compiler generates reads via the read_* methods of KaitaiStream in the constructor. At the time the constructor is called, we cannot create properties (but we can create tables). All the properties have to be created ahead of time, in the class definition.
  • it doesn't pass a reference to the current object to the read_* methods, so they cannot be used to populate the table
  • the workaround is to redefine the functions to return a special structure, then walk the class dict, find all the properties of the special type and move them to a table, but .... this way also needs either compiler support or manual finishing
  • the workaround for the manual step is to insert a check into getattr and do this lazily, but
  • all of this I consider perversions. I think it'd be better if the compiler generated:
    1 Table-generation code in constructors.
    2 @property-decorated methods.

@GreyCat
Member

GreyCat commented Mar 17, 2017

Some formats don't have an index table, but have a header (without any offsets) and a continuous stream of structures without sync bytes (but often these structures have a fixed size, which can be used for optimization if the compiler can derive it).

Ok, fair enough. That's a separate issue with a separate solution, i.e. "lazy parsing of arrays with fixed element sizes", which can be achieved with some sparse array data structure + arithmetic. I'll create an issue now.
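One possible shape for such a "sparse array + arithmetic" structure, sketched in Python. The class name LazyFixedArray and the record layout (two u2le values per element after a 4-byte header) are assumptions made for the example only.

```python
import struct
from collections.abc import Sequence


class LazyFixedArray(Sequence):
    """Hypothetical lazy view over `count` fixed-size elements in a buffer."""

    def __init__(self, buf, base, elem_fmt, count):
        self._buf = buf
        self._base = base
        self._struct = struct.Struct(elem_fmt)
        self._count = count
        self._cache = {}  # sparse: only elements that were actually accessed

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(self._count))]
        if index < 0:
            index += self._count
        if not 0 <= index < self._count:
            raise IndexError(index)
        if index not in self._cache:
            # Element offset is pure arithmetic: base + index * element size.
            offset = self._base + index * self._struct.size
            self._cache[index] = self._struct.unpack_from(self._buf, offset)
        return self._cache[index]


# A 4-byte header followed by 1000 records of (u2le, u2le).
header = b"HDR0"
body = b"".join(struct.pack("<HH", i, i * 2) for i in range(1000))
records = LazyFixedArray(header + body, len(header), "<HH", 1000)

item = records[500]                 # parses only this one element
parsed_count = len(records._cache)  # number of elements parsed so far
```

Accessing one element parses exactly one element; the rest of the stream is never touched until someone asks for it.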

this cannot be done without much perversions.

Sorry, but I don't really understand what problem you're trying to solve here in the first place. It looks like you're trying to "do some lazy parsing by changing runtime without touching the compiler and generated code". I'm proposing the exact opposite: "change compiler (and generated code) to reuse existing runtime". For example, let's take the problem "do lazy parsing of a structure with an a priori known size". Something like this:

seq:
  - id: foo
    type: u4
  - id: bar
    size: 100500
    type: something_else
    lazy: true
  - id: baz
    type: u4

Right now the following constructor body is generated (if we remove lazy):

        self.foo = self._io.read_u4le()
        self._raw_bar = self._io.read_bytes(100500)
        io = KaitaiStream(BytesIO(self._raw_bar))
        self.bar = self._root.SomethingElse(io, self, self._root)
        self.baz = self._io.read_u4le()

I'm suggesting that we could generate something like this in the constructor:

        self.foo = self._io.read_u4le()
        self._ofs_bar = self._io.pos() # remember position
        self._io.seek(self._io.pos() + 100500) # just skip over this many bytes
        self.baz = self._io.read_u4le()

and, in addition, add a special lazy getter which is very similar to what's generated now for instances:

    @property
    def bar(self):
        if hasattr(self, '_m_bar'):
            return self._m_bar if hasattr(self, '_m_bar') else None

        _pos = self._io.pos()
        self._io.seek(self._ofs_bar) # that position we remembered previously
        self._raw__m_bar = self._io.read_bytes(100500)
        io = KaitaiStream(BytesIO(self._raw__m_bar))
        self._m_bar = self._root.SomethingElse(io, self, self._root)
        self._io.seek(_pos)
        return self._m_bar if hasattr(self, '_m_bar') else None

The only line which is special here is the commented one. No need to change any runtimes, works with every target.
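For anyone who wants to try the idea in isolation, below is a self-contained sketch. MiniStream is a hypothetical stand-in that mimics only the few KaitaiStream calls used above, BAR_SIZE replaces 100500 to keep the sample data small, and the lazy field is returned as raw bytes instead of being parsed into a SomethingElse object.

```python
import struct
from io import BytesIO


class MiniStream:
    """Hypothetical stand-in for KaitaiStream with just the calls used above."""

    def __init__(self, io):
        self._io = io

    def pos(self):
        return self._io.tell()

    def seek(self, n):
        self._io.seek(n)

    def read_bytes(self, n):
        return self._io.read(n)

    def read_u4le(self):
        return struct.unpack("<I", self._io.read(4))[0]


BAR_SIZE = 8  # stands in for the 100500 of the example


class Example:
    def __init__(self, io):
        self._io = io
        self.foo = self._io.read_u4le()
        self._ofs_bar = self._io.pos()            # remember position
        self._io.seek(self._io.pos() + BAR_SIZE)  # skip the lazy field
        self.baz = self._io.read_u4le()

    @property
    def bar(self):
        if not hasattr(self, "_m_bar"):
            _pos = self._io.pos()
            self._io.seek(self._ofs_bar)
            # A real getter would wrap these bytes in a substream and parse them.
            self._m_bar = self._io.read_bytes(BAR_SIZE)
            self._io.seek(_pos)                   # restore stream position
        return self._m_bar


data = struct.pack("<I", 1) + b"LAZYBODY" + struct.pack("<I", 2)
obj = Example(MiniStream(BytesIO(data)))
```

After construction, obj.foo and obj.baz are available immediately, while the bytes of bar are only read on first access to obj.bar and the stream position is restored afterwards.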

@KOLANICH
Author

KOLANICH commented Mar 17, 2017 via email

@GreyCat
Member

GreyCat commented Mar 17, 2017

1 We can make all the fields properties and not use read_* directly in the constructor, relying on properties instead (their offsets and sizes should already be known prior to getter invocation).

You're implying that we should make all seq attributes lazy? That's not really optimal, in many cases just reading them directly would be much faster than all that fiddling with backing variables, checks, seeks, etc.

2 We can introduce setters for primitive types right now. Since we don't have a symbolic engine for now, using setters can be dangerous (it can violate constraints), but it can be made available right now (if we had writing functions) for use at your own risk.

Sure, feel free to go ahead and plan / implement write_* methods — they would be helpful in future ;)

@KOLANICH
Author

KOLANICH commented Mar 17, 2017

You're implying that we should make all seq attributes lazy?

Yes. This will also give us some kind of serialization.

That's not really optimal, in many cases just reading them directly would be much faster than all that fiddling with backing variables, checks, seeks, etc.

Yes, this will be slower. I see no way to solve this problem in Python: numbers are value types (immutable) there, and one cannot get a reference to them even using bytecode. So I guess this trade-off should be tweakable via compilation flags.

@arekbulski
Member

I agree with @GreyCat. Lazy parsing is a separate issue and memory-mapped IO is unrelated.
