
Question: Instruct kaitai to read the stream but not to store byte array #867

Closed
kmrsh opened this issue Mar 27, 2021 · 4 comments

@kmrsh

kmrsh commented Mar 27, 2021

We have a situation where a large number of records (and hence a large file size) come inside a single file. We only need to process two types of events and must ignore the other types. We can simply provide an "unsupported" Kaitai Struct type with the size, but it still stores those bytes as a byte array in memory. This is huge in our case. Is there a way to tell Kaitai Struct to advance the stream by the given size, but not to store those bytes in an instance variable?

ex: let's say we have the following 2 events, and one we don't care about:

  t_event_A_body:
    params:
      - id: body_length
        type: u2
    seq:
      - id: data                      # We need this data so 'data' property should have the byte array of size 'body_length'
        size: body_length         

  t_event_B_body:
    params:
      - id: body_length
        type: u2
    seq:
      - id: data                       # We need this data so 'data' property should have the byte array of size 'body_length'
        size: body_length          

  t_event_unsupported_body:
    params:
      - id: body_length
        type: u2
    seq:
      - id: data
        size: body_length         # Now here we don't care about this data and we need 'data' to be empty to save memory but stream should advance to next event record using the size ('body_length') provided, so that it correctly parse the subsequent events
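The requested behavior can be sketched outside Kaitai with a plain Python stream (a hand-written illustration, not Kaitai-generated code; the record layout and `SUPPORTED_TYPES` set are hypothetical): supported bodies are read and kept, unsupported bodies are skipped with a seek so their bytes are never materialized.

```python
import io
import struct

# Hypothetical record layout: 1-byte type, 2-byte little-endian body
# length, then the body itself. Types 0 and 1 are the "supported"
# events; anything else is skipped without keeping its bytes.
SUPPORTED_TYPES = {0, 1}

def parse_events(stream):
    events = []
    while True:
        header = stream.read(3)
        if len(header) < 3:
            break  # end of stream
        ev_type, body_length = struct.unpack("<BH", header)
        if ev_type in SUPPORTED_TYPES:
            body = stream.read(body_length)  # keep the bytes we care about
        else:
            stream.seek(body_length, io.SEEK_CUR)  # advance, store nothing
            body = None
        events.append((ev_type, body))
    return events

# One supported record (type 0, body b"abc") and one unsupported (type 9).
data = bytes([0]) + struct.pack("<H", 3) + b"abc" \
     + bytes([9]) + struct.pack("<H", 2) + b"zz"
print(parse_events(io.BytesIO(data)))  # [(0, b'abc'), (9, None)]
```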
@KOLANICH

I guess it can be done with pos instances (pos instances only have memory copied into their own buffer before parsing if size is present), but it is a dirty, hacky and incorrect solution.

meta:
  id: format
seq:
  - type: t_event
    repeat: eos

types:
  t_event:
    seq:
      - id: type
        type: u1
      - id: body_length
        type: u2
      - id: body
        type:
          switch-on: type
          ...
        if: is_supported
    instances:
      is_supported:
        value: type != type::unsupported
      next:
        pos: _io.pos + body_length
        type: format
        if: not is_supported

So for the last item in the seq you would have to check whether it is_supported, and if it is not, descend into next.
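The key property of this workaround is laziness: a pos instance is only evaluated on access, so the unsupported body is never read, only seeked over. A rough plain-Python analogue (hand-rolled sketch, not Kaitai output; the `Event` class and `UNSUPPORTED` marker are invented for illustration):

```python
import io

class Event:
    """Mimics the pos-instance trick: for an unsupported event we only
    remember where the next record starts; `next` is parsed lazily on
    access, like a Kaitai `instances` entry."""

    UNSUPPORTED = 0xFF

    def __init__(self, stream):
        self._io = stream
        self.type = stream.read(1)[0]
        self.body_length = int.from_bytes(stream.read(2), "little")
        self.is_supported = self.type != self.UNSUPPORTED
        if self.is_supported:
            self.body = stream.read(self.body_length)
        else:
            self.body = None
            # remember the offset of the next record instead of reading
            self._next_pos = stream.tell() + self.body_length

    @property
    def next(self):
        # evaluated only on access (and only meaningful for skipped
        # events in this sketch), like a Kaitai pos instance
        self._io.seek(self._next_pos)
        return Event(self._io)

# An unsupported record (body b"xx" never read) followed by a supported one.
data = bytes([0xFF, 2, 0]) + b"xx" + bytes([1, 1, 0]) + b"a"
first = Event(io.BytesIO(data))
print(first.body, first.next.type, first.next.body)  # None 1 b'a'
```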

BTW, what is your format?

@kmrsh

kmrsh commented Mar 27, 2021

@KOLANICH We have several proprietary formats from the telecommunications industry. The files are fairly large, around 100 MB, and the main issue is the velocity of the incoming data we have to process. Usually a given application cares only about certain record types in these files, so most of the time it needs only about 10% of the data; the other 90% is not required (though it may be relevant in a different process). The issue is that we could keep the memory footprint very low and process these files efficiently if there were a way to ignore these unused bytes during parsing. Note that records are flowing at around 10K-50K per second, or even more.

I think this would be very simple to implement, if Kaitai can support it in a future release. Say we defined something like this:

  t_event_unsupported_body:
    params:
      - id: body_length
        type: u2
    seq:
      - id: data
        size: body_length
        ignore: true   # introduce something like this

Then simply, in the generated code

// instead of the following
// this.data = this._io.readBytes((bodyLength()));
// we simply use this
this._io.readBytes((bodyLength()));

Even now we can modify the generated code to do this, but it is not the best option.
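One nuance worth noting (my observation, not from the thread): calling `readBytes` without assigning the result still allocates a temporary array; seeking past the bytes avoids even that transient allocation. A plain-Python sketch of the two variants such an `ignore: true` key could generate:

```python
import io

stream = io.BytesIO(b"\x00" * 1024)

# Variant 1: read and discard. The 1024 bytes are still allocated
# transiently, just not stored in a field.
stream.seek(0)
_ = stream.read(1024)

# Variant 2: seek past the bytes. No allocation at all; the stream
# still ends up positioned at the next record.
stream.seek(0)
stream.seek(1024, io.SEEK_CUR)

print(stream.tell())  # 1024
```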

@KOLANICH

KOLANICH commented Mar 27, 2021

we have several formats related to telecommunication industry, these are proprietary formats and files are fairly large, around 100mb and main issue is the velocity of the incoming data that we have to process.

Is it DPI?

Issue is we can keep the memory footprint very low and process these files efficiently

I don't think KS suits this currently. I tried to parse a 2 GiB file with KS (a Qt Installer Framework installer; just an index pointing to 7z archives within the file, which were not parsed, since I only needed their offsets and sizes to generate dd commands that dump them as separate files) and got a 12 GiB memory footprint with the C++ runtime, even after I had manually patched the generated source to discard everything possible. One important detail, though: I did this on Linux and haven't tested on Windows (the whole spec was motivated by the need to unpack QIF-based installers on Linux, since they didn't work properly in Wine), so it may be yet another manifestation of bug 12309.

I think this is very simple implementation to do, if kaitai can support this in future releases,
say we defined something like this

This has been requested for a long time: see #88 and #525. You may also find #65 useful. But all of these have been stalled for a long time; the leading devs have too little time, and I don't want to touch Scala. Please consider implementing them yourself.

@kmrsh

kmrsh commented Mar 27, 2021

Thanks for the information. So far Kaitai is performing well in our scenario; I just wanted to check whether there was something I had missed. Worst case, we could post-process the generated code to remove the unused byte-array storage. I will certainly check the issues you mentioned. Thanks a lot for taking the time to answer these questions, I really appreciate it, and also the effort to make such a great framework for parsing binary files.

@kmrsh kmrsh closed this as completed Mar 27, 2021