TODO: save the search result into a serializing binary file for fast downstream parsing #40

Open
shenwei356 opened this issue Sep 4, 2023 · 3 comments

Comments

@shenwei356
Owner

shenwei356 commented Sep 4, 2023

The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.

  1. kmcp search: a flag -b/--binary-output would be added to optionally choose the output format.
  2. A new command kmcp view should be added to convert the binary to plain text format.
  3. kmcp merge needs to be compatible with both plain and binary formats.
  4. kmcp profile needs to be compatible with both plain and binary formats.
#query qLen qKmers FPR hits target chunkIdx chunks tLen kSize mKmers qCov tCov jacc queryIdx
read_1 150 130 7.4626e-15 1 GCF_000007805.1 2 10 6397126 21 130 1.0000 0.0002 0.0002 0
read_2 150 130 7.4626e-15 1 GCF_000007805.1 8 10 6397126 21 130 1.0000 0.0002 0.0002 1
read_3 150 130 7.4626e-15 1 GCF_000003835.1 8 10 12115052 21 130 1.0000 0.0001 0.0001 2
read_4 150 130 7.4626e-15 1 GCF_000003835.1 3 10 12115052 21 130 1.0000 0.0001 0.0001 3
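
To make the proposal concrete, here is a minimal sketch of how one row of the tab-delimited result could be packed into a compact binary record and read back. The magic bytes, the field order, and the integer widths are all assumptions made for this illustration, not a format that kmcp uses or has committed to, and Python is used only for brevity.

```python
import struct

# Hypothetical magic bytes written once at the start of a binary result
# file, so a reader can tell binary results from plain text.
MAGIC = b"KMCPBIN1"

# Hypothetical fixed-width numeric part of one row (little-endian, unpadded):
# qLen, qKmers (uint32), FPR (float64), hits, chunkIdx, chunks (uint32),
# tLen (uint64), kSize (uint8), mKmers (uint32),
# qCov, tCov, jacc (float64), queryIdx (uint64).
NUMERIC = struct.Struct("<IIdIIIQBIdddQ")


def write_str(out, s):
    """Write a UTF-8 string prefixed with its length as a uint16."""
    data = s.encode("utf-8")
    out.write(struct.pack("<H", len(data)))
    out.write(data)


def read_str(inp):
    """Read a string written by write_str."""
    (n,) = struct.unpack("<H", inp.read(2))
    return inp.read(n).decode("utf-8")


def write_row(out, row):
    """Serialize one row given in the column order of the text format."""
    query, qlen, qkmers, fpr, hits, target, *rest = row
    write_str(out, query)
    write_str(out, target)
    out.write(NUMERIC.pack(qlen, qkmers, fpr, hits, *rest))


def read_row(inp):
    """Deserialize one row, returning it in the text format's column order."""
    query = read_str(inp)
    target = read_str(inp)
    qlen, qkmers, fpr, hits, *rest = NUMERIC.unpack(inp.read(NUMERIC.size))
    return (query, qlen, qkmers, fpr, hits, target, *rest)
```

A `kmcp view`-style converter would then amount to calling `read_row` in a loop and joining the fields with tabs. Repeated values such as the target name could be stored once in a lookup table instead of per row, but that is left out here.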
@ericvdtoorn

> The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.
>
> 1. kmcp search: a flag -b/--binary-output would be added to choose the output format optionally.

Would it not be better to infer from the output extension which is usually specified? Make it a .kmcp file or something similar.
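
A minimal sketch of that idea, assuming the binary format gets a `.kmcp` extension (only the example name floated here) and that a trailing `.gz` marks a compressed file. The function name and its return values are hypothetical, not part of kmcp.

```python
from pathlib import Path

# Hypothetical extension for the binary format.
BINARY_SUFFIX = ".kmcp"
# Assumed compression suffixes that may follow the format suffix.
COMPRESSION_SUFFIXES = {".gz"}


def infer_output_format(path):
    """Return "binary" or "text" based on the output file name."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Ignore a trailing compression suffix such as ".gz".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if suffixes and suffixes[-1] == BINARY_SUFFIX:
        return "binary"
    # Anything else, including "-" for stdout, falls back to plain text.
    return "text"
```

One design question this leaves open is what to do when the output goes to stdout, where there is no extension to inspect. An explicit flag would still be needed for that case.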

@ericvdtoorn

ericvdtoorn commented Sep 4, 2023

|#query|qLen|qKmers|FPR |hits|target |chunkIdx|chunks|tLen |kSize|mKmers|qCov |tCov |jacc |queryIdx|
|:-----|:---|:-----|:---------|:---|:--------------|:-------|:-----|:-------|:----|:-----|:-----|:-----|:-----|:-------|
|read_1|150 |130 |7.4626e-15|1 |GCF_000007805.1|2 |10 |6397126 |21 |130 |1.0000|0.0002|0.0002|0 |
|read_2|150 |130 |7.4626e-15|1 |GCF_000007805.1|8 |10 |6397126 |21 |130 |1.0000|0.0002|0.0002|1 |
|read_3|150 |130 |7.4626e-15|1 |GCF_000003835.1|8 |10 |12115052|21 |130 |1.0000|0.0001|0.0001|2 |
|read_4|150 |130 |7.4626e-15|1 |GCF_000003835.1|3 |10 |12115052|21 |130 |1.0000|0.0001|0.0001|3 |

Empirically, few of these fields would require an int64 (none came close to the int32 limit in a practical file), so using narrower integers could also save space.

Edit: I meant that int32 would probably be enough rather than int64.
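
As a quick check of that observation, the sketch below tests the integer columns of the four example rows against the signed int32 limit and compares the size of nine integers stored as int64 versus int32. The column grouping and the helper name are assumptions for illustration only.

```python
import struct

# Integer columns of the search result, in table order.
INT_COLUMNS = ("qLen", "qKmers", "hits", "chunkIdx", "chunks",
               "tLen", "kSize", "mKmers", "queryIdx")
INT32_MAX = 2**31 - 1

# Integer fields of the four example rows from this thread.
ROWS = [
    (150, 130, 1, 2, 10, 6397126, 21, 130, 0),
    (150, 130, 1, 8, 10, 6397126, 21, 130, 1),
    (150, 130, 1, 8, 10, 12115052, 21, 130, 2),
    (150, 130, 1, 3, 10, 12115052, 21, 130, 3),
]


def columns_exceeding_int32(rows):
    """Return the names of columns holding a value above INT32_MAX."""
    return [name for name, col in zip(INT_COLUMNS, zip(*rows))
            if max(col) > INT32_MAX]


size_int64 = struct.calcsize("<9q")  # nine 8-byte integers
size_int32 = struct.calcsize("<9i")  # nine 4-byte integers
```

For these rows no column exceeds the limit, and the integer part of a record halves in size. Columns that grow with the input, such as queryIdx (number of reads) or tLen (genome size), would be the ones to check on much larger datasets before narrowing them.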

@shenwei356
Owner Author

> Would it not be better to infer from the output extension which is usually specified? Make it a .kmcp file or something similar.

Yes, we can make the binary format the default output, and make the plain text format optional.
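
If both formats coexist, readers such as kmcp merge and kmcp profile need a way to tell them apart on input. One common approach, sketched here under the assumption that binary files start with some fixed magic bytes (the value below is made up), is to peek at the beginning of the stream without consuming it.

```python
import io

# Hypothetical magic bytes at the start of a binary result file.
MAGIC = b"KMCPBIN1"


def detect_format(stream):
    """Peek at a buffered binary stream and report "binary" or "text".

    The stream position is left unchanged, so the caller can hand the
    same stream to the matching parser afterwards.
    """
    head = stream.peek(len(MAGIC))[:len(MAGIC)]
    return "binary" if head == MAGIC else "text"
```

Because the check looks at content rather than the file name, it would also work for input arriving on stdin.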

> Empirically, few of these fields would require an int32 (at least none were close in a practical file) so that could also be potential space saving

Right. I'll carefully consider it later. Thank you.
