TODO: save the search result into a serializing binary file for fast downstream parsing #40

shenwei356 · 2023-09-04T01:32:04Z

The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.

kmcp search: a flag -b/--binary-output would be added to choose the output format optionally.
A new command kmcp view should be added to convert the binary to plain text format.
kmcp merge needs to be compatible with both plain and binary formats.
kmcp profile needs to be compatible with both plain and binary formats.
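To make the idea of a compact record concrete, here is a minimal Python sketch of how one row of the search result could be serialized. The layout (length-prefixed strings followed by fixed-width numeric columns), the helper names, and the chosen integer and float widths are all assumptions for illustration, not the format kmcp uses or will use.

```python
import struct

# Hypothetical little-endian layout for the 13 numeric columns, in this order:
# qLen, qKmers (uint32), FPR (float64), hits, chunkIdx, chunks, tLen, kSize,
# mKmers (uint32), qCov, tCov, jacc (float32), queryIdx (uint32).
NUMERIC = struct.Struct("<IIdIIIIIIfffI")


def pack_str(s):
    """Encode a string as a uint16 byte length followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data


def unpack_str(buf, offset):
    """Decode a length-prefixed string; return it and the next offset."""
    (n,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    return buf[start:start + n].decode("utf-8"), start + n


def encode_record(fields):
    """Encode the 15 columns of one tab-delimited line into bytes."""
    nums = (
        int(fields[1]), int(fields[2]), float(fields[3]), int(fields[4]),
        int(fields[6]), int(fields[7]), int(fields[8]), int(fields[9]),
        int(fields[10]), float(fields[11]), float(fields[12]),
        float(fields[13]), int(fields[14]),
    )
    # The two string columns (query, target) go first, then the numbers.
    return pack_str(fields[0]) + pack_str(fields[5]) + NUMERIC.pack(*nums)


def decode_record(buf, offset=0):
    """Decode one record; return (query, target, *numbers) and next offset."""
    query, offset = unpack_str(buf, offset)
    target, offset = unpack_str(buf, offset)
    nums = NUMERIC.unpack_from(buf, offset)
    return (query, target) + nums, offset + NUMERIC.size


line = ("read_1\t150\t130\t7.4626e-15\t1\tGCF_000007805.1\t2\t10\t6397126"
        "\t21\t130\t1.0000\t0.0002\t0.0002\t0")
blob = encode_record(line.split("\t"))
record, end = decode_record(blob)
```

A reader such as the proposed kmcp view could walk a file by calling a decoder like this repeatedly with the returned offset. Storing qCov, tCov and jacc as float32 loses a little precision, which may or may not be acceptable for profiling.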

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
read_1	150	130	7.4626e-15	1	GCF_000007805.1	2	10	6397126	21	130	1.0000	0.0002	0.0002	0
read_2	150	130	7.4626e-15	1	GCF_000007805.1	8	10	6397126	21	130	1.0000	0.0002	0.0002	1
read_3	150	130	7.4626e-15	1	GCF_000003835.1	8	10	12115052	21	130	1.0000	0.0001	0.0001	2
read_4	150	130	7.4626e-15	1	GCF_000003835.1	3	10	12115052	21	130	1.0000	0.0001	0.0001	3


ericvdtoorn · 2023-09-04T06:20:32Z

> The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.
>
> kmcp search: a flag -b/--binary-output would be added to choose the output format optionally.

Would it not be better to infer from the output extension which is usually specified? Make it a .kmcp file or something similar.
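As a rough illustration of extension-based selection, the following Python sketch picks the output format from the file name. The function name and the `.kmcp` extension are hypothetical (the extension is only the one floated in this comment), and the real tool would need to decide how to treat stdout and compressed outputs.

```python
import os


def infer_output_format(path):
    """Return "binary" for a hypothetical .kmcp extension, else "text"."""
    # "-" (stdout) and any other extension fall back to plain text.
    ext = os.path.splitext(path.lower())[1]
    return "binary" if ext == ".kmcp" else "text"
```

With this rule, `result.kmcp` would be written as binary, while `result.tsv.gz` or `-` would stay plain text.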

ericvdtoorn · 2023-09-04T06:23:49Z

|#query|qLen|qKmers|FPR |hits|target |chunkIdx|chunks|tLen |kSize|mKmers|qCov |tCov |jacc |queryIdx|
|:-----|:---|:-----|:---------|:---|:--------------|:-------|:-----|:-------|:----|:-----|:-----|:-----|:-----|:-------|
|read_1|150 |130 |7.4626e-15|1 |GCF_000007805.1|2 |10 |6397126 |21 |130 |1.0000|0.0002|0.0002|0 |
|read_2|150 |130 |7.4626e-15|1 |GCF_000007805.1|8 |10 |6397126 |21 |130 |1.0000|0.0002|0.0002|1 |
|read_3|150 |130 |7.4626e-15|1 |GCF_000003835.1|8 |10 |12115052|21 |130 |1.0000|0.0001|0.0001|2 |
|read_4|150 |130 |7.4626e-15|1 |GCF_000003835.1|3 |10 |12115052|21 |130 |1.0000|0.0001|0.0001|3 |

Empirically, few of these fields would require an int64 (at least none came close to the int32 limit in a practical file), so that could also be a potential space saving.

Edit: meant that int32 would probably be enough rather than int64
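To put a number on that, here is a small Python sketch that packs the integer columns of the read_1 row as 64-bit and as 32-bit values. Which columns are treated as integers, and the packing itself, are assumptions for illustration only.

```python
import struct

# Integer columns of the read_1 row shown above:
# qLen, qKmers, hits, chunkIdx, chunks, tLen, kSize, mKmers, queryIdx
values = [150, 130, 1, 2, 10, 6397126, 21, 130, 0]

# Every value must fit in a signed 32-bit integer for the narrow layout.
fits_int32 = all(-2**31 <= v < 2**31 for v in values)

as_int64 = struct.pack("<9q", *values)  # 8 bytes per column
as_int32 = struct.pack("<9i", *values)  # 4 bytes per column

print(len(as_int64), len(as_int32))
```

For these nine columns the 32-bit layout halves the size, from 72 to 36 bytes per record. A column such as queryIdx could in principle outgrow 32 bits for extremely large read sets, so a real format might keep a wider type there.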

shenwei356 · 2023-09-04T07:53:05Z

> Would it not be better to infer from the output extension which is usually specified? Make it a .kmcp file or something similar.

Yes, we can make the binary format the default output, and make the plain text format optional.

> Empirically, few of these fields would require an int32 (at least none were close in a practical file) so that could also be potential space saving

Right. I'll carefully consider it later. Thank you.
