Add new pairwise indexed PAF adapter format with CLI creation workflow #3859
Some background on this PR: there are ongoing scalability concerns (e.g. #2788). It may also help protocol-ize our synteny workflows better, à la the current protocols paper. Making sure the protocol can handle any size of genome is good; otherwise we have to explain the limits more, something the JBrowse 2 paper did not do. It is also a response to feedback received in person at ISMB, where a user said they were not sure why some PAF tracks took a long time.
I came up with two possible ideas that could create a "single file" that contains all the needed information
The output has both the query-wise and the target-wise sorting in a single file, with a literal prefix, "querywise" or "targetwise", prepended to all the refnames. Another prefix for the overview could even be added. Preparing this may require more code than the simple 6 line bash script at the top, but it would "reduce the mental overhead of juggling so many files to prepare the track", which may be worth it.
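A minimal sketch of the single-file idea, written in Python rather than bash. It assumes the 12 mandatory PAF columns (query name, length, start, end, strand, target name, length, start, end, matches, block length, mapping quality). It also assumes the target-wise copy swaps the query and target column groups, so that the indexed sequence always sits in columns 1, 3 and 4. The function names and that column layout are illustrative and not necessarily what the script does.

```python
from typing import Iterable, Iterator, List, Tuple

QUERY_PREFIX = "querywise"    # prefixes as named in the comment above
TARGET_PREFIX = "targetwise"


def pairwise_rows(paf_lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield two rows per PAF record: one keyed by query, one keyed by target."""
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            raise ValueError(f"expected at least 12 PAF columns, got {len(cols)}")
        q, strand, t, rest = cols[0:4], cols[4], cols[5:9], cols[9:]
        # Query-wise copy: the query is already in columns 1-4.
        yield [QUERY_PREFIX + q[0], *q[1:], strand, *t, *rest]
        # Target-wise copy (assumed layout): swap the two column groups.
        yield [TARGET_PREFIX + t[0], *t[1:], strand, *q, *rest]


def sort_key(row: List[str]) -> Tuple[str, int]:
    # Roughly the ordering of `sort -k1,1 -k3,3n` in the C locale:
    # prefixed sequence name first, then numeric start.
    return (row[0], int(row[2]))


def to_pairwise_paf(paf_lines: Iterable[str]) -> List[str]:
    """Return tab-joined rows, sorted so each prefix forms its own block."""
    rows = sorted(pairwise_rows(paf_lines), key=sort_key)
    return ["\t".join(r) for r in rows]
```

Because every row carries its prefix in column 1, a single tabix index over columns 1, 3 and 4 would cover both query spaces. A reader of the target-wise rows would need to swap the column groups back before treating them as ordinary PAF.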
Implemented the strategy mentioned above, where both query and target are in a single file. It uses a script from https://github.com/cmdcolin/pairwise_indexed_paf, which could probably be added to the jbrowse CLI tools.
Created a new command "jbrowse process-paf" with the intended usage being something like
or, can pipe minimap2 output
I thought the file extension ppaf might help distinguish this "processed PAF" from the source one. It gives the file a custom extension that add-track can use to select the specialized PairwiseIndexedPAFAdapter. The ppaf is still PAF format, but the information is duplicated into separate "tabix query spaces" (one sorted by query, one sorted by target).
Note: the above set of ~6 commands could potentially be automated too, but that also adds complexity to our program.
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3859      +/-   ##
==========================================
+ Coverage   63.40%   63.43%   +0.03%
==========================================
  Files        1057     1061       +4
  Lines       30787    30831      +44
  Branches     7332     7356      +24
==========================================
+ Hits        19520    19559      +39
- Misses      11094    11100       +6
+ Partials      173      172       -1
I refreshed this branch, as I think it has major benefits for visualizing PAF, especially for use cases like viewing just the alignment in a small region, as @scottcain often does when loading 6 or more synteny tracks of a small region. This can probably drastically reduce memory consumption and make many synteny use cases more palatable. Whole genome overviews are not addressed by this yet and would still load a large amount of data, but the use case of loading just the synteny/alignment data for a region can be much improved by this PR.
Also updated some of the previously posted comments with CLI usage, etc.
I changed the proposed file extension from ppaf to pif; a nice 3 letter extension is perhaps a little less confusing. I also added a new CLI command, "jbrowse make-pif file.paf", which outputs file.pif.gz and file.pif.gz.tbi. An --out flag can optionally be supplied too.
The new "jbrowse make-pif" also runs the sort, bgzip and tabix commands automatically, so it should streamline usage.
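The three steps that make-pif is described as running can be sketched as plain command lines. This is not the CLI's actual implementation: the helper names `pif_commands` and `run_pif_pipeline` are hypothetical, and the column numbers passed to `sort` and `tabix` assume the indexed sequence name is in column 1 with start and end in columns 3 and 4. Building the argument lists needs nothing installed; running them requires `sort`, `bgzip` and `tabix` on the PATH.

```python
import subprocess
from typing import List


def pif_commands(pif_path: str) -> List[List[str]]:
    """Return the sort, bgzip and tabix invocations for an uncompressed PIF file."""
    gz_path = pif_path + ".gz"
    return [
        # Sort in place by (prefixed) sequence name, then numerically by start.
        ["sort", "-t", "\t", "-k1,1", "-k3,3n", "-o", pif_path, pif_path],
        # Block-compress so tabix can seek into the file; writes pif_path + ".gz".
        ["bgzip", "-f", pif_path],
        # Index: name in column 1, start in column 3, end in column 4,
        # with 0-based start coordinates as in PAF.
        ["tabix", "-0", "-s", "1", "-b", "3", "-e", "4", gz_path],
    ]


def run_pif_pipeline(pif_path: str) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in pif_commands(pif_path):
        subprocess.run(cmd, check=True)
```

For a hypothetical input `file.pif`, the last step would leave `file.pif.gz` and `file.pif.gz.tbi` next to each other, matching the outputs described above.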
This proposes a new way of loading larger synteny datasets by pairwise tabix indexing a PAF file "query-wise" and "target-wise"
It requires a little bit of a command line setup but it enables much faster (and potentially more scalable) loading of synteny data
Example, converting the human vs mouse chain file to the pairwise indexed format
Session data size difference
The gzipped chain file, which has to be loaded into memory up front, is hs1ToMm39.over.chain.gz: 69MB gzipped, 219MB ungzipped.
With this branch, only 3.9MB of bgzip data is downloaded for a session share link that looks at a large region of chr1 vs chr1: http://localhost:3000/?session=share-9-Dj3li4oS&config=test_data%2Fhs1_vs_mm39%2Fconfig.json&password=4vvbF
So it needs about 5.7% of the data of the gzipped chain file (3.9MB vs 69MB), a considerable reduction (~18x less).
Considerations
For the whole genome overview, we could maybe create a reduced PAF (e.g. with CIGAR stripped) that it semantically switches to.
Note that the meanQueryIdentity coloring mode may need to be pre-computed, since it requires global information to calculate the percent identity (it aggregates the mapping quality across all the pieces of the query sequence, even if the query is split into multiple PAF lines).
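One way such a pre-computation could look, as a sketch only. It assumes identity is derived from PAF column 10 (number of matching bases) and column 11 (alignment block length), summed over every line that shares a query name. Which columns the meanQueryIdentity mode actually uses, and how it weights them, is not shown here, so treat the formula as an assumption. The function name `mean_identity_by_query` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def mean_identity_by_query(paf_lines: Iterable[str]) -> Dict[str, float]:
    """Aggregate matches / block length per query name across all PAF lines."""
    matches: Dict[str, int] = defaultdict(int)
    block_len: Dict[str, int] = defaultdict(int)
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed lines
        query = cols[0]
        matches[query] += int(cols[9])     # column 10: matching bases
        block_len[query] += int(cols[10])  # column 11: alignment block length
    # Queries whose block lengths sum to zero are left out to avoid dividing by zero.
    return {q: matches[q] / block_len[q] for q in matches if block_len[q] > 0}
```

The point of doing this ahead of time is that a region query only returns the lines overlapping that region, so a per-query value computed over the whole file would have to be stored alongside each line (for example as an extra column or tag) to be available at render time.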