Add new pairwise indexed PAF adapter format with CLI creation workflow #3859

cmdcolin · 2023-08-15T20:27:44Z

This proposes a new way of loading larger synteny datasets: tabix indexing a PAF file pairwise, both "query-wise" and "target-wise"

It requires a little bit of command line setup, but it enables much faster (and potentially more scalable) loading of synteny data

Example, converting the human vs mouse chain file to the pairwise indexed format

Session data size difference

The gzipped chain file, which has to be loaded into memory up front, is hs1ToMm39.over.chain.gz: 69MB gzipped, 219MB ungzipped.

With this branch, only 3.9MB of bgzip data is downloaded for a session share link that looks at a large region of chr1 vs chr1 http://localhost:3000/?session=share-9-Dj3li4oS&config=test_data%2Fhs1_vs_mm39%2Fconfig.json&password=4vvbF

So, it is roughly 5-6% of the data needed, a considerable reduction (~18x less than the gzipped chain file)

Considerations

for the whole genome overview, we could maybe create a reduced PAF (e.g. strip CIGAR) that it semantically switches to

note that the meanQueryIdentity coloring mode may need to be pre-computed, since it requires global information to calculate the percent identity (it aggregates the mapping quality across all the pieces of the query sequence, even if it's split into multiple PAF lines)
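To illustrate why this needs global information, here is a minimal Python sketch of a per-query aggregate. It is only a sketch under stated assumptions: the function name `mean_query_identity` is hypothetical, and it uses matching bases divided by alignment block length (PAF columns 10 and 11) as a stand-in for percent identity, which may not be the exact quantity the coloring mode uses.

```python
from collections import defaultdict


def mean_query_identity(paf_lines):
    """Aggregate an identity value per query name across ALL of its PAF lines.

    Assumption (not taken from the PR): identity is approximated as
    sum(matching bases) / sum(alignment block length) per query.
    """
    matches = defaultdict(int)
    block_len = defaultdict(int)
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip lines without the 12 mandatory PAF columns
        qname = cols[0]
        matches[qname] += int(cols[9])     # PAF column 10: number of matching bases
        block_len[qname] += int(cols[10])  # PAF column 11: alignment block length
    return {q: matches[q] / block_len[q] for q in matches if block_len[q] > 0}


example = [
    "chr1\t1000\t0\t100\t+\tchrA\t2000\t0\t100\t90\t100\t60",
    "chr1\t1000\t500\t700\t+\tchrB\t2000\t0\t200\t100\t200\t60",
    "chr2\t800\t0\t50\t+\tchrA\t2000\t400\t450\t50\t50\t60",
]
identity = mean_query_identity(example)
```

Because the value for chr1 depends on both of its lines, a region query that only fetches one of them could not reproduce it, which is the reason pre-computing may be needed.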

cmdcolin · 2023-08-15T21:18:50Z

random background to PR

ongoing scalability concerns, e.g. #2788

may help protocol-ize our synteny better a la current protocols paper (e.g. making sure the protocol can handle any size genome is good, otherwise have to explain limits more, something the jbrowse 2 paper did not do)

also in response to feedback received in person at ismb where a user said they were not sure why some paf tracks took a long time

cmdcolin · 2023-08-23T13:18:58Z

I came up with two possible ideas that could create a "single file" that contains all the needed information

1. in order to create an "overview" (whole genome dotplot) we can strip the CIGAR, but that will inaccurately "bridge across" large insertions and deletions. to remedy this we can, as a pre-processing script, scan across the CIGAR string, and if we encounter an insertion or deletion that "would be visible" at a large overview scale (say, 100,000bp just for example, but it could be dynamically calculated), then we split the feature into two: one going from start of feature to start of deletion, another going from after deletion to end of feature. this allows us to create an "overview" paf that is not as lossy as completely stripping the CIGAR
2. the configuration for this track currently ends up producing many files: already 4 files (query.paf.gz, query.paf.gz.tbi, target.paf.gz, target.paf.gz.tbi) without the overview paf, which would make ~5 (the overview probably wouldn't be tabix indexed since it is downloaded in full). this is a lot of files to maintain for a 'single synteny track'. instead, we could make a single tabix file that has a specialized encoding. (example screenshot illustrating the encoding)

the output has both the querywise and the targetwise sorting in a single file, with a literal prefix called "querywise" or "targetwise" prepended to all the refnames. another prefix for the overview could even be added. preparing this may require more code than the simple 6 line bash script at the top, but it would "reduce the mental overhead of juggling so many files to prepare the track" which may be worth it
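A rough Python sketch of the first idea (splitting a feature wherever the CIGAR has a large insertion or deletion). Assumptions, none of which come from the PR code: a forward-strand alignment, a minimap2-style CIGAR string as found in the cg:Z: tag, and a hypothetical helper name `split_at_large_indels`. The 100,000bp threshold is just the example value mentioned above.

```python
import re

# Only operators that move along the query and/or target are considered here.
CIGAR_RE = re.compile(r"(\d+)([MIDN=X])")


def split_at_large_indels(qstart, tstart, cigar, min_indel=100_000):
    """Return (qstart, qend, tstart, tend) segments, cut at every indel >= min_indel.

    Assumes a forward-strand alignment; a reverse-strand record would need
    the query coordinate walked in the opposite direction.
    """
    segments = []
    q, t = qstart, tstart          # current position on query / target
    seg_q, seg_t = qstart, tstart  # where the segment being built started
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "ID" and n >= min_indel:
            # Close the current segment just before the large indel.
            if q > seg_q or t > seg_t:
                segments.append((seg_q, q, seg_t, t))
            if op == "I":
                q += n  # insertion consumes query only
            else:
                t += n  # deletion consumes target only
            seg_q, seg_t = q, t  # next segment starts after the indel
        elif op in "M=X":
            q += n
            t += n
        elif op == "I":
            q += n
        else:  # small D, or N
            t += n
    if q > seg_q or t > seg_t:
        segments.append((seg_q, q, seg_t, t))
    return segments


split = split_at_large_indels(0, 0, "100M200000D50M")
unsplit = split_at_large_indels(0, 0, "100M5D50M")
```

Small indels stay inside a single segment, so the overview would only gain extra features where the gap is large enough to be visible.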

cmdcolin · 2023-08-30T13:16:25Z

implemented the strategy mentioned above where both query and target are in a single file. used a script made here https://github.com/cmdcolin/pairwise_indexed_paf which could be added to the jbrowse cli tools probably

cmdcolin · 2023-09-08T19:34:14Z

Created a new command "jbrowse process-paf" with the intended usage being something like

minimap2 grape.fa peach.fa > out.paf
jbrowse process-paf out.paf | sort -k1,1 -k3,3n | bgzip > out.sorted.ppaf.gz
tabix -s1 -b3 -e4 out.sorted.ppaf.gz
jbrowse add-track out.sorted.ppaf.gz -a peach,grape

or, can pipe minimap2 directly

minimap2 grape.fa peach.fa| jbrowse process-paf | sort -k1,1 -k3,3n |bgzip> out.sorted.ppaf.gz
tabix -s1 -b3 -e4 out.sorted.ppaf.gz
jbrowse add-track out.sorted.ppaf.gz -a peach,grape

I thought the file extension ppaf may help distinguish this "processed PAF" from the source one. It gives add-track a custom file extension to key on when selecting the specialized PairwiseIndexedPAFAdapter. The ppaf is still PAF format, but the info is duplicated into separate "tabix query spaces" (one with query sorting, one with target sorting)
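As an illustration of the "duplicated into separate tabix query spaces" idea, here is a minimal Python sketch, not the actual process-paf implementation. Assumptions: the function name `to_pairwise_rows` is hypothetical, the prefixes are prepended to the name with no separator, and the target-wise copy swaps the query and target columns so that `tabix -s1 -b3 -e4` indexes target coordinates for those rows. CIGAR tags are passed through untouched here; a real converter may need to adjust them for the swapped copy.

```python
def to_pairwise_rows(paf_line, q_prefix="querywise", t_prefix="targetwise"):
    """Return (query_wise_row, target_wise_row) for one PAF line.

    Both rows keep name/start/end in columns 1, 3 and 4, so a single
    tabix index built with -s1 -b3 -e4 covers both copies.
    """
    cols = paf_line.rstrip("\n").split("\t")
    qname, qlen, qstart, qend, strand, tname, tlen, tstart, tend = cols[:9]
    rest = cols[9:]  # matches, block length, mapq and any optional tags
    query_row = [q_prefix + qname, qlen, qstart, qend, strand,
                 tname, tlen, tstart, tend] + rest
    # Swap the query and target columns so the target is what gets indexed.
    target_row = [t_prefix + tname, tlen, tstart, tend, strand,
                  qname, qlen, qstart, qend] + rest
    return "\t".join(query_row), "\t".join(target_row)


line = "peach1\t1000\t10\t200\t+\tgrape1\t5000\t300\t490\t180\t190\t60"
q_row, t_row = to_pairwise_rows(line)
```

After sorting by column 1 and then column 3, all rows with one prefix end up grouped together, so a lookup by query region or by target region reads only its own prefix's part of the file.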

cmdcolin · 2023-09-08T19:35:26Z

note: potentially the above set of ~6 commands could also be automated, but that adds complexity to our program

codecov · 2023-09-11T19:33:16Z

Codecov Report

Attention: 66 lines in your changes are missing coverage. Please review.

Comparison is base (9f69cb5) 63.40% compared to head (b6a2d3a) 63.43%.

❗ Current head b6a2d3a differs from pull request most recent head 31d04e5. Consider uploading reports for the commit 31d04e5 to get more accurate results

Files	Patch %	Lines
...wiseIndexedPAFAdapter/PairwiseIndexedPAFAdapter.ts	0.00%	51 Missing ⚠️
...ntenyDisplay/components/LinearSyntenyRendering.tsx	0.00%	7 Missing ⚠️
products/jbrowse-cli/src/commands/add-track.ts	50.00%	3 Missing ⚠️
products/jbrowse-cli/src/commands/process-paf.ts	95.34%	2 Missing ⚠️
...ters/src/PairwiseIndexedPAFAdapter/configSchema.ts	50.00%	1 Missing ⚠️
...ve-adapters/src/PairwiseIndexedPAFAdapter/index.ts	75.00%	1 Missing ⚠️
plugins/comparative-adapters/src/index.ts	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3859      +/-   ##
==========================================
+ Coverage   63.40%   63.43%   +0.03%     
==========================================
  Files        1057     1061       +4     
  Lines       30787    30831      +44     
  Branches     7332     7356      +24     
==========================================
+ Hits        19520    19559      +39     
- Misses      11094    11100       +6     
+ Partials      173      172       -1


cmdcolin · 2023-12-05T02:09:14Z

I refreshed this branch, as I think it has major benefits for visualizing PAF, especially for use cases like viewing just the alignment in a small region, as @scottcain often does when loading up 6 or more synteny tracks of a small region. this can probably drastically reduce memory consumption, and make many synteny use cases more palatable

whole genome overviews are not addressed by this yet, still would load a large amount of data, but the use case for loading just the synteny/alignment data for a region can be much improved by this PR

cmdcolin · 2023-12-05T02:11:26Z

also updated some of the comments posted previously with cli usage, etc.


cmdcolin · 2023-12-11T15:34:00Z

I changed proposed file extension from ppaf to pif. I think it's a little less confusing perhaps to be a nice 3 letter extension

and then added a new CLI command called "jbrowse make-pif file.paf"

this will output file.pif.gz and file.pif.gz.tbi. optionally can supply a --out flag too

cmdcolin · 2023-12-11T15:34:26Z

the new "jbrowse make-pif" also runs the sort, bgzip and tabix commands automatically so it should streamline usage

github-actions bot added the needs label triage Needs a label to show in changelog (breaking, enhancement, bug, documentation, or internal) label Aug 15, 2023

cmdcolin force-pushed the pairwise_paf branch from 3c6d507 to c91b1a9 Compare August 15, 2023 20:28

cmdcolin added enhancement New feature or request and removed needs label triage Needs a label to show in changelog (breaking, enhancement, bug, documentation, or internal) labels Aug 15, 2023

cmdcolin force-pushed the pairwise_paf branch from c91b1a9 to 251a4c9 Compare August 15, 2023 20:37

cmdcolin force-pushed the pairwise_paf branch from 6ec6652 to a25d752 Compare August 30, 2023 04:13

cmdcolin force-pushed the pairwise_paf branch 2 times, most recently from 4e9edfc to 162eda7 Compare September 8, 2023 19:29

cmdcolin force-pushed the pairwise_paf branch 2 times, most recently from 33968af to a7a9c49 Compare September 11, 2023 19:07

cmdcolin force-pushed the pairwise_paf branch from a7a9c49 to ed02bb9 Compare September 18, 2023 13:46

cmdcolin mentioned this pull request Oct 24, 2023

JBrowse 2 roadmap 2024 #3974

Closed

cmdcolin force-pushed the pairwise_paf branch from ed02bb9 to af3c717 Compare October 24, 2023 05:28

cmdcolin force-pushed the pairwise_paf branch from af3c717 to e5968d5 Compare December 5, 2023 01:45

cmdcolin force-pushed the pairwise_paf branch from 03b511b to f26f7b0 Compare December 5, 2023 02:10

cmdcolin added 7 commits December 11, 2023 00:29

Testing

c3b594c

Yarn upgrade Misc Human vs mouse example Misc Misc Updates Misc Misc

Add jbrowse CLI command process-paf

66e372f

[skip ci] Update snaps

30e6396

Updates

d873d7b

Misc

3795f73

Mis

d28d55e

Rename ppaf->pif, rename process-paf to create-pif, add new create-pifgz

5bd7de7

cmdcolin added 3 commits December 11, 2023 09:18

Remove separation between create-pif and create-pifgz

b03f81e

Misc

522851f

Misc

1f3ca2d

cmdcolin force-pushed the pairwise_paf branch from b6a2d3a to 0ebb866 Compare December 11, 2023 15:30

Create the jbrowse pif command

87209ca

cmdcolin force-pushed the pairwise_paf branch from 0ebb866 to 87209ca Compare December 11, 2023 15:32

cmdcolin force-pushed the pairwise_paf branch from cd9d446 to 549ef8e Compare December 11, 2023 16:00

Add bgzip and tabix to testing

31d04e5

cmdcolin force-pushed the pairwise_paf branch from 549ef8e to 31d04e5 Compare December 11, 2023 16:09

cmdcolin merged commit e11b16b into main Dec 11, 2023
10 checks passed

cmdcolin changed the title ~~Pairwise indexed PAF adapter proposal~~ Add new pairwise indexed PAF adapter format with CLI creation workflow Dec 11, 2023

cmdcolin deleted the pairwise_paf branch December 11, 2023 16:46

cmdcolin mentioned this pull request Dec 13, 2023

v2.10.0 release announce #4131

Merged
