Improve scalability of synteny datasets #2788
Possible motivating example: gorilla vs hg38 (https://hgdownload.cse.ucsc.edu/goldenpath/hg38/vsGorGor3/). The chain file is 482MB, which unzips to almost 2GB, which is too large for interactive use. The Dot tool has "Load unique" and "Load repetitive" buttons, I think to avoid loading the entire thing at once, which could be one strategy (it requires preprocessing the .delta file from mummer with their python script).
Porting some comments from #3444:

**Scalability brainstorming**

The hs1 vs mm39 liftover.chain.gz file that we use for the SyntenyTrack is 69MB of gzip data, which is 219MB ungzipped. The maximum size of ungzipped data we support is 512MB, since that is the maximum string size in Chrome, so this file comes fairly close to our limits. I can certainly imagine species (plant genomes, etc.) that would exceed them.

An indexed file format could help us in some cases. We have not focused on indexed file formats so far because we were using somewhat small PAF files that could be loaded into memory, but scalability concerns are referenced here in #2788. With indexing, we may not need to download the entire file when accessing a local region on the LGV synteny track (currently, synteny track adapters generally download the entirety of the file; this adapter behavior could be adjusted).

The bigChain format from UCSC could possibly help as an example of an indexed file format, but it is only indexed in "one dimension", e.g. for the query genome and not the target genome, so accessing the data from the target genome's side would be unindexed. A custom tabix-style chain format could probably also be made, similar to mafviewer (see the sketch below). "2D" indexed formats would be cool, but may not be available. Also, "biologically", it may be better to have two tracks, "hs1 (query) vs mm39 (target)" and "mm39 (query) vs hs1 (target)", in which case 1D indexing is fine.
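As a rough illustration of the tabix-style idea, here is a minimal sketch assuming a PAF file that has been sorted on its target columns, bgzipped, and indexed with tabix. The file names, the sort/index commands, and the use of @gmod/tabix are illustrative assumptions, not an existing JBrowse adapter:

```ts
import { TabixIndexedFile } from '@gmod/tabix'

// Hypothetical preparation, indexing on the target-genome columns of the PAF:
//   sort -k6,6 -k8,8n hs1_vs_mm39.paf | bgzip > hs1_vs_mm39.paf.gz
//   tabix -0 -s 6 -b 8 -e 9 hs1_vs_mm39.paf.gz
// This gives a "1D" index, i.e. indexed for the target genome only.
const file = new TabixIndexedFile({
  path: 'hs1_vs_mm39.paf.gz',
  tbiPath: 'hs1_vs_mm39.paf.gz.tbi',
})

// Fetch only the PAF records overlapping the visible region, instead of
// downloading and parsing the whole 200MB+ file up front.
async function getAlignmentsInRegion(refName: string, start: number, end: number) {
  const lines: string[] = []
  await file.getLines(refName, start, end, line => lines.push(line))
  return lines.map(l => l.split('\t'))
}
```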
Here is an example of a human vs mouse dataset; it currently stretches the limits of our scalability and crashes some browsers: http://jbrowse.org/code/jb2/main/?config=test_data%2Fhuman_vs_mouse.json&session=share-cJ_p38pUpD&password=z4P70 uses a 68MB gzipped liftover file as its data source, which uncompresses to 219MB of data.
IndexedDB could also be considered, e.g. storing parsed alignment data in the browser's IndexedDB instead of keeping it all in memory (see the sketch below).
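A minimal sketch of that idea using the standard browser IndexedDB API; the database name, store layout, and bin-keyed scheme here are purely hypothetical:

```ts
// Hypothetical cache: parsed synteny features keyed by genome bin, so they
// survive page reloads and do not have to live in one giant in-memory string.
const DB_NAME = 'synteny-cache'
const STORE = 'features'

function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1)
    req.onupgradeneeded = () => req.result.createObjectStore(STORE)
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
}

async function putBin(binKey: string, features: unknown[]): Promise<void> {
  const db = await openDb()
  return new Promise((resolve, reject) => {
    const tx = db.transaction(STORE, 'readwrite')
    tx.objectStore(STORE).put(features, binKey)
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
}

async function getBin(binKey: string): Promise<unknown[] | undefined> {
  const db = await openDb()
  return new Promise((resolve, reject) => {
    const req = db.transaction(STORE).objectStore(STORE).get(binKey)
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
}
```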
Server-side helper: a server could automatically build optimized bidirectional indexes and multi-scale resolutions, and could potentially extend to graph-type genomes more readily (a sketch of the bidirectional-index idea follows below).
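One possible server-side preprocessing step for the bidirectional/"two tracks" idea: from one PAF, also emit the inverse PAF (query and target swapped), so that each orientation can be sorted and 1D-indexed on its own query. This is a sketch under that assumption, not an existing JBrowse tool:

```ts
import { createReadStream, createWriteStream } from 'fs'
import { createInterface } from 'readline'

// Hypothetical helper: write an "inverse" PAF so both "hs1 vs mm39" and
// "mm39 vs hs1" tracks can each be indexed on their own query genome.
async function writeInversePaf(inPath: string, outPath: string) {
  const out = createWriteStream(outPath)
  const rl = createInterface({ input: createReadStream(inPath) })
  for await (const line of rl) {
    const f = line.split('\t')
    // PAF columns: 0-3 query name/len/start/end, 4 strand, 5-8 target
    // name/len/start/end, 9-11 matches/block length/mapq. Both coordinate
    // pairs are given on their sequence's forward strand, so a plain swap
    // that keeps the strand column is still a valid alignment record.
    const swapped = [...f.slice(5, 9), f[4], ...f.slice(0, 4), ...f.slice(9, 12)]
    // NOTE: optional tags (e.g. a cg: CIGAR) would also need their I/D
    // operations exchanged; they are dropped here for simplicity.
    out.write(swapped.join('\t') + '\n')
  }
  out.end()
}
```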
Somewhat solved by PIF; can revisit for further use cases such as large zoom-outs.
Currently, synteny datasets are loaded into memory in their entirety. The Dot tool (https://dot.sandbox.bio/?coords=https://storage.googleapis.com/sandbox.bio/dot/gorilla_to_GRCh38.coords&index=https://storage.googleapis.com/sandbox.bio/dot/gorilla_to_GRCh38.coords.idx&annotations=https://storage.googleapis.com/sandbox.bio/dot/gencode.v27.genes.bed) has a method to separate unique and repetitive alignments via a pre-processing script, and we could consider loading that data format. Other tools like D-GENIES also filter out small alignments by default (see the sketch below for what such a filter might look like).
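A minimal sketch of that kind of default filtering, assuming PAF input; the 5 kb threshold is an arbitrary illustrative choice and does not reproduce D-GENIES's actual cutoff logic:

```ts
// Drop tiny alignments before loading/rendering, the way dotplot tools often
// do by default. PAF column 11 (index 10) is the alignment block length.
const MIN_BLOCK_LENGTH = 5000 // illustrative threshold, not from D-GENIES

function filterSmallAlignments(pafLines: string[]): string[] {
  return pafLines.filter(line => {
    const fields = line.split('\t')
    return Number(fields[10]) >= MIN_BLOCK_LENGTH
  })
}
```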