TRGTdb #6

Closed
wants to merge 102 commits into from

Conversation

ACEnglish

Adding code for converting a TRGT output VCF into a database. See tdb_tutorial.md for usage details.

TODOs:

  • This code uses truvari v4.0-dev (min commit 6edb54f). This requirement is because there were changes made to truvari.vcf2df. Until v4.0 is released, truvari will need to be manually installed. After truvari v4.0 is cut, we can simply uncomment the line from trgt/setup.py that installs it (line 31).
  • Currently, the trgt python package is versioned as 0.0.1. This should be synchronized with the main trgt executables' version.
  • trgt.database.dbutils.pull_saps assumes the allele length range is stored in the vcf as FORMAT/ALLR. However trgt v0.3.4 writes FORMAT/ALCI. Therefore, this code isn't compatible with trgt v0.3.4.
  • There is a placeholder in trgt.__main__ for wrapping the trgt main executable. If we want to distribute trgt with a single command line interface (e.g. trgt run, trgt viz, trgt db), we'll need to place the executables into the repository, update MANIFEST.in to package those executables, and then make external calls from run_main (e.g. Popen(os.path.join(os.path.dirname(trgt.__file__), 'bin', 'trgt'))).
  • There has been no license work on this code beyond setting the whole package to "BSD 3-Clause Clear License". I haven't checked that all of the dependencies are okay for PacBio to use. For example, this code uses pyarrow, which is distributed by Apache under the 'Apache-2.0 license', so that's probably okay. But there are a total of 20 dependencies that would need to be checked. To get the full list of dependencies:
git clone <trgt>
cd trgt/
# making an isolated virtualenv for trgt
python3 -m venv mtrgtpy
source mtrgtpy/bin/activate
# manually installing truvari v4.0-dev
cd ../truvari/
python3 -m pip install .
cd -
# installing trgt and its dependencies
python3 -m pip install .
python3 -m pip freeze | wc -l
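
The count above says how many packages are installed but not which license each one declares. As a rough starting point for that review, the sketch below lists the declared license of every distribution visible in the active virtualenv, using only the standard library's importlib.metadata. The function name list_dependency_licenses is made up for this example, and the output is only what each package declares about itself (it can be missing or imprecise), so it does not replace a real license check.

```python
from importlib import metadata


def list_dependency_licenses():
    """Return sorted (package, declared license) pairs for the active environment.

    Hypothetical helper: it only reports what each package declares in its own
    metadata, which may be absent or vague.
    """
    rows = []
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") or "unknown"
        # Free-text License field, if the package filled it in.
        license_field = meta.get("License")
        # Many packages declare the license only through trove classifiers.
        classifiers = [
            c for c in (meta.get_all("Classifier") or [])
            if c.startswith("License ::")
        ]
        declared = license_field or "; ".join(classifiers) or "not declared"
        rows.append((name, declared))
    return sorted(set(rows), key=lambda row: row[0].lower())


if __name__ == "__main__":
    for name, declared in list_dependency_licenses():
        # Keep only the first line; some packages paste the full license text.
        lines = declared.splitlines()
        print(f"{name}\t{lines[0] if lines else declared}")
```

Run inside the mtrgtpy virtualenv after installing trgt, this prints one line per installed package.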

mainly placeholder stuff
leveraging bit symmetry to enable decoding complement strand
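
The commit message does not say which encoding is used. One common way to get this kind of bit symmetry, shown here purely as an assumption, is a 2-bit code per base in which complementary bases are bitwise inverses, so the complement strand can be decoded by flipping both bits. The code table and function names below are hypothetical and are not taken from the trgt source.

```python
# Hypothetical 2-bit codes chosen so that complement(base) == code ^ 0b11:
# A (00) <-> T (11) and C (01) <-> G (10).
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}


def encode(seq):
    """Turn a DNA string into a list of 2-bit codes."""
    return [ENCODE[base] for base in seq.upper()]


def decode(codes, complement=False):
    """Decode 2-bit codes; with complement=True, flip both bits of each code."""
    mask = 0b11 if complement else 0b00
    return "".join(DECODE[code ^ mask] for code in codes)


def decode_reverse_complement(codes):
    """Decode the opposite strand: reverse the order and complement each base."""
    return decode(reversed(codes), complement=True)
```

With this table no second lookup is needed for the opposite strand: the same stored codes serve both strands, and only the XOR mask and read direction change.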
By creating the database as a collection of independent parquet files, we

1. ensure we're getting the columnar compression
2. enable fast de-identification by simply removing trgtdb/sample*.pq
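
The de-identification step in point 2 can be sketched as follows. This assumes a database directory whose per-sample tables match sample*.pq, as in the message above; the function name deidentify and the other file names in the usage note are made up for the example.

```python
from pathlib import Path


def deidentify(tdb_dir):
    """Delete the per-sample parquet files from a database directory.

    Assumes per-sample tables are the files matching sample*.pq directly
    inside tdb_dir; every other file is left untouched.
    Returns the names of the removed files.
    """
    removed = []
    for pq_file in sorted(Path(tdb_dir).glob("sample*.pq")):
        pq_file.unlink()
        removed.append(pq_file.name)
    return removed
```

For example, in a directory holding locus.pq, allele.pq, sample.A.pq and sample.B.pq (hypothetical names), deidentify would delete only the two sample files. Because each table is its own file, nothing else needs to be rewritten.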
finding needed changes to the data structures by implementing queries
Need to do lots of testing now and I need to clean up the code
It runs. Now gotta check if it's valid
should be able to `trgt db create -o consolidated.tdb input1.vcf input2.vcf.gz input3.tdb ...`
where we consolidate into a new location. Separately there is the command `append dest.tdb input.[tdb|vcf]`
where we consolidate into an existing database.
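
Since create and append both accept a mix of VCFs and existing databases, each input has to be routed to the right loader. The traceback later in this conversation shows create.py doing this by suffix (a path ending in .tdb, ignoring a trailing slash, is loaded as a database; anything else is treated as a VCF). The sketch below mirrors only that check; input_kind and plan_inputs are hypothetical names, and no actual loading happens.

```python
def input_kind(path):
    """Classify an input as an existing database ('tdb') or a VCF ('vcf').

    Mirrors the suffix check visible in the traceback: a trailing slash is
    ignored, and anything not ending in .tdb is assumed to be a VCF.
    """
    return "tdb" if path.rstrip("/").endswith(".tdb") else "vcf"


def plan_inputs(inputs):
    """Pair each input path with the kind of loader it would need."""
    return [(path, input_kind(path)) for path in inputs]
```

Note that this check never validates the file itself, so a mistyped path is silently treated as a VCF.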
getting commands to work and filling out library.
Need to start making functional tests on this to ensure it's working
Then can move on to multi-sample vcf/tdb. Then will be finished with alpha
simpler allele table building
now handles multi sample vcfs, tdb
To test if letting gzip/parquet do the compression helps
monref
parquet was truncating it for some reason
It will make it a tiny bit smaller, but something weird is happening during consolidation
it compresses better, but the array is unhashable.
can't assume it works, yet
Now time to clean
removing debug cruft from what is now jaccard
new query
locus_ji has parameters on the query. Still need to figure out how to best expose query parameters to the CLI
Made the allele count queries a little more useful with sample subsetting
cleaning tdb_tutorial - might want to remove that for Introduction.ipynb
This change needs to be tested. Also updating the notebook formerly known as ProbandOnly
Making it a little cleaner and correcting an experiment

zqfang commented Mar 10, 2024

Hi @ACEnglish,

I ran into an issue running the following command:

trgt db create -o strains.tdb trgt_out/SJL.sorted.vcf.gz   

The error message is ValueError: cannot reindex on an axis with duplicate labels

Traceback (most recent call last):
  File "/home/fangzq/.conda/envs/trgt/bin/trgt", line 33, in <module>
    sys.exit(load_entry_point('TRGT', 'console_scripts', 'trgt')())
  File "/home/fangzq/github/trgt/trgt/__main__.py", line 44, in main
    CMDS[args.cmd](args.options)
  File "/home/fangzq/github/trgt/trgt/dbcmds.py", line 28, in db_main
    CMDS[args.cmd][1](args.options)
  File "/home/fangzq/github/trgt/trgt/database/create.py", line 74, in create_main
    n_data = trgt.load_tdb(i) if i.rstrip('/').endswith(".tdb") else trgt.vcf_to_tdb(i)
  File "/home/fangzq/github/trgt/trgt/database/dbutils.py", line 209, in vcf_to_tdb
    allele_df = pull_alleles(data)
  File "/home/fangzq/github/trgt/trgt/database/dbutils.py", line 156, in pull_alleles
    alleles["LocusID"] = data["LocusID"]

Is there a preprocessing step required before importing trgt output into trgtdb?

My trgt command is:

./trgt-v0.8.0-linux_x86_64 --genome mm10.fa --repeats tr_catalog.adjusted.mm10.bed --reads SJL.aligned.sorted.bam --output-prefix trgt_out/SJL --threads 6

@ACEnglish
Author

Database tool has been refactored and placed into a repository at https://github.com/ACEnglish/tdb.

@zqfang - Please try from that repository and if the error still happens, open a ticket there.

ACEnglish closed this Jun 10, 2024