[MRG] Speed up copy_local_genomes.py with symbolic links #181

Merged
merged 11 commits into from
Sep 18, 2022

mr-eyes
Member

@mr-eyes mr-eyes commented Mar 4, 2022

genome_grist.copy_local_genomes takes a very long time to copy and compress FASTA files. This PR resolves #138 by adding the command-line option --sym, which creates symbolic links instead. --sym defaults to False; the default might change after we discuss the PR.
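To make the trade-off concrete, here is a minimal sketch of the two strategies. It is not the code in copy_local_genomes.py: the function name place_genome and its use_symlink parameter are hypothetical stand-ins for whatever --sym toggles, and the copy branch assumes the source file is uncompressed.

```python
import gzip
import os
import shutil


def place_genome(src, dest, use_symlink=False):
    """Make the genome at `src` available at `dest` (hypothetical helper).

    With use_symlink=True only a link is created, so no sequence data is read.
    Otherwise the file is read in full and written out gzip-compressed.
    """
    if use_symlink:
        # Link to the absolute path so the link does not depend on
        # the directory it is created in.
        os.symlink(os.path.abspath(src), dest)
    else:
        # Assumes `src` is uncompressed; every byte is read and recompressed,
        # which is the slow part for large genome collections.
        with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
```

The symlink branch costs roughly the same regardless of file size, while the copy branch scales with the size of each genome.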

Also updates the private genomes config file in the docs to the latest format (samples instead of sample).

@mr-eyes mr-eyes changed the title [WIP] Speed up copy_local_genomes.py with symbolic links [MRG] Speed up copy_local_genomes.py with symbolic links Mar 4, 2022
@mr-eyes mr-eyes changed the title [MRG] Speed up copy_local_genomes.py with symbolic links [WIP] Speed up copy_local_genomes.py with symbolic links Mar 4, 2022
@mr-eyes
Member Author

mr-eyes commented Mar 4, 2022

@ctb symlinks are working. I suspect the issue you observed came from symlinking relative paths rather than absolute paths, maybe?
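A minimal sketch of why a relative symlink target can break, assuming that was indeed the cause (the thread does not confirm it). A relative target is resolved against the directory that contains the link, not against the current working directory. All paths below are made up for illustration.

```python
import os
import tempfile

# Hypothetical layout: genomes/ holds the source file, links/ holds the symlinks.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "genomes"))
os.makedirs(os.path.join(root, "links"))
with open(os.path.join(root, "genomes", "a.fna.gz"), "wb") as fh:
    fh.write(b"placeholder")

os.chdir(root)
src = os.path.join("genomes", "a.fna.gz")  # relative to the working directory

# Relative target: resolved against links/, where genomes/a.fna.gz does not exist.
broken = os.path.join("links", "broken.fna.gz")
os.symlink(src, broken)

# Absolute target: resolves the same way from any directory.
working = os.path.join("links", "working.fna.gz")
os.symlink(os.path.abspath(src), working)

print(os.path.exists(broken))   # False: the link dangles
print(os.path.exists(working))  # True
```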

The remaining problem after this PR is that the Snakefile always expects a gzipped file named {{ident}}_genomic.fna.gz, which might not exist when symlinking uncompressed FASTA files. Does genome-grist really need the FASTA files to be compressed?

@ctb
Member

ctb commented Mar 4, 2022

cool, thanks!

the sticking point for compression is the mpileup_wc and build_new_consensus_wc rules, which need to make the files available to bcftools for alignment/pileup purposes. None of the custom code we've written cares about compression or format (because it uses screed). So that could be fixed, but I couldn't think of a simple solution.

Anyhoo, rather than fix the code to figure out if the file is gzipped and then do the right thing, I just decided that compressing files is good and we should be doing it in the first place :)
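For reference, the detection step being set aside here is small on its own: a gzip stream starts with the two magic bytes 0x1f 0x8b. A minimal sketch follows, with is_gzipped as a hypothetical helper name. It is not code from genome-grist, and it does not address the harder part, which is making the bcftools-based rules accept both compressed and uncompressed input.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC
```

Checking the content rather than the .gz extension avoids being fooled by misnamed files, at the cost of opening each file once.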

@mr-eyes
Member Author

mr-eyes commented Mar 4, 2022

Perfect! The PR is now ready for review :) Symlinking will now only work if the FASTA files are already compressed.

@mr-eyes mr-eyes changed the title [WIP] Speed up copy_local_genomes.py with symbolic links [MRG] Speed up copy_local_genomes.py with symbolic links Mar 4, 2022
@mr-eyes
Member Author

mr-eyes commented Mar 5, 2022

Observed behavior: Using symlinks will work, but the original filename will show in prefetch results, not the new one in the symlink.

@ctb
Member

ctb commented Mar 5, 2022

if it's just in the prefetch CSV filename field, that's probably fine! those paths aren't used for anything.

could you separate out the doc change into its own PR, or do you mind if I do? I think that can/should be merged quickly but I might not get to the symlink PR for a bit.

mr-eyes and others added 6 commits March 5, 2022 18:57
Co-authored-by: C. Titus Brown <titus@idyll.org>
Co-authored-by: C. Titus Brown <titus@idyll.org>
Co-authored-by: C. Titus Brown <titus@idyll.org>
Co-authored-by: C. Titus Brown <titus@idyll.org>
@ctb ctb merged commit 011034d into dib-lab:latest Sep 18, 2022
Development

Successfully merging this pull request may close these issues.

investigate hard links or symlinks for private genome collections
2 participants