diff --git a/.github/workflows/cmake_htslib.yml b/.github/workflows/cmake_htslib.yml index 47ef662..9e12123 100644 --- a/.github/workflows/cmake_htslib.yml +++ b/.github/workflows/cmake_htslib.yml @@ -29,7 +29,7 @@ jobs: htslib_matrix: strategy: matrix: - version: [1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.17, 1.18] + version: [1.12, 1.13, 1.14, 1.15, 1.16, 1.17, 1.18] os: [ubuntu-22.04, ubuntu-20.04, ubuntu-latest] # The CMake configure and build commands are platform agnostic and should work equally well on Windows or Mac. diff --git a/INSTALL.md b/INSTALL.md index 0125311..cf1df57 100644 --- a/INSTALL.md +++ b/INSTALL.md @@ -9,7 +9,7 @@ levioSAM2 supports a variety of methods for installation: ## Conda -``` +```shell conda install -c conda-forge -c bioconda leviosam2 ``` @@ -17,13 +17,13 @@ conda install -c conda-forge -c bioconda leviosam2 You can obtain a Docker image of the latest version from Docker hub: -``` +```shell docker pull naechyun/leviosam2:latest ``` ## Singularity -``` +```shell singularity pull docker://naechyun/leviosam2:latest ``` @@ -33,34 +33,37 @@ singularity pull docker://naechyun/leviosam2:latest Make sure the following prerequisite libraries are installed on your system. -- [htslib v1.11+ (tested up to v1.16)](https://github.com/samtools/htslib) +- [htslib v1.12+ (tested up to v1.18)](https://github.com/samtools/htslib) - [sdsl-lite v2.1.1+](https://github.com/simongog/sdsl-lite/) Both libraries are available through coda: -``` + +```shell conda install -c conda-forge sdsl-lite conda install -c bioconda htslib ``` Another easy way to install these dependencies is to use your OS's existing package system: -``` + +```shell apt-get install libhts-dev libsdsl-dev # Debian/Ubuntu brew tap brewsci/bio; brew install htslib sdsl-lite # MacOS ``` If using RedHat or Fedora, then you must install sdsl-lite manually. But you can install htslib through yum: -``` + +```shell yum install htslib ``` Or you can choose to install them manually by following the install instructions on their respective pages. -### CMake +### CMake Once the prerequisite packages are installed and the locations of their installations are known, specify their locations to CMake by running the following commands: -``` +```shell mkdir build cd build cmake .. @@ -73,12 +76,11 @@ make If you installed the dependencies manually, you might have to modify the `cmake` command to specify their library and include directory locations like so: -``` +```shell cmake -D CMAKE_LIBRARY_PATH="/path/to/libsdsl/;/path/to/libhts/" \ -D CMAKE_INCLUDE_PATH="/path/to/libsdsl/include/;/path/to/libhts/include/" .. ``` - ## Test We provide an end-to-end test and a set of unit tests for levioSAM. @@ -87,7 +89,6 @@ We provide an end-to-end test and a set of unit tests for levioSAM. - The unit test can be run `cd build; ctest` if you use cmake to build levioSAM; or `make gtest; cd testdata; ../gtest` if you use make to build levioSAM2. - ## ARM64 LevioSAM2 supports the arm64 architecture. Note that the distributed libraries of the dependencies (namely `htslib` and `sdsl-lite`) on Conda or other package managers might not support arm64. Thus, you might need to build the dependent libraries from source. Both `htslib` and `sdsl-lite` can be built under the arm64 architecture. diff --git a/README.md b/README.md index aefbd80..ed5a0e5 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,11 @@ +# LevioSAM2: Fast and accurate coordinate conversion between assemblies + [![Docker](https://img.shields.io/docker/v/naechyun/leviosam2?label=Docker)](https://hub.docker.com/r/naechyun/leviosam2) [![Anaconda-Server Badge](https://anaconda.org/bioconda/leviosam2/badges/version.svg)](https://anaconda.org/bioconda/leviosam2) [![Anaconda-Server Badge](https://anaconda.org/bioconda/leviosam2/badges/downloads.svg)](https://anaconda.org/bioconda/leviosam2) ![Cmake build](https://github.com/milkschen/leviosam2/actions/workflows/cmake_htslib.yml/badge.svg) ![Integration](https://github.com/milkschen/leviosam2/actions/workflows/integration-test.yml/badge.svg) -# LevioSAM2: Fast and accurate coordinate conversion between assemblies - LevioSAM2 lifts over alignments accurately and efficiently using a chain file. @@ -17,9 +17,9 @@ LevioSAM2 lifts over alignments accurately and efficiently using a chain file. - Converting aligned short and long reads records (in SAM/BAM/CRAM format) from one reference to another - Comprehensive alignment feature updating during lift-over: - - Reference name (`RNAME`), position (`POS`), alignmant flag (`FLAG`), and CIGAR alignment string (`CIGAR`) - - Mate read information (`RNEXT`, `PNEXT`, `TLEN`) - - (optional) Alignment tags (`MD:Z`, `NM:i`) + - Reference name (`RNAME`), position (`POS`), alignmant flag (`FLAG`), and CIGAR alignment string (`CIGAR`) + - Mate read information (`RNEXT`, `PNEXT`, `TLEN`) + - (optional) Alignment tags (`MD:Z`, `NM:i`) - Multithreading support - Toolkit for "selective" pipelines which consider major changes between the source and target references - (beta) Converting intervals (in BED format) from one reference to another @@ -30,7 +30,7 @@ LevioSAM2 can be installed using: - [Conda](https://anaconda.org/bioconda/leviosam2) -``` +```shell # The following commands install leviosam2 in a new conda environment called `leviosam2` conda create -n leviosam2 conda activate leviosam2 @@ -38,12 +38,14 @@ conda install -c bioconda -c conda-forge leviosam2 ``` - [Docker](https://hub.docker.com/r/naechyun/leviosam2) -``` + +```shell docker pull naechyun/leviosam2:latest ``` - [Singularity](https://hub.docker.com/r/naechyun/leviosam2) -``` + +```shell singularity pull docker://naechyun/leviosam2:latest ``` @@ -62,7 +64,7 @@ For other reference pairs, common ways to generate chain files include using the LevioSAM2 indexes a chain file for lift-over queries. The resulting index has a `.clft` extension. -``` +```shell leviosam2 index -c source_to_target.chain -p source_to_target -F target.fai ``` @@ -73,17 +75,17 @@ The levioSAM2 ChainMap index will be saved to `source_to_target.clft`. The outpu __We highly recommend to sort the input BAM by position prior to running levioSAM2-lift.__ -``` +```shell leviosam2 lift -C source_to_target.clft -a aligned_to_source.bam -p lifted_from_source -O bam ``` - ### Full levioSAM2 workflow with selective re-mapping The levioSAM2 workflow includes lift-over using the `leviosam2-lift` kernel and a selective re-mapping strategy. This approach can improve accuracy. Example: -``` + +```shell # You may skip the indexing step if you've already run it leviosam2 index -c source_to_target.chain -p source_to_target -F target.fai sh leviosam2.sh \ @@ -98,10 +100,9 @@ sh leviosam2.sh \ See [this README](https://github.com/milkschen/leviosam2/blob/main/workflow/README.md) to learn more about running the full levioSAM2 workflow. - ## Publication -- Nae-Chyun Chen, Luis Paulin, Fritz Sedlazeck, Sergey Koren, Adam Phillippy, Ben Langmead. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods (2023). https://doi.org/10.1038/s41592-023-02069-6 +- Nae-Chyun Chen, Luis Paulin, Fritz Sedlazeck, Sergey Koren, Adam Phillippy, Ben Langmead. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods (2023). https://doi.org/10.1038/s41592-023-02069-6 - Taher Mun, Nae-Chyun Chen, Ben Langmead. LevioSAM: Fast lift-over of variant-aware reference alignments, _Bioinformatics_, 2021;, btab396, https://doi.org/10.1093/bioinformatics/btab396 _Logo credit: Ting-Wei Young_ diff --git a/src/chain.cpp b/src/chain.cpp index 1487efa..8077487 100644 --- a/src/chain.cpp +++ b/src/chain.cpp @@ -822,8 +822,16 @@ int32_t ChainMap::get_num_clipped(const int32_t pos, const bool leftmost, return -1; } -/* Lift an alignment segment (there are two segments in a paired-end alignment) - * Return false if unliftable. +/** + * Lifts an alignment segment (a paired-end read has two segments). + * + * @param aln Alignment object to be lifted. + * @param hdr_source Source BAM header. + * @param hdr_dest Target BAM header. + * @param first_seg True if the first segment in a paired-end alignment. False + * if the mate. + * @param dest_contig Target reference config. + * @return True if liftable. */ bool ChainMap::lift_segment(bam1_t *aln, sam_hdr_t *hdr_source, sam_hdr_t *hdr_dest, bool first_seg, @@ -850,9 +858,7 @@ bool ChainMap::lift_segment(bam1_t *aln, sam_hdr_t *hdr_source, // the segment is unliftable if (!update_interval_indexes(source_contig, pos, start_sidx, start_eidx)) return false; - // Set a segment as unliftable if the gap between it - // and the nearby interval is too large - // + // sidx might be advanced here int32_t num_sclip_start = get_num_clipped(pos, true, source_contig, start_sidx, start_eidx); @@ -888,13 +894,17 @@ bool ChainMap::lift_segment(bam1_t *aln, sam_hdr_t *hdr_source, // Only check mate if paired-end if (c->tid < 0 || (c->flag & 1 && c->mtid < 0)) return false; - // Estimate ending pos of the mate - // If it's the first segment, we can calculate the query length wrt the REF; - // If it's the second segment, we can only do a rough estimate b/c we don't - // know the RLEN of the mate. - auto pos_end = (first_seg) - ? c->pos + bam_cigar2rlen(c->n_cigar, bam_get_cigar(aln)) - : c->mpos + c->l_qseq; + // Gets the ending pos of the mate. + // + // For the first segment, uses the cigar string to calculate the read + // reference length. + // + // For the mate segment, infers the ending position. The position would be + // accurate if the "MC:Z" tag is available. + auto pos_end = + (first_seg) ? c->pos + bam_cigar2rlen(c->n_cigar, bam_get_cigar(aln)) + : c->mpos + LevioSamUtils::get_mate_query_len_on_ref(aln); + int end_sidx = 0; int end_eidx = 0; // Update end indexes; if either is unavailable, mark the segment as diff --git a/src/chain_test.cpp b/src/chain_test.cpp index b7e721a..f60fb42 100644 --- a/src/chain_test.cpp +++ b/src/chain_test.cpp @@ -18,11 +18,6 @@ #include "gtest/gtest.h" #include "leviosam_utils.hpp" -size_t kstr_get_m(size_t var) { - size_t lvar = (size_t)exp2(ceil(log2(var))); - return lvar; -} - /* Chain tests */ TEST(ChainTest, SimpleRankAndLift) { std::vector> lm; diff --git a/src/leviosam.cpp b/src/leviosam.cpp index 8e75f92..8ea7b32 100644 --- a/src/leviosam.cpp +++ b/src/leviosam.cpp @@ -220,7 +220,7 @@ void read_and_lift(T *lift_map, std::mutex *mutex_fread, ref.substr(aln_vec[i]->core.pos, bam_cigar2rlen(aln_vec[i]->core.n_cigar, bam_get_cigar(aln_vec[i]))); - std::string q_seq = LevioSamUtils::get_read_as_is(aln_vec[i]); + std::string q_seq = LevioSamUtils::get_read(aln_vec[i]); std::vector new_cigar; // We perform global alignment for local sequences and // thus an alignment with a high number of clipped bases diff --git a/src/leviosam_test.cpp b/src/leviosam_test.cpp index db4b72a..4e183c5 100644 --- a/src/leviosam_test.cpp +++ b/src/leviosam_test.cpp @@ -9,15 +9,17 @@ * Distributed under the MIT license * https://github.com/milkschen/leviosam2 */ -#include +#include "leviosam.hpp" + #include + +#include #include -#include "leviosam.hpp" -#include "leviosam_utils.hpp" -#include "htslib/sam.h" -#include "gtest/gtest.h" -#include "bed.hpp" +#include "bed.hpp" +#include "gtest/gtest.h" +#include "htslib/sam.h" +#include "leviosam_utils.hpp" // // bit_vector tests @@ -30,7 +32,7 @@ TEST(BitVector, SimpleCreateAndAccess) { sdsl::bit_vector bv(n, 0); EXPECT_EQ(0, bv[0]); EXPECT_EQ(0, bv[500]); - EXPECT_EQ(0, bv[10000-1]); + EXPECT_EQ(0, bv[10000 - 1]); } TEST(BitVector, WriteAndRead) { @@ -38,14 +40,14 @@ TEST(BitVector, WriteAndRead) { sdsl::bit_vector bv(n, 0); bv[0] = 1; bv[500] = 1; - bv[10000-1] = 1; + bv[10000 - 1] = 1; EXPECT_EQ(1, bv[0]); EXPECT_EQ(0, bv[1]); EXPECT_EQ(0, bv[499]); EXPECT_EQ(1, bv[500]); EXPECT_EQ(0, bv[501]); - EXPECT_EQ(0, bv[10000-2]); - EXPECT_EQ(1, bv[10000-1]); + EXPECT_EQ(0, bv[10000 - 2]); + EXPECT_EQ(1, bv[10000 - 1]); } TEST(BitVector, RankSupport) { @@ -53,7 +55,7 @@ TEST(BitVector, RankSupport) { sdsl::bit_vector bv(n, 0); bv[0] = 1; bv[500] = 1; - bv[10000-1] = 1; + bv[10000 - 1] = 1; sdsl::rank_support_v<> rs(&bv); EXPECT_EQ(0, rs.rank(0)); EXPECT_EQ(1, rs.rank(1)); @@ -62,19 +64,19 @@ TEST(BitVector, RankSupport) { EXPECT_EQ(1, rs.rank(500)); EXPECT_EQ(2, rs.rank(501)); EXPECT_EQ(2, rs.rank(601)); - EXPECT_EQ(2, rs.rank(10000-1)); - EXPECT_EQ(3, rs.rank(10000-0)); + EXPECT_EQ(2, rs.rank(10000 - 1)); + EXPECT_EQ(3, rs.rank(10000 - 0)); } TEST(BitVector, SelectSupport) { sdsl::bit_vector bv(10000, 0); bv[0] = 1; bv[500] = 1; - bv[10000-1] = 1; + bv[10000 - 1] = 1; sdsl::bit_vector::select_1_type ss(&bv); EXPECT_EQ(0, ss.select(1)); EXPECT_EQ(500, ss.select(2)); - EXPECT_EQ(10000-1, ss.select(3)); + EXPECT_EQ(10000 - 1, ss.select(3)); } TEST(CompressedVector, CreateAndRead) { @@ -82,15 +84,15 @@ TEST(CompressedVector, CreateAndRead) { sdsl::bit_vector bv(n, 0); bv[0] = 1; bv[500] = 1; - bv[10000-1] = 1; + bv[10000 - 1] = 1; sdsl::sd_vector<> sv{bv}; EXPECT_EQ(1, sv[0]); EXPECT_EQ(0, sv[1]); EXPECT_EQ(0, sv[499]); EXPECT_EQ(1, sv[500]); EXPECT_EQ(0, sv[501]); - EXPECT_EQ(0, sv[10000-2]); - EXPECT_EQ(1, sv[10000-1]); + EXPECT_EQ(0, sv[10000 - 2]); + EXPECT_EQ(1, sv[10000 - 1]); sdsl::rrr_vector<63> rv{bv}; EXPECT_EQ(1, rv[0]); @@ -98,8 +100,8 @@ TEST(CompressedVector, CreateAndRead) { EXPECT_EQ(0, rv[499]); EXPECT_EQ(1, rv[500]); EXPECT_EQ(0, rv[501]); - EXPECT_EQ(0, rv[10000-2]); - EXPECT_EQ(1, rv[10000-1]); + EXPECT_EQ(0, rv[10000 - 2]); + EXPECT_EQ(1, rv[10000 - 1]); } TEST(CompressedVector, RankSupport) { @@ -107,7 +109,7 @@ TEST(CompressedVector, RankSupport) { sdsl::bit_vector bv(n, 0); bv[0] = 1; bv[500] = 1; - bv[10000-1] = 1; + bv[10000 - 1] = 1; sdsl::sd_vector<> sv{bv}; sdsl::sd_vector<>::rank_1_type rs(&sv); EXPECT_EQ(0, rs.rank(0)); @@ -117,20 +119,20 @@ TEST(CompressedVector, RankSupport) { EXPECT_EQ(1, rs.rank(500)); EXPECT_EQ(2, rs.rank(501)); EXPECT_EQ(2, rs.rank(601)); - EXPECT_EQ(2, rs.rank(10000-1)); - EXPECT_EQ(3, rs.rank(10000-0)); + EXPECT_EQ(2, rs.rank(10000 - 1)); + EXPECT_EQ(3, rs.rank(10000 - 0)); } TEST(CompressedVector, SelectSupport) { sdsl::bit_vector bv(10000, 0); bv[0] = 1; bv[500] = 1; - bv[10000-1] = 1; + bv[10000 - 1] = 1; sdsl::sd_vector<> sv{bv}; sdsl::sd_vector<>::select_1_type ss{&sv}; EXPECT_EQ(0, ss.select(1)); EXPECT_EQ(500, ss.select(2)); - EXPECT_EQ(10000-1, ss.select(3)); + EXPECT_EQ(10000 - 1, ss.select(3)); } /* Lift tests */ @@ -138,7 +140,7 @@ TEST(CompressedVector, SelectSupport) { TEST(Lift, EmptyLift) { sdsl::bit_vector ibv; sdsl::bit_vector dbv; - sdsl::bit_vector sbv; // ins, del, snp + sdsl::bit_vector sbv; // ins, del, snp ibv.resize(20); dbv.resize(20); sbv.resize(20); @@ -149,7 +151,7 @@ TEST(Lift, EmptyLift) { TEST(Lift, SimpleInsLift) { sdsl::bit_vector ibv; sdsl::bit_vector dbv; - sdsl::bit_vector sbv; // ins, del, snp + sdsl::bit_vector sbv; // ins, del, snp ibv.resize(20); dbv.resize(20); sbv.resize(20); @@ -165,7 +167,7 @@ TEST(Lift, SimpleInsLift) { TEST(Lift, SimpleDelLift) { sdsl::bit_vector ibv; sdsl::bit_vector dbv; - sdsl::bit_vector sbv; // ins, del, snp + sdsl::bit_vector sbv; // ins, del, snp ibv.resize(20); dbv.resize(20); sbv.resize(20); @@ -178,7 +180,6 @@ TEST(Lift, SimpleDelLift) { EXPECT_EQ(lift.lift_pos(6), 7); } - TEST(LiftMap, SimpleBamLift) { vcfFile* vcf_fp = bcf_open("simple_example.vcf", "r"); bcf_hdr_t* vcf_hdr = bcf_hdr_read(vcf_fp); @@ -214,18 +215,18 @@ TEST(LiftMap, SimpleBamCigarLift) { lmap.lift_cigar(sam_hdr->target_name[aln->core.tid], aln); test_cigar = bam_get_cigar(aln); EXPECT_EQ(aln->core.n_cigar, 3); - EXPECT_EQ(test_cigar[0], bam_cigar_gen( 11, BAM_CMATCH)); - EXPECT_EQ(test_cigar[1], bam_cigar_gen( 2, BAM_CDEL)); - EXPECT_EQ(test_cigar[2], bam_cigar_gen( 9, BAM_CMATCH)); + EXPECT_EQ(test_cigar[0], bam_cigar_gen(11, BAM_CMATCH)); + EXPECT_EQ(test_cigar[1], bam_cigar_gen(2, BAM_CDEL)); + EXPECT_EQ(test_cigar[2], bam_cigar_gen(9, BAM_CMATCH)); err = sam_read1(sam_fp, sam_hdr, aln); EXPECT_EQ(err, 0); lmap.lift_cigar(sam_hdr->target_name[aln->core.tid], aln); test_cigar = bam_get_cigar(aln); EXPECT_EQ(aln->core.n_cigar, 3); - EXPECT_EQ(test_cigar[0], bam_cigar_gen( 15, BAM_CMATCH)); - EXPECT_EQ(test_cigar[1], bam_cigar_gen( 1, BAM_CDEL)); - EXPECT_EQ(test_cigar[2], bam_cigar_gen( 5, BAM_CMATCH)); + EXPECT_EQ(test_cigar[0], bam_cigar_gen(15, BAM_CMATCH)); + EXPECT_EQ(test_cigar[1], bam_cigar_gen(1, BAM_CDEL)); + EXPECT_EQ(test_cigar[2], bam_cigar_gen(5, BAM_CMATCH)); sam_hdr_destroy(sam_hdr); sam_close(sam_fp); @@ -245,18 +246,18 @@ TEST(LiftMap, DelInIndelBamCigarLift) { lmap.lift_cigar(sam_hdr->target_name[aln->core.tid], aln); test_cigar = bam_get_cigar(aln); EXPECT_EQ(aln->core.n_cigar, 3); - EXPECT_EQ(test_cigar[0], bam_cigar_gen( 7, BAM_CMATCH)); - EXPECT_EQ(test_cigar[1], bam_cigar_gen( 15, BAM_CDEL)); - EXPECT_EQ(test_cigar[2], bam_cigar_gen( 13, BAM_CMATCH)); + EXPECT_EQ(test_cigar[0], bam_cigar_gen(7, BAM_CMATCH)); + EXPECT_EQ(test_cigar[1], bam_cigar_gen(15, BAM_CDEL)); + EXPECT_EQ(test_cigar[2], bam_cigar_gen(13, BAM_CMATCH)); err = sam_read1(sam_fp, sam_hdr, aln); EXPECT_EQ(err, 0); lmap.lift_cigar(sam_hdr->target_name[aln->core.tid], aln); test_cigar = bam_get_cigar(aln); EXPECT_EQ(aln->core.n_cigar, 3); - EXPECT_EQ(test_cigar[0], bam_cigar_gen( 8, BAM_CMATCH)); - EXPECT_EQ(test_cigar[1], bam_cigar_gen( 10, BAM_CDEL)); - EXPECT_EQ(test_cigar[2], bam_cigar_gen( 8, BAM_CMATCH)); + EXPECT_EQ(test_cigar[0], bam_cigar_gen(8, BAM_CMATCH)); + EXPECT_EQ(test_cigar[1], bam_cigar_gen(10, BAM_CDEL)); + EXPECT_EQ(test_cigar[2], bam_cigar_gen(8, BAM_CMATCH)); sam_hdr_destroy(sam_hdr); sam_close(sam_fp); @@ -276,24 +277,23 @@ TEST(LiftMap, SimpleSplicedBamCigarLift) { lmap.lift_cigar(sam_hdr->target_name[aln->core.tid], aln); test_cigar = bam_get_cigar(aln); EXPECT_EQ(aln->core.n_cigar, 3); - EXPECT_EQ(test_cigar[0], bam_cigar_gen( 7, BAM_CMATCH)); - EXPECT_EQ(test_cigar[1], bam_cigar_gen( 15, BAM_CREF_SKIP)); - EXPECT_EQ(test_cigar[2], bam_cigar_gen( 13, BAM_CMATCH)); + EXPECT_EQ(test_cigar[0], bam_cigar_gen(7, BAM_CMATCH)); + EXPECT_EQ(test_cigar[1], bam_cigar_gen(15, BAM_CREF_SKIP)); + EXPECT_EQ(test_cigar[2], bam_cigar_gen(13, BAM_CMATCH)); err = sam_read1(sam_fp, sam_hdr, aln); EXPECT_EQ(err, 0); lmap.lift_cigar(sam_hdr->target_name[aln->core.tid], aln); test_cigar = bam_get_cigar(aln); EXPECT_EQ(aln->core.n_cigar, 3); - EXPECT_EQ(test_cigar[0], bam_cigar_gen( 8, BAM_CMATCH)); - EXPECT_EQ(test_cigar[1], bam_cigar_gen( 10, BAM_CREF_SKIP)); - EXPECT_EQ(test_cigar[2], bam_cigar_gen( 8, BAM_CMATCH)); + EXPECT_EQ(test_cigar[0], bam_cigar_gen(8, BAM_CMATCH)); + EXPECT_EQ(test_cigar[1], bam_cigar_gen(10, BAM_CREF_SKIP)); + EXPECT_EQ(test_cigar[2], bam_cigar_gen(8, BAM_CMATCH)); sam_hdr_destroy(sam_hdr); sam_close(sam_fp); } - TEST(HtslibTest, CigarOpType) { auto typeM = bam_cigar_type(BAM_CMATCH); auto typeD = bam_cigar_type(BAM_CDEL); @@ -312,13 +312,12 @@ TEST(HtslibTest, CigarOpType) { EXPECT_EQ(typeI & 1, 1); EXPECT_EQ(typeI & 2, 0); EXPECT_EQ(typeI & 3, 1); - + EXPECT_EQ(typeS & 1, 1); EXPECT_EQ(typeS & 2, 0); EXPECT_EQ(typeS & 3, 1); } - TEST(UtilsTest, StrToSet) { std::set s = LevioSamUtils::str_to_set("a,b", ","); EXPECT_EQ(s.size(), 2); @@ -327,7 +326,6 @@ TEST(UtilsTest, StrToSet) { EXPECT_EQ(s.find("c"), s.end()); } - TEST(UtilsTest, UpdateFlagUnmap) { samFile* sam_fp = sam_open("simple_example.sam", "r"); sam_hdr_t* sam_hdr = sam_hdr_read(sam_fp); @@ -348,7 +346,6 @@ TEST(UtilsTest, UpdateFlagUnmap) { sam_close(sam_fp); } - TEST(UltimaGenomicsTest, UpdateFlags) { samFile* sam_fp = sam_open("ultima_small.sam", "r"); sam_hdr_t* sam_hdr = sam_hdr_read(sam_fp); @@ -360,17 +357,15 @@ TEST(UltimaGenomicsTest, UpdateFlags) { LevioSamUtils::update_ultima_genomics_tags(aln, true); // Tests the tp tag - uint8_t *tp_ptr = bam_aux_get(aln, "tp"); + uint8_t* tp_ptr = bam_aux_get(aln, "tp"); EXPECT_EQ(bam_auxB_len(tp_ptr), 10); - std::vector exp_tp_array { - -1,-1,0,1,-1,1,1,1,1,-1 - }; + std::vector exp_tp_array{-1, -1, 0, 1, -1, 1, 1, 1, 1, -1}; for (uint32_t i = 0; i < 10; i++) { - EXPECT_EQ(bam_auxB2i(tp_ptr, i), exp_tp_array[i]); + EXPECT_EQ(bam_auxB2i(tp_ptr, i), exp_tp_array[i]); } // Tests the t0 tag - uint8_t *t0_ptr = bam_aux_get(aln, "t0"); + uint8_t* t0_ptr = bam_aux_get(aln, "t0"); std::string t0_str = bam_aux2Z(t0_ptr); EXPECT_EQ(t0_str, "IIIII22222"); @@ -378,8 +373,40 @@ TEST(UltimaGenomicsTest, UpdateFlags) { sam_close(sam_fp); } +TEST(ChainTest, test_get_mate_query_len_on_ref) { + std::string hdr_str = + "@HD VN:1.0 SO:unsorted\n@SQ SN:chr1 LN:248956422"; + sam_hdr_t* sam_hdr = sam_hdr_parse(hdr_str.length(), &hdr_str[0]); + bam1_t* aln = bam_init1(); + int err; + + kstring_t str1; + std::string record1 = + "read1 81 chr1 145334831 33 6S10M " + "= 1245932 0 ATTACATTCCATTCCA ~~~~~~~~~~~~~~~~ MC:Z:10M"; + str1.s = (char*)record1.c_str(); + str1.l = record1.length(); + str1.m = kstr_get_m(str1.l); + err = sam_parse1(&str1, sam_hdr, aln); + EXPECT_EQ(err, 0); + EXPECT_EQ(LevioSamUtils::get_mate_query_len_on_ref(aln), 10); + bam_destroy1(aln); + + aln = bam_init1(); + kstring_t str2; + std::string record2 = + "read1 81 chr1 145334831 33 6S10M " + "= 1245932 0 ATTACATTCCATTCCA ~~~~~~~~~~~~~~~~ MC:Z:6S5M1I4M"; + str2.s = (char*)record2.c_str(); + str2.l = record2.length(); + str2.m = kstr_get_m(str2.l); + err = sam_parse1(&str2, sam_hdr, aln); + EXPECT_EQ(err, 0); + EXPECT_EQ(LevioSamUtils::get_mate_query_len_on_ref(aln), 9); + bam_destroy1(aln); +} -int main(int argc, char **argv){ +int main(int argc, char** argv) { testing::InitGoogleTest(&argc, argv); return RUN_ALL_TESTS(); } diff --git a/src/leviosam_utils.cpp b/src/leviosam_utils.cpp index 7c122b3..cbd0d81 100644 --- a/src/leviosam_utils.cpp +++ b/src/leviosam_utils.cpp @@ -12,8 +12,14 @@ #include "leviosam_utils.hpp" #include +#include #include +size_t kstr_get_m(size_t var) { + size_t lvar = (size_t)std::exp2(std::ceil(std::log2(var))); + return lvar; +} + namespace LevioSamUtils { void WriteDeferred::init( @@ -238,7 +244,10 @@ void debug_print_cigar(uint32_t* cigar, size_t n_cigar) { std::cerr << "\n"; } -/* Remove MN:i and MD:z tags from an alignment (bam1_t) object */ +/** Removes the MN:i and MD:z tags from an alignment object. + * + * @param aln Alignment object. + */ void remove_nm_md_tag(bam1_t* aln) { uint8_t* ptr = NULL; if ((ptr = bam_aux_get(aln, "MD")) != NULL) { @@ -249,11 +258,36 @@ void remove_nm_md_tag(bam1_t* aln) { } } -/* Return the read, reverse complemented if necessary - Adapted from: https://github.com/samtools/samtools/blob/develop/bam_fastq.c -*/ -std::string get_read(const bam1_t* rec) { - int len = rec->core.l_qseq + 1; +/** Gets the reference length of the mate segment. + * + * Leverages the "MC:Z" flag when available. If not, infers the query length. + * + * @param aln Alignment object. + * @return Reference length of the mate segment. + */ +hts_pos_t get_mate_query_len_on_ref(const bam1_t* aln) { + const bam1_core_t* c = &(aln->core); + uint8_t* ptr = NULL; + uint8_t* mc = bam_aux_get(aln, "MC"); + if (mc) { + char* mate_cigar = bam_aux2Z(mc); + uint32_t* a_cigar = NULL; + size_t n_cigar = 0; + sam_parse_cigar(mate_cigar, NULL, &a_cigar, &n_cigar); + return bam_cigar2rlen(n_cigar, a_cigar); + } else { + return c->l_qseq; + } +} + +/** Gets the forward read sequence. + * + * Adapted from: https://github.com/samtools/samtools/blob/develop/bam_fastq.c + * + * @param rec Alignment obejct. + * @return Read sequence string. + */ +std::string get_forward_read(const bam1_t* rec) { char* seq = (char*)bam_get_seq(rec); std::string read = ""; @@ -267,9 +301,14 @@ std::string get_read(const bam1_t* rec) { return read; } -/* Return the read, regardless of reverse complement status */ -std::string get_read_as_is(const bam1_t* rec) { - int len = rec->core.l_qseq + 1; +/** Gets the aligned read sequence. + * + * Use `get_forward_read()` if interested in the forward read sequence. + * + * @param rec Alignment obejct. + * @return Read sequence string. + */ +std::string get_read(const bam1_t* rec) { char* seq = (char*)bam_get_seq(rec); std::string read = ""; @@ -360,7 +399,7 @@ FastqRecord::~FastqRecord() { /* Write a FastqRecord object to a FASTQ file */ int FastqRecord::write(ogzstream& out_fq, std::string name) { if (aln != NULL) { - seq_str = get_read(aln); + seq_str = get_forward_read(aln); std::string qual_seq(""); uint8_t* qual = bam_get_qual(aln); if (qual[0] == 255) diff --git a/src/leviosam_utils.hpp b/src/leviosam_utils.hpp index 4e27551..f65a1c1 100644 --- a/src/leviosam_utils.hpp +++ b/src/leviosam_utils.hpp @@ -34,6 +34,8 @@ const int8_t seq_comp_table[16] = {0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15}; +size_t kstr_get_m(size_t var); + namespace LevioSamUtils { const std::vector DEFER_OPT{"lifted", "mapq", "clipped_frac", @@ -88,8 +90,7 @@ class WriteDeferred { const BedUtils::Bed& b_defer_source, const BedUtils::Bed& b_defer_dest, const BedUtils::Bed& b_commit_source, - const BedUtils::Bed& b_commit_dest, - const float& b_isec_th); + const BedUtils::Bed& b_commit_dest, const float& b_isec_th); void print_info(); void write_deferred_bam(bam1_t* aln, sam_hdr_t* hdr); @@ -118,9 +119,15 @@ class WriteDeferred { void update_cigar(bam1_t* aln, std::vector& new_cigar); void debug_print_cigar(uint32_t* cigar, size_t n_cigar); + +/** @brief Remove the MN:i and MD:z tags from an alignment object. */ void remove_nm_md_tag(bam1_t* aln); + +/** @brief Gets the reference length of the mate segment. */ +hts_pos_t get_mate_query_len_on_ref(const bam1_t* aln); + +std::string get_forward_read(const bam1_t* rec); std::string get_read(const bam1_t* rec); -std::string get_read_as_is(const bam1_t* rec); void update_flag_unmap(bam1_t* aln, const bool first_seg, const bool keep_mapq); std::vector str_to_vector(const std::string str, @@ -128,11 +135,11 @@ std::vector str_to_vector(const std::string str, std::set str_to_set(const std::string str, const std::string regex_str); -/// @brief Returns true if a bam record is reversely lifted. +/** @brief Returns true if a bam record is reversely lifted. */ bool is_reversedly_lifted(bam1_t* aln_orig, bam1_t* aln_lft); -/// @brief Updates BAM tags specific to Ultima Genomics. -void update_ultima_genomics_tags(bam1_t *aln, bool rev); +/** @brief Updates BAM tags specific to Ultima Genomics. */ +void update_ultima_genomics_tags(bam1_t* aln, bool rev); int reverse_seq_and_qual(bam1_t* aln); sam_hdr_t* fai_to_hdr(std::string fai_fn, const sam_hdr_t* const hdr_orig); diff --git a/workflow/README.md b/workflow/README.md index 7812be5..dc06832 100644 --- a/workflow/README.md +++ b/workflow/README.md @@ -81,7 +81,7 @@ bash leviosam2.sh \ For both short and long reads. Different parameters are recommended for each sequence type. - Supported aligners: [minimap2](https://github.com/lh3/minimap2), [winnowmap2](https://github.com/marbl/Winnowmap), -[bwa mem](https://github.com/lh3/bwa) and [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) +[bwa mem](https://github.com/lh3/bwa) and [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) - `-g` sets the max allowed size of overlapping chain gaps of an alignment, `-H` is the max edit distance of an alignment (post-realignment) to be committed. Please adjust these parameters according to your long read platform and library preparation diff --git a/workflow/leviosam2.py b/workflow/leviosam2.py index 0463a3c..53cc733 100644 --- a/workflow/leviosam2.py +++ b/workflow/leviosam2.py @@ -188,7 +188,6 @@ def parse_args() -> argparse.Namespace: "-O", "--out_format", type=str, - required=True, default="bam", choices=["bam", "sam", "cram"], help="Output file format.",