Add RNA-Seq TIN QC support #56

Merged
merged 43 commits into from
Sep 19, 2024

@jonperdomo jonperdomo commented Aug 8, 2024

Add TIN values for RNA-Seq QC from BAM files, including unit tests.

@jonperdomo jonperdomo linked an issue Aug 8, 2024 that may be closed by this pull request
@jonperdomo jonperdomo self-assigned this Aug 8, 2024
@jonperdomo

I tested with a GTEx RNA-seq file, GTEX-14BMU-0526-SM-5CA2F_rep.FAK93376.bam, and compared the results with RSeQC. RSeQC's tin.py exposes parameters for minimum coverage and sample size (each with a default value), so I implemented both parameters to allow direct comparisons and so that users can expect results that closely match RSeQC's. For transcripts, I downloaded the latest GENCODE v46 file of basic gene annotations for the GRCh38 reference chromosomes, gencode.v46.basic.annotation.bed, from https://www.gencodegenes.org/human/release_46.html
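
For context on what the two parameters feed into, here is a minimal Python sketch of a TIN score as I understand RSeQC computes it: coverage is read at up to `sample_size` positions along the transcript, and the score is 100 * exp(H) / k, where H is the Shannon entropy of the coverage fractions and k is the number of sampled positions. The function names `pick_positions` and `tin_score` are hypothetical, the spacing rule is a simplification, and the minimum-coverage filter is not modeled. This is not the code from RSeQC or LongReadSum.

```python
import math


def pick_positions(length, sample_size):
    """Return up to `sample_size` roughly equally spaced 0-based indices.

    Assumption: transcripts no longer than `sample_size` use every position.
    """
    if length <= sample_size:
        return list(range(length))
    step = length / sample_size
    return [int(i * step) for i in range(sample_size)]


def tin_score(coverage):
    """Compute 100 * exp(H) / k for the coverage at k sampled positions.

    H is the Shannon entropy of the per-position coverage fractions,
    taken over positions with non-zero coverage only.
    """
    k = len(coverage)
    covered = [c for c in coverage if c > 0]
    total = sum(covered)
    if k == 0 or total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in covered)
    return 100.0 * math.exp(entropy) / k


if __name__ == "__main__":
    # Perfectly even coverage scores close to 100.
    print(tin_score([5] * 10))
    # Coverage concentrated at one of four positions scores 100 / 4.
    print(tin_score([0, 0, 0, 8]))
```

Even coverage maximizes the entropy, so the score approaches 100, while coverage piled onto a few positions pulls it toward 0.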

I set minimum coverage to 2, and sample size to 100.
RSeQC:

tin.py -i "${mod_bam}" -r "${bed_file}" -c 2 -n 100
Number of scores: 67069
Mean TIN: 67.089549182989
Median TIN: 74.25578864168884
Standard deviation of TIN: 26.001131242677577

LongReadSum:

longreadsum bam -i "${mod_bam}" -o "${output_dir}" -t 12 --genebed "${bed_file}" --min-coverage 2 --sample-size 100
Number of scores: 67069
Mean TIN: 67.0683
Median TIN: 74.25
Standard deviation of TIN: 26.0379

@jonperdomo

This PR will also address the help text error from issue #57.

@jonperdomo

Updated results with high precision.

TIN Results

RSeQC:

tin.py -i "${mod_bam}" -r "${bed_file}" -c 2 -n 100
Number of scores: 67069
Mean TIN: 67.089549182989
Median TIN: 74.25578864168884
Standard deviation of TIN: 26.001131242677577

LongReadSum:

longreadsum bam -i "${mod_bam}" -o "${output_dir}" -t 12 --genebed "${bed_file}" --min-coverage 2 --sample-size 100
Number of scores: 67069
Mean TIN: 67.06832655372376
Median TIN: 74.24996965188242
Standard deviation of TIN: 26.03788585287367

Performance comparison with seff (jobs run with --mem=50G, --cpus-per-task=8, --time=12:00:00):

RSeQC:

Nodes: 1
Cores per node: 8
CPU Utilized: 07:55:21
CPU Efficiency: 12.45% of 2-15:39:12 core-walltime
Job Wall-clock time: 07:57:24
Memory Utilized: 166.25 MB
Memory Efficiency: 0.32% of 50.00 GB

LongReadSum:

Nodes: 1
Cores per node: 8
CPU Utilized: 02:48:34
CPU Efficiency: 12.67% of 22:10:56 core-walltime
Job Wall-clock time: 02:46:22
Memory Utilized: 5.91 GB
Memory Efficiency: 11.83% of 50.00 GB
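
As a quick check on the seff figures above, this Python sketch parses the reported durations and derives the wall-clock speedup and the CPU efficiency. The helper names `parse_duration` and `cpu_efficiency` are hypothetical, and the parser only handles the `[D-]HH:MM:SS` form that appears in this output.

```python
def parse_duration(text):
    """Convert a '[D-]HH:MM:SS' duration string to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, core_walltime):
    """CPU time used as a percentage of the allocated core-walltime."""
    return 100.0 * parse_duration(cpu_utilized) / parse_duration(core_walltime)


# Values copied from the seff output above.
rseqc_wall = parse_duration("07:57:24")
longreadsum_wall = parse_duration("02:46:22")
speedup = rseqc_wall / longreadsum_wall

print(f"Wall-clock speedup: {speedup:.2f}x")
print(f"RSeQC CPU efficiency: {cpu_efficiency('07:55:21', '2-15:39:12'):.2f}%")
print(f"LongReadSum CPU efficiency: {cpu_efficiency('02:48:34', '22:10:56'):.2f}%")
```

On these numbers LongReadSum finishes in roughly a third of the wall-clock time, at the cost of higher memory use (5.91 GB versus 166.25 MB).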

@jonperdomo

Add a unit test to complete this PR.

@jonperdomo

This PR adds a new feature for calculating TIN scores. It writes the scores and their summary statistics in TSV format and adds the summary to the HTML report:

image
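
For illustration, the summary step can be sketched in Python as follows. The column names, the `summarize` and `write_summary_tsv` helpers, and the use of the population standard deviation are all assumptions; the actual LongReadSum TSV layout is not shown in this thread.

```python
import csv
import io
import statistics


def summarize(scores):
    """Summary statistics for a list of per-transcript TIN scores.

    Assumption: population standard deviation (pstdev); the tool may differ.
    """
    return {
        "count": len(scores),
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.pstdev(scores),
    }


def write_summary_tsv(summary, handle):
    """Write a one-row TSV with a header line (hypothetical column names)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(summary.keys())
    writer.writerow(summary.values())


if __name__ == "__main__":
    example_scores = [50.0, 75.0, 100.0]  # made-up values, not real TIN data
    buffer = io.StringIO()
    write_summary_tsv(summarize(example_scores), buffer)
    print(buffer.getvalue(), end="")
```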

@jonperdomo jonperdomo marked this pull request as ready for review September 19, 2024 16:54
@jonperdomo jonperdomo merged commit b000dfb into main Sep 19, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Add TIN QC for RNA-seq data