Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fraser2 & docs #582

Merged
merged 5 commits into from
Oct 14, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 29 additions & 1 deletion docs/source/prepare.rst
Original file line number Diff line number Diff line change
Expand Up @@ -169,7 +169,7 @@ Calling variants on RNA-seq data may be useful for researchers who do not have a
The RNA variant calling process uses information from multiple samples (as designated by the ``groups`` variable) to improve the quality of the called variants. However, the larger the group size, the more costly the computation is in terms of time and resources. To prioritize accuracy, include many samples in each ``DROP_GROUP``, and to prioritize speed up computation, separate samples into many groups. Additionally, certain vcf and bed files must be included to further boost the quality of the called variants (refer to `files-to-download`_).

===================== ========= ================================================================================================================================================================================================ =========
Parameter Type Description Default/Examples
Parameter Type Description Default/Examples
===================== ========= ================================================================================================================================================================================================ =========
run boolean If true, the module will be run. If false, it will be ignored. ``true``
groups list Same as in aberrant expression. ``# see aberrant expression example``
Expand Down Expand Up @@ -226,6 +226,34 @@ column order does not matter. Also, it does not matter where it is stored, as th
specified in the config file. Here we provide some examples on how to deal with certain
situations. For simplicity, we do not include all possible columns in the examples.

===================== ========= ================================================================================================================================================================================================ ==========================
Parameter Type Description Default/Examples
===================== ========= ================================================================================================================================================================================================ ==========================
RNA_ID character Unique identifier of an RNA assay. ``sample1``
RNA_BAM_FILE character Absolute path to the BAM file derived from RNA-seq. A BAM file can belong to only one RNA_ID and vice versa. ``path/to/sample1.bam``
DNA_ID character Unique identifier of a DNA assay. ``sample1``
DNA_VCF_FILE character Absolute path to the DNA VCF file. The DNA_ID has to match the ID inside the VCF file. In case a multisample VCF is used, write the file name for each sample. ``path/to/sample1.vcf``
DROP_GROUP character The analysis group(s) that the RNA assay belongs to. Multiple groups must be separated by commas and no spaces (e.g. blood,WES,groupA). We recommend doing a different analysis for each tissue
as gene expression and splicing is tissue specific. ``group1,group2``
PAIRED_END boolean Either TRUE or FALSE, depending on whether the sample comes from paired-end RNA-seq or not. ``TRUE``
COUNT_MODE character Either ``Union``, ``IntersectionStrict`` or ``IntersectionNotEmpty``. Refer to the documentation of HTSeq for details. ``IntersectionStrict``
COUNT_OVERLAPS character Either TRUE or FALSE, depending on whether reads overlapping different regions are allowed and counted. ``TRUE``
STRAND character Either yes, no, or reverse: ``no`` means that the sequencing was not strand specific; ``yes`` that it was strand specific, and the first read in the pair is on the same strand as the feature
and the second read on the opposite strand; and ``reverse`` that the sequencing is strand specific and the first read in the pair is on the opposite strand to the feature and the second read
on the same strand. ``no``
HPO_TERMS character Comma-separated phenotypes encoded as HPO terms. ``HP:0001479,HP:0005591``
GENE_COUNTS_FILE character (Only required for aberrant expression external samples) Absolute path to the external gene-level count matrix. ``/path/to/gene_counts/geneCounts.tsv.gz``
GENE_ANNOTATION character (Only required for aberrant expression external samples) Gene annotation used to obtain the count matrix. Must correspond to the key of an entry in the geneAnnotation parameter of the config
file. ``v29``
GENOME character (Optional) Either ``ncbi`` or ``ucsc`` indicating the reference genome assembly. ``ncbi``
SPLICE_COUNTS_DIR character (Only required for aberrant splicing external samples) Absolute path to the directory containing the external files required for aberrant splicing module. ``/path/to/splicing_dir/``
SEX character (Optional) Either ``m``, ``male`` or ``f``, ``female`` or empty for samples with unknown sex values. When provided, a sex matching algorithm will be run on the RNA-seq counts and a sex value
will be predicted for all samples. ``m``
TISSUE character (Optional) Recommended to be provided when exporting counts. ``BRAIN``
DISEASE character (Optional) Recommended to be provided when exporting counts. ``AML``
===================== ========= ================================================================================================================================================================================================ ==========================



Using External Counts
++++++++++++++++++++++++++++++++++
Expand Down
10 changes: 9 additions & 1 deletion drop/config/SampleAnnotation.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,8 @@ def __init__(self, file, root, genome):
# external counts
self.extGeneCountIDs = self.createGroupIds(file_type="GENE_COUNTS_FILE", sep=',')
self.extSpliceCountIDs = self.createGroupIds(file_type="SPLICE_COUNTS_DIR", sep=',')

self.checkNonExternalGeneAnnotation()

def parse(self, sep='\t'):
"""
read and check sample annotation for missing columns
Expand Down Expand Up @@ -329,3 +330,10 @@ def getIDsByGroup(self, group, assay="RNA"):
def getSampleIDs(self, file_type):
ids = self.subsetFileMapping(file_type)["ID"]
return list(ids)

def checkNonExternalGeneAnnotation(self):
external_groups = set([g for g in self.extGeneCountIDs if len(self.extGeneCountIDs[g]) > 0])
non_external_samples = self.annotationTable[self.annotationTable['DROP_GROUP'].isin(external_groups) == False]
if sum(non_external_samples['GENE_ANNOTATION'].isna() == False) > 0:
logger.info("WARNING: Found %d samples that had `GENE_ANNOTATION` provided in sample annotation table but are not external samples. The provided GENE_ANNOTATIONs are ignored.\n" % (sum(non_external_samples['GENE_ANNOTATION'].isna() == False)))
self.annotationTable.loc[self.annotationTable['DROP_GROUP'].isin(non_external_samples) == False, "GENE_ANNOTATION"] = ""
8 changes: 4 additions & 4 deletions drop/demo/config_relative.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -42,16 +42,16 @@ aberrantSplicing:
keepNonStandardChrs: false
filter: false
minExpressionInOneSample: 20
quantileMinExpression: 10
quantileForFiltering: 0.95
quantileMinExpression: 10
minDeltaPsi: 0.05
implementation: PCA
padjCutoff: 1
maxTestedDimensionProportion: 6
genesToTest: "Data/genes_to_test.yaml"
### FRASER1 configuration for demo dataset
FRASER_version: "FRASER"
### FRASER2 configuration for demo dataset
FRASER_version: "FRASER2"
deltaPsiCutoff : 0.05
quantileForFiltering: 0.95

mae:
run: true
Expand Down
6 changes: 4 additions & 2 deletions drop/modules/aberrant-expression-pipeline/Counting/Summary.R
Original file line number Diff line number Diff line change
Expand Up @@ -238,8 +238,10 @@ if(isEmpty(sex_idx)){
annotation_logticks(sides = 'bl') +
labs(color = 'Sex', shape = 'Predicted sex', alpha = 'Matches sex')
plot(g)

DT::datatable(sex_dt[match_sex == F], caption = 'Sex mismatches')
if(sex_dt[match_sex == F, .N] > 0){
print('Sex mismatches:')
print(sex_dt[match_sex == F])
}
} else {
g <- ggplot(sex_dt, aes(XIST+1, UTY+1)) + geom_point(aes(col = SEX)) +
scale_x_log10(limits = c(1,NA)) + scale_y_log10(limits = c(1,NA)) +
Expand Down
16 changes: 8 additions & 8 deletions drop/template/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -51,14 +51,14 @@ aberrantSplicing:
padjCutoff: 0.1
maxTestedDimensionProportion: 6
genesToTest: null
### FRASER1 configuration
FRASER_version: "FRASER"
deltaPsiCutoff : 0.3
quantileForFiltering: 0.95
### For FRASER2, use the follwing parameters instead of the 3 lines above:
# FRASER_version: "FRASER2"
# deltaPsiCutoff : 0.1
# quantileForFiltering: 0.75
### FRASER2 configuration
FRASER_version: "FRASER2"
deltaPsiCutoff : 0.1
quantileForFiltering: 0.75
### For FRASER1, use the follwing parameters instead of the 3 lines above:
# FRASER_version: "FRASER"
# deltaPsiCutoff : 0.3
# quantileForFiltering: 0.95

mae:
run: true
Expand Down
2 changes: 1 addition & 1 deletion tests/config/test_AS.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ def test_config(self, dropConfig,demo_dir):
'deltaPsiCutoff': 0.05,
'maxTestedDimensionProportion': 6,
'genesToTest': 'Data/genes_to_test.yaml',
'FRASER_version': 'FRASER'
'FRASER_version': 'FRASER2',
}
assert dict_.items() <= dropConfig.AS.dict_.items()

Expand Down
7 changes: 5 additions & 2 deletions tests/pipeline/test_AS.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,9 @@ def test_results(self, demo_dir):
annotation = "v29"
dataset = "fraser"
r = run(f"wc -l {results_dir}/{annotation}/fraser/{dataset}/results_per_junction.tsv", demo_dir)
assert "4137" == r.stdout.split()[0]
assert "980" == r.stdout.split()[0]
r = run(f"wc -l {results_dir}/{annotation}/fraser/{dataset}/results.tsv", demo_dir)
assert "310" == r.stdout.split()[0]
assert "194" == r.stdout.split()[0]
r = run(f"wc -l {results_dir}/{annotation}/fraser/{dataset}/results_gene_all.tsv", demo_dir)
assert "391" == r.stdout.split()[0]

Loading