Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge external splicing counts #247

Merged
merged 73 commits into from
Apr 22, 2022
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
73 commits
Select commit Hold shift + click to select a range
4b6b0bf
initial merge of external splicing counts for FRASER
c-mertes Feb 10, 2021
e0e5844
fix download and add more test cases
c-mertes Feb 10, 2021
7e59764
fix test
c-mertes Feb 10, 2021
434135d
fix wget download and heatmap plotting
c-mertes Feb 10, 2021
d353a56
adapt to new naming of sampleannotation
c-mertes Aug 11, 2021
2712658
use only exact matching in subsetBy related to #244
c-mertes Aug 11, 2021
f85e130
fix merge of subsetGroups function related to: #246
c-mertes Aug 12, 2021
91566f6
fix snakemake file dependency after merging external counts.
c-mertes Aug 12, 2021
0cf5832
correct naming
c-mertes Aug 12, 2021
b1be9d3
cleanup code
c-mertes Aug 12, 2021
6e0467c
update FRASER dependency for merge count functionality
c-mertes Aug 13, 2021
48a0ab2
Merge branch 'dev' into new_external_merge_splicing
c-mertes Oct 11, 2021
b86f008
Merge branch 'dev'
nickhsmith Mar 17, 2022
971a401
merge with dev
nickhsmith Mar 28, 2022
fdfd3cd
change input/output paths.
Mar 29, 2022
39744c9
add symlinks
Mar 30, 2022
d7e0894
add explicit biallelic filter
nickhsmith Mar 30, 2022
2f81989
update regex matching
nickhsmith Mar 31, 2022
d1f60cd
snakemake 7 workarounds
Mar 31, 2022
ce0a75f
Merge branch 'small_fix' into new_external_merge_splicing
nickhsmith Mar 31, 2022
a5f8de0
Update to MAE filter scripts
kvn95ss Mar 22, 2022
4dbcf0e
update backend for externalCounts
nickhsmith Mar 31, 2022
5c40c88
remove importExport for test
nickhsmith Mar 31, 2022
fa12be8
comments and cleanup
nickhsmith Apr 1, 2022
ab7598f
rename demo groups
nickhsmith Apr 1, 2022
4b385c3
more information with external counts
Apr 1, 2022
9d68b2f
Update README.md
vyepez88 Apr 1, 2022
a079606
update with fdsMerge
Apr 1, 2022
f98ca7c
change group names
Apr 4, 2022
39b5590
comments
nickhsmith Apr 4, 2022
1d3d994
Merge branch 'small_fix' of github.com:gagneurlab/drop into small_fix
nickhsmith Apr 4, 2022
f6ea598
AE summary
Apr 5, 2022
41b9d21
Summary styling
nickhsmith Apr 5, 2022
f76b741
update splicing summary and comments
nickhsmith Apr 5, 2022
ab4545b
format summary
nickhsmith Apr 6, 2022
d83ec49
external counts documentation
nickhsmith Apr 6, 2022
6505a65
documentation and updating
nickhsmith Apr 6, 2022
83ca561
update MAE summary and results
nickhsmith Apr 7, 2022
cd5f487
format overview
nickhsmith Apr 7, 2022
16ef35c
Overview code block
nickhsmith Apr 7, 2022
82870ac
update QC matching
nickhsmith Apr 7, 2022
c9f3585
process NA rare
nickhsmith Apr 7, 2022
3f23198
docs
nickhsmith Apr 7, 2022
5bf0d44
mae cutoffs to get results
nickhsmith Apr 7, 2022
14edfc4
update docs
nickhsmith Apr 7, 2022
27549c9
update docs
nickhsmith Apr 7, 2022
cfb8309
update docs
nickhsmith Apr 7, 2022
0e970ee
update output docs
nickhsmith Apr 7, 2022
d298002
typo
nickhsmith Apr 7, 2022
5c038b9
Merge branch 'small_fix' into new_external_merge_splicing
nickhsmith Apr 7, 2022
b25a9d4
fix cutoffs and plotting
nickhsmith Apr 7, 2022
aa9dd7b
MAE results test
nickhsmith Apr 7, 2022
0141dc8
update test to match demo config
nickhsmith Apr 7, 2022
c100b74
allow for legacy sample annotation
nickhsmith Apr 8, 2022
9e9d909
improve legacy handling
nickhsmith Apr 8, 2022
0eb78fc
update FRASER version requiremtent
Apr 11, 2022
ef65020
fix column typo
Apr 11, 2022
0973a53
update plots to match config
Apr 11, 2022
c363c11
update
Apr 11, 2022
59c4d2a
Update README.md
vyepez88 Apr 12, 2022
4c83f2d
Clarifications added to possible QC values
vyepez88 Apr 12, 2022
420d31c
Update DNA_RNA_matrix_plot.R
vyepez88 Apr 12, 2022
2322f5f
code review formatting fixes
nickhsmith Apr 13, 2022
eae85db
Merge branch 'small_fix' into new_external_merge_splicing
nickhsmith Apr 13, 2022
46df827
update docs
nickhsmith Apr 13, 2022
4264ff0
html outputs
nickhsmith Apr 13, 2022
9daef0e
MAE plot xlim
nickhsmith Apr 13, 2022
a04da62
Merge branch 'dev' into new_external_merge_splicing
nickhsmith Apr 21, 2022
5917caa
code-review fixes
nickhsmith Apr 22, 2022
2d220f8
Update output.rst
nickhsmith Apr 22, 2022
ea1b26d
Update output.rst
nickhsmith Apr 22, 2022
d15a702
Update output.rst
nickhsmith Apr 22, 2022
1a628cb
Update output.rst
vyepez88 Apr 22, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ The manuscript is available in [Nature Protocols](https://www.nature.com/article
DROP is available on [bioconda](https://anaconda.org/bioconda/drop).
We recommend using a dedicated conda environment. (installation time: ~ 10min)
```
mamba install -c conda-forge -c bioconda drop
mamba create -n drop_env -c conda-forge -c bioconda drop
```

Test installation with demo project
Expand Down
2 changes: 1 addition & 1 deletion drop/config/ExportCounts.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ def __init__(
self.CONFIG_KEYS = ["geneAnnotations", "excludeGroups"]
self.config_dict = self.setDefaults(dict_, genome.annotation)
self.outputRoot = outputRoot / "exported_counts"
self.sa = sampleAnnotation
self.sampleAnnotation = sampleAnnotation
self.genomeAssembly = genome.assembly
self.geneAnnotations = self.get("geneAnnotations")
self.modules = {
Expand Down
54 changes: 32 additions & 22 deletions drop/config/SampleAnnotation.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@


class SampleAnnotation:
FILE_TYPES = ["RNA_BAM_FILE", "DNA_VCF_FILE", "GENE_COUNTS_FILE"]
FILE_TYPES = ["RNA_BAM_FILE", "DNA_VCF_FILE", "GENE_COUNTS_FILE", "SPLICE_COUNTS_DIR"]
SAMPLE_ANNOTATION_COLUMNS = FILE_TYPES + [
"RNA_ID", "DNA_ID", "DROP_GROUP", "GENE_ANNOTATION",
"PAIRED_END", "COUNT_MODE", "COUNT_OVERLAPS", "STRAND", "GENOME"
Expand All @@ -32,6 +32,7 @@ def __init__(self, file, root, genome):
self.dnaIDs = self.createGroupIds(file_type="DNA_VCF_FILE", sep=',')
# external counts
self.extGeneCountIDs = self.createGroupIds(file_type="GENE_COUNTS_FILE", sep=',')
self.extSpliceCountIDs = self.createGroupIds(file_type="SPLICE_COUNTS_DIR", sep=',')

def parse(self, sep='\t'):
"""
Expand Down Expand Up @@ -80,10 +81,12 @@ def createSampleFileMapping(self):
columns: [ID | ASSAY | FILE_TYPE | FILE_PATH ]
"""

assay_mapping = {'RNA_ID': ['RNA_BAM_FILE', 'GENE_COUNTS_FILE'], 'DNA_ID': ['DNA_VCF_FILE']}
assay_mapping = {'RNA_ID': ['RNA_BAM_FILE', 'GENE_COUNTS_FILE', 'SPLICE_COUNTS_DIR'], 'DNA_ID': ['DNA_VCF_FILE']}
assay_subsets = []
for id_, file_types in assay_mapping.items():
for file_type in file_types:
if file_type not in self.annotationTable.columns:
continue
df = self.annotationTable[[id_, file_type]].dropna().drop_duplicates().copy()
df.rename(columns={id_: 'ID', file_type: 'FILE_PATH'}, inplace=True)
df['ASSAY'] = id_
Expand Down Expand Up @@ -142,21 +145,20 @@ def createGroupIds(self, group_key="DROP_GROUP", file_type=None, sep=','):
groups = set(groups)

# collect IDs per group
grouped = {gr: df[df[group_key].str.contains(f'(^|{sep}){gr}({sep}|$)')][assay_id].tolist()
grouped = {gr: df[df[group_key].str.contains(f'(?:^|{sep}){gr}(?:{sep}|$)')][assay_id].tolist()
for gr in groups}
# remove groups labeled as None
grouped = {gr: list(set(ids)) for gr, ids in grouped.items() if gr is not None}
return grouped

### Subsetting

def subsetSampleAnnotation(self, column, values, subset=None, exact_match=True):
def subsetSampleAnnotation(self, column, values, subset=None):
"""
subset by one or more values of different columns from sample file mapping
:param column: valid column in sample annotation
:param values: values of column to subset
:param subset: subset sample annotation
:param exact_match: whether to match substrings in the sample annotation, false allows substring matching
"""
sa_cols = set(self.SAMPLE_ANNOTATION_COLUMNS)
if subset is None:
Expand All @@ -171,7 +173,7 @@ def subsetSampleAnnotation(self, column, values, subset=None, exact_match=True):
# check if column is valid
if column not in sa_cols:
raise KeyError(f"Column '{column}' not present in sample annotation.")
return utils.subsetBy(subset, column, values, exact_match=exact_match)
return utils.subsetBy(subset, column, values)

def subsetFileMapping(self, file_type=None, sample_id=None):
"""
Expand Down Expand Up @@ -230,10 +232,9 @@ def getFilePaths(self, file_type, group=None):
return self.getFilePath(sampleIDs, file_type, single_file=False)

# build a dictionary from the drop group and column. like getImportCounts with skipping options and dict output
def getGenomes(self, value, group, file_type="RNA_ID",
column="GENOME", group_key="DROP_GROUP",exact_match = True,skip = False):
def getGenomes(self, value, group, file_type="RNA_ID", column="GENOME", group_key="DROP_GROUP", skip = False):
"""
:param value: values to match in the column. Must be an exact match, passed to subsetting sample annotation
:param value: values to match in the column.
:param group: a group of the group_key (DROP_GROUP) column.
:return: dict file_type to column
"""
Expand All @@ -242,24 +243,30 @@ def getGenomes(self, value, group, file_type="RNA_ID",
if skip:
subset = None
else:
subset = self.subsetSampleAnnotation(column, value,exact_match=True)
subset = self.subsetSampleAnnotation(column, value)

# additionally subset for the group_key and the group
subset = self.subsetSampleAnnotation(group_key, group, subset,exact_match=exact_match)
subset = self.subsetSampleAnnotation(group_key, group, subset)

return {sample_id: value for sample_id in subset[file_type].tolist()}

def getImportCountFiles(self, annotation, group, file_type="GENE_COUNTS_FILE",
annotation_key="GENE_ANNOTATION", group_key="DROP_GROUP",exact_match = True):
def getImportCountFiles(self, annotation, group, file_type: str = "GENE_COUNTS_FILE",
annotation_key: str = "GENE_ANNOTATION", group_key: str = "DROP_GROUP",
asSet: bool = True):
"""
:param annotation: annotation name as specified in config and GENE_ANNOTATION column
:param group: a group of the DROP_GROUP column. exact match is passed to subsetter, false allows for substring matching
:param annotation: annotation name as specified in config and GENE_ANNOTATION column. Can be None
:param group: a group of the DROP_GROUP column.
:return: set of unique external count file names
"""

#subset for the annotation_key in the annotation group and the group_key in the group
subset = self.subsetSampleAnnotation(annotation_key, annotation,exact_match=exact_match)
subset = self.subsetSampleAnnotation(group_key, group, subset,exact_match=False)
return set(subset[file_type].tolist())
subset = self.subsetSampleAnnotation(annotation_key, annotation)
subset = self.subsetSampleAnnotation(group_key, group, subset)

ans = subset[file_type].tolist()
if asSet:
ans = set(ans)
return ans

def getRow(self, column, value):
sa = self.annotationTable
Expand All @@ -277,15 +284,18 @@ def getGroupedIDs(self, assays):
Get group to IDs mapping
:param assays: list of or single assay the IDs should be from. Can be file_type or 'RNA'/'DNA'
"""
assays = [assays] if isinstance(assays, str) else assays
if isinstance(assays, str):
assays = [assays]
groupedIDs = defaultdict(list)
for assay in assays:
if "RNA" in assay:
groupedIDs.update(self.rnaIDs)
utils.deep_merge_dict(groupedIDs, self.rnaIDs, inplace=True)
elif "DNA" in assay:
groupedIDs.update(self.dnaIDs)
utils.deep_merge_dict(groupedIDs, self.dnaIDs, inplace=True)
elif "GENE_COUNT" in assay:
groupedIDs.update(self.extGeneCountIDs)
utils.deep_merge_dict(groupedIDs, self.extGeneCountIDs, inplace=True)
elif "SPLICE_COUNT" in assay:
utils.deep_merge_dict(groupedIDs, self.extSpliceCountIDs, inplace=True)
else:
raise ValueError(f"'{assay}' is not a valid assay name")
return groupedIDs
Expand Down
2 changes: 1 addition & 1 deletion drop/config/submodules/AberrantExpression.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ def __init__(self, config, sampleAnnotation, processedDataDir, processedResultsD
please fix to only have either external count or BAM processing\n")

# check number of IDs per group
all_ids = {g: self.rnaIDs[g] + self.extRnaIDs[g] for g in self.groups}
all_ids = self.sampleAnnotation.subsetGroups(self.groups, assay=["RNA", "GENE_COUNTS"])
self.checkSubset(all_ids)

def setDefaultKeys(self, dict_):
Expand Down
32 changes: 28 additions & 4 deletions drop/config/submodules/AberrantSplicing.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
import numpy as np
import pandas as pd

from snakemake.io import expand

from drop import utils
Expand All @@ -16,8 +19,11 @@ def __init__(self, config, sampleAnnotation, processedDataDir, processedResultsD
# if self.run is false return without doing any config/sa checks for completeness
if not self.run:
return
self.rnaIDs = self.sampleAnnotation.subsetGroups(self.groups, assay="RNA")
self.checkSubset(self.rnaIDs)

self.rnaIDs = self.sampleAnnotation.subsetGroups(self.groups, assay="RNA")
self.rnaExIDs = self.sampleAnnotation.subsetGroups(self.groups, assay="SPLICE_COUNT")
all_ids = self.sampleAnnotation.subsetGroups(self.groups, assay=["RNA", "SPLICE_COUNT"])
self.checkSubset(all_ids)

def setDefaultKeys(self, dict_):
super().setDefaultKeys(dict_)
Expand All @@ -44,7 +50,7 @@ def getSplitCountFiles(self, dataset):
:return: list of files
"""
ids = self.sampleAnnotation.getIDsByGroup(dataset, assay="RNA")
file_stump = self.processedDataDir / "aberrant_splicing" / "datasets" / "cache" / f"raw-{dataset}" / \
file_stump = self.processedDataDir / "aberrant_splicing" / "datasets" / "fromBam" / "cache" / f"raw-{dataset}" / \
"sample_tmp" / "splitCounts"
done_files = str(file_stump / "sample_{sample_id}.done")
return expand(done_files, sample_id=ids)
Expand All @@ -56,7 +62,25 @@ def getNonSplitCountFiles(self, dataset):
:return: list of files
"""
ids = self.sampleAnnotation.getIDsByGroup(dataset, assay="RNA")
file_stump = self.processedDataDir / "aberrant_splicing" / "datasets" / "cache" / f"raw-{dataset}" / \
file_stump = self.processedDataDir / "aberrant_splicing" / "datasets" / "fromBam" / "cache" / f"raw-{dataset}" / \
"sample_tmp" / "nonSplitCounts"
done_files = str(file_stump / "sample_{sample_id}.done")
return expand(done_files, sample_id=ids)


def getExternalCounts(self, group: str, fileType: str = "k_j_counts"):
"""
Get externally provided splice count data dir based on the given group.
If a file type is given the corresponding file within the folder is returned.
:param group: DROP group name from wildcard
:param fileType: name of the file without extension which is to be returned
:return: list of directories or files
"""
ids = self.sampleAnnotation.getIDsByGroup(group, assay="SPLICE_COUNT")
extCountFiles = self.sampleAnnotation.getImportCountFiles(annotation=None, group=group,
file_type="SPLICE_COUNTS_DIR", asSet=False)
if fileType is not None:
extCountFiles = np.asarray(extCountFiles)[pd.isna(extCountFiles) == False].tolist()
extCountFiles = [x + "/" + fileType + ".tsv.gz" for x in extCountFiles]
return extCountFiles

10 changes: 5 additions & 5 deletions drop/config/submodules/MonoallelicExpression.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ def setDefaultKeys(self, dict_):
return dict_

def checkConfigSampleannotation(self):
subset = self.sampleAnnotation.subsetSampleAnnotation("DROP_GROUP", self.groups, exact_match=False)
subset = self.sampleAnnotation.subsetSampleAnnotation("DROP_GROUP", self.groups)

if len(self.genomeFiles.keys()) > 1: # more than 1 value in config defined genome dictionary
if "GENOME" not in subset.columns.values: # GENOME column not defined
Expand Down Expand Up @@ -149,24 +149,24 @@ def setGenomeDict(self, genomeFiles):
if len(genomeFiles) == 1: # globally defined in the config
globalGenome = list(genomeFiles.values())[0]

# subset SA by the drop group (not exact match) and skip the filtering by SA-GENOME column
# subset SA by the drop group and skip the filtering by SA-GENOME column
genomeDict = self.sampleAnnotation.getGenomes(
globalGenome,
self.groups,
file_type="RNA_ID",
column="DROP_GROUP", group_key="DROP_GROUP",
exact_match=False, skip=True
skip=True
)
else:
# subset SA by the drop group (not exact match) and filter by SA-GENOME column. Must exactly match config key
# subset SA by the drop group and filter by SA-GENOME column. Must exactly match config key
for gf in genomeFiles.keys():
genomeDict.update(
self.sampleAnnotation.getGenomes(
gf,
self.groups,
file_type="RNA_ID",
column="GENOME", group_key="DROP_GROUP",
exact_match=False, skip=False
skip=False
)
)

Expand Down
6 changes: 4 additions & 2 deletions drop/demo/config_relative.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -16,13 +16,14 @@ exportCounts:
- v29
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to maintain it twice the file? In the code base and in the resource tar file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not in the resource tar

excludeGroups:
- mae
- import_exp
- outrider_external
- fraser_external

aberrantExpression:
run: true
groups:
- outrider
- import_exp
- outrider_external
fpkmCutoff: 1
implementation: autoencoder
padjCutoff: 1
Expand All @@ -36,6 +37,7 @@ aberrantSplicing:
run: true
groups:
- fraser
- fraser_external
recount: true
longRead: false
keepNonStandardChrs: true
Expand Down
Loading