Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] pe correction, low_complexity_filter & abund_cutoff as config params #107

Closed
wants to merge 13 commits into from
13 changes: 13 additions & 0 deletions doc/configuring.md
Original file line number Diff line number Diff line change
Expand Up @@ -363,6 +363,19 @@ genbank_cache: ./genbank_cache
# the default is set for the all-Genbank database.
# DEFAULT: 100e9
prefetch_memory: 100e9


### OTHER PARAMETERS

## Error trimming flags
# fastp_correction: set to ON or 1 for base correction when overlapping PE are present
fastp_correction: OFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use YAML booleans - it looks like OFF is a valid one (per https://yaml.org/type/bool.html), but your code below doesn't use boolean value checking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that checking for boolean?

correction_flag = "--correction" if config_correction in [True, '1'] else ''

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 'if' here worries me - if it's a boolean, then you should just be able to say if config_correction. Most specifically, '1' should not be magic (which it seems to be here). I don't really know how snakemake's config yaml parser interprets valid vs invalid YAML boolean values, but I would be disturbed to see 'yes' interpreted as false here.

YAML doesn't seem to have a simple answer for what is a valid bool - this is a mess! -

 y|Y|yes|Yes|YES|n|N|no|No|NO
|true|True|TRUE|false|False|FALSE
|on|On|ON|off|Off|OFF

I guess the simplest thing to do would be to insist that it be a valid bool (True or False), not a '1' or anything else, and complain if it's not.

(I'm disgruntled with YAML because of this kind of thing - Luiz mentioned an increasing preference for TOML, because YAML isn't really that well specified.)

# fastp_low_complexity: set to ON or 1 for applying low complexity filter
fastp_low_complexity: OFF

## k-mers abundance trimming
# remove k-mers below this abundance from high-abundance reads; does not affect low-abundance reads
abundtrim_cutoff: 3
```

## More advanced genome-grist usage
Expand Down
30 changes: 23 additions & 7 deletions genome_grist/conf/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ if not base_tempdir:
sys.exit(-1)

ABUNDTRIM_MEMORY = float(config.get('metagenome_trim_memory', '1e9'))

ABUNDTRIM_CUTOFF = config.get("abundtrim_cutoff", '3')
GENBANK_CACHE = config.get('genbank_cache', './genbank_cache/')
GENBANK_CACHE = os.path.normpath(GENBANK_CACHE)

Expand Down Expand Up @@ -90,6 +90,17 @@ def make_param_str(ksizes, scaled):

###

# handle `fastp` params.
def handle_fastp():
config_correction = config.get("fastp_correction", 'OFF')
config_low_complexity = config.get("fastp_low_complexity", 'OFF')
correction_flag = "--correction" if config_correction in [True, '1'] else ''
low_complexity_flag = "--low_complexity_filter" if config_low_complexity in [True, '1'] else ''
params_string = f"{correction_flag} {low_complexity_flag}"
return params_string

###

# utility function
def load_csv(filename):
with open(filename, "r", newline="") as fp:
Expand Down Expand Up @@ -493,6 +504,8 @@ rule trim_adapters_wc:
interleaved = protected(outdir + '/trim/{sample}.trim.fq.gz'),
json=outdir + "/trim/{sample}.trim.json",
html=outdir + "/trim/{sample}.trim.html",
params:
extra_params = handle_fastp()
conda: 'env/trim.yml'
threads: 4
resources:
Expand All @@ -502,9 +515,9 @@ rule trim_adapters_wc:
shell: """
fastp --in1 {input.r1} --in2 {input.r2} \
--detect_adapter_for_pe --qualified_quality_phred 4 \
--length_required 25 --correction --thread {threads} \
--length_required 25 {params.extra_params} --thread {threads} \
--json {output.json} --html {output.html} \
--low_complexity_filter --stdout | gzip -9 > {output.interleaved}
--stdout | gzip -9 > {output.interleaved}
"""

# adapter trimming for the singleton reads
Expand All @@ -515,6 +528,8 @@ rule trim_unpaired_adapters_wc:
unp = protected(outdir + '/trim/{sample}_unpaired.trim.fq.gz'),
json = protected(outdir + '/trim/{sample}_unpaired.trim.json'),
html = protected(outdir + '/trim/{sample}_unpaired.trim.html'),
params:
extra_params = handle_fastp()
threads: 4
resources:
mem_mb=5000,
Expand All @@ -523,9 +538,9 @@ rule trim_unpaired_adapters_wc:
conda: 'env/trim.yml'
shell: """
fastp --in1 {input.unp} --out1 {output.unp} \
--detect_adapter_for_se --qualified_quality_phred 4 \
--low_complexity_filter --thread {threads} \
--length_required 25 --correction \
--detect_adapter_for_se {params.extra_params} --qualified_quality_phred 4 \
--thread {threads} \
--length_required 25 \
--json {output.json} --html {output.html}
"""

Expand All @@ -540,8 +555,9 @@ rule kmer_trim_reads_wc:
mem_mb = int(ABUNDTRIM_MEMORY / 1e6),
params:
mem = ABUNDTRIM_MEMORY,
cutoff = ABUNDTRIM_CUTOFF,
shell: """
trim-low-abund.py -C 3 -Z 18 -M {params.mem} -V \
trim-low-abund.py -C {params.cutoff} -Z 18 -M {params.mem} -V \
{input.interleaved} -o {output} --gzip
"""

Expand Down