Binning & bin refinement extension #252

Closed
jfy133 opened this issue Nov 16, 2021 · 7 comments · Fixed by #263

Comments

@jfy133
Member

jfy133 commented Nov 16, 2021

In addition to the aDNA-specific extension, we would like to expand the binning options and also add bin refinement.

This was originally written as a Snakemake workflow by @alexhbnr, but it would fit in quite well here. So I will start writing up this proposal now, and will probably ship it over a couple of PRs.

I'm leaving the diagrams below for discussion, and I'll also add personal 'dev notes' while I start preparing the new workflow proposals:

From @alexhbnr

IMG_20211008_133244144_1
IMG_20211008_133249834_2

Dev Notes

2021-11-25

  • versions still not reported (this will require a larger rewrite of MAG to use all 'official' nf-core/modules)
  • [x] naming scheme for the depths files for binning is only partially implemented. Can't work out how to dynamically get meta.assembler in there yet due to $suffix behaviour. Gregor/Mahesh/Harshil have posted suggestions on #DSL2-transition on Slack

2021-11-29

  • replaced most of original metabat_binning, but need to work out how to gunzip all binned fastas for prokka
  • had to make some changes to metabat2 and possibly GUNZIP, will need to push changes upstream

2021-12-02

  • gunzip solution implemented with a .transpose and groupTuple
  • METABAT2 still needs to be updated for unbinned contigs to be exported, but I'm having issues conditionally exporting them
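
As a rough illustration of the .transpose and groupTuple pattern mentioned in this note, here is a plain-Python sketch of what the two Nextflow channel operators do to channel items. It is not pipeline code: the sample names, file names, and the suffix-stripping stand-in for GUNZIP are all made up for the example.

```python
from collections import defaultdict

def transpose(items):
    """Flatten (key, [a, b, ...]) items into (key, a), (key, b), ..."""
    for key, values in items:
        for value in values:
            yield key, value

def group_tuple(pairs):
    """Collect (key, value) pairs back into (key, [values]), in first-seen key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped.items())

# Hypothetical binner output: one item per sample, each holding several gzipped bins.
binned = [
    ("sample1", ["bin.1.fa.gz", "bin.2.fa.gz"]),
    ("sample2", ["bin.1.fa.gz"]),
]

# One (sample, file) pair per bin, so each file can be decompressed on its own.
per_file = list(transpose(binned))

# Stand-in for the GUNZIP step: only the .gz suffix is stripped here.
gunzipped = [(key, name[: -len(".gz")]) for key, name in per_file]

# Regroup the decompressed files per sample for downstream tools such as prokka.
regrouped = group_tuple(gunzipped)
print(regrouped)
```

In the real workflow the key would be the meta map rather than a string; a string is used here only because Python dict keys must be hashable.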

2021-12-07

  • Seem to have a working version of MetaBAT2 now, PR is open for nf-core/modules
  • Issues getting output files in the right directories, currently tooShort/lowDepth not being published
  • one of the get_mag_depth output goes to a results directory called mag/, not sure why

2021-12-07

  • The output directory issue was due to the splitting up of each section into different modules, rather than all in one. Therefore output files were overwriting each other - I've tried to split these up in a way that makes sense.
  • Still stuck on why mag_depths* modules publish a single sample and into the root of results directory called mag...

2022-01-13

  • Fixed the issue of mag_depths not publishing in the right place (I had commented out the addParams DSL2 v1 syntax; reactivating it made it work as expected)
  • Outstanding issue: JGICONTIGDEPTH depth.txt.gz file is not being published (and a couple of other files).
    • Suspicion: two modules cannot publish in the same output directory?
    • Actual: prefix was defined as suffix, so the output file pattern wasn't working 🤦
    • RESOLVED!
  • Outstanding issue: minigut2 is not producing heatmap for depth/some binning for some reason
    • unbinned minigut_sample2 files aren't being published either (just the lowdepth/tooshort), even though they are generated by METABAT2 <- this is because the entire unbinned directory is published in SPLIT_FASTQ, so it's overwriting it... 🤦‍♀ [RESOLVED - all files now published, could consider matching publication directories again]
    • Suspicion: heatmap not being generated for sample2, as unbinned files not being mixed with the binned FASTAs. Need to check which unbinned files merged for the depth plot

2022-02-09

  • Starting work on MaxBin2
  • Depth file from bioawk seems to be broken?
  • Still need to: add MB2 to modules.config for publication
  • Work out how to merge MB2 binning results downstream (add info to meta, etc)

2022-02-11

  • Depths working, both binners running

  • Problem with publishing discarded/unbinned output from split_fastq back into the respective binner directories

    • It seems that split_fastq wants to 'overwrite' the existing discarded/unbinned directories, or gets blocked from publishing into them?
    • i.e. path: { "${params.outdir}/GenomeBinning/${meta.binner}/discarded/" },
      doesn't work

    image

  • Need to check why MAG_DEPTHS isn't working now with extra assemblers, and why the output looks the same

    • possibly because depth files aren't groupBying with their corresponding FASTAs, as we've inserted the binner metadata
  • Currently testing with BUSCO deactivated, need to test with it on

2022-02-18

  • THE ISSUE WITH MAG_DEPTHS MIGHT BE THAT I NEED TO MAKE A CLONE OF META WHEN MODIFYING IT. Thanks to @Midnighter for the tip...
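
The clone-before-modifying tip can be illustrated with a small Python analogy. In Nextflow, meta is a Groovy map, and Python dicts share the same reference behaviour: modifying a map in place changes it for every branch still holding that object. This is a sketch under that assumption only. The function names and values are hypothetical, and it is not the actual fix applied in the pipeline.

```python
def add_binner_in_place(meta, binner):
    # Mutates the caller's dict: every holder of this object sees the change.
    meta["binner"] = binner
    return meta

def add_binner_on_copy(meta, binner):
    # Builds a new dict first, so the original stays untouched.
    new_meta = dict(meta)
    new_meta["binner"] = binner
    return new_meta

# In-place modification: the depths branch and the bins branch share one object.
original = {"id": "sample1", "assembler": "MEGAHIT"}
depths_key = original
add_binner_in_place(original, "MetaBAT2")
print(depths_key)  # the depths branch's key has silently gained 'binner'

# Copy-then-modify: the depths branch keeps its original key.
fresh = {"id": "sample1", "assembler": "MEGAHIT"}
with_binner = add_binner_on_copy(fresh, "MetaBAT2")
print(fresh)
print(with_binner)
```

With the in-place version, keys that were meant to differ between branches end up identical, or change after the fact, which could plausibly explain depth files no longer grouping with their FASTAs.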
@jfy133 jfy133 added the enhancement New feature or request label Nov 16, 2021
@jfy133
Member Author

jfy133 commented Nov 16, 2021

At the moment I think I would split into two PRs:

  1. binning itself
    • replace local metabat2 with the new official nf-core modules
    • add maxbin2
  2. bin refinement

binning

  • Should probably separate out the metabat2 binning from bowtie2 mapping and generating the depths (e.g., binning_prep)
    • depths already seem to be generated with a custom python script get_depth.py (or similar), would need to adapt Alex's bioawk -t '{{if (NR > 1){{print $1, $3}}}}' {input} > {output} cmd for maxbin2
  • The output of this binning then goes into separate metabat2 subworkflow & maxbin workflow
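
For discussion, here is a minimal Python sketch of what the bioawk command above does once the doubled Snakemake braces are unescaped: skip the header row and keep columns 1 and 3 of a tab-separated table. The example table is made up. Its column names follow what jgi_summarize_bam_contig_depths is understood to write (contig name, length, total average depth), but treat that as an assumption rather than something confirmed in this thread.

```python
import csv
import io

def depth_for_maxbin2(lines):
    """Skip the header row and keep columns 1 and 3 of a tab-separated table."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # drop the header row (NR > 1 in the awk version)
    # Rows with fewer than three columns are skipped here; awk would print an empty field.
    return [(row[0], row[2]) for row in reader if len(row) >= 3]

# Made-up depth table, for illustration only.
example = io.StringIO(
    "contigName\tcontigLen\ttotalAvgDepth\n"
    "k141_1\t2500\t12.5\n"
    "k141_2\t1800\t3.0\n"
)

for name, depth in depth_for_maxbin2(example):
    print(f"{name}\t{depth}")
```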

@jfy133 jfy133 self-assigned this Nov 16, 2021
@d4straub
Collaborator

d4straub commented Nov 17, 2021

Looks good. Some remarks that you might or might not be aware of:

  • mapping of short reads to all assemblies, producing .bam files, is implemented in the modules modules/local/bowtie2_assembly_build.nf and modules/local/bowtie2_assembly_align.nf
  • modules/local/metabat2.nf
    (1) produces contig sequencing depths for MetaBAT2 based on those bam files (I read somewhere that OMP_NUM_THREADS would not be used here, but that's wrong; in-line variables are used without export):
    OMP_NUM_THREADS=${task.cpus} jgi_summarize_bam_contig_depths --outputDepth depth.txt ${bam}

    (2) runs MetaBAT2
    metabat2 -t "${task.cpus}" -i "${assembly}" -a depth.txt -o "MetaBAT2/${meta.assembler}-${meta.id}" -m ${params.min_contig_size} --unbinned --seed ${params.metabat_rng_seed}

    (3) saves long unbinned contigs (because unbinned contigs are otherwise ignored by downstream analysis, I once had a 4 Mbp contig ignored and therefore this code was added).
    split_fasta.py "MetaBAT2/${meta.assembler}-${meta.id}.unbinned.fa" ${params.min_length_unbinned_contigs} ${params.max_unbinned_contigs} ${params.min_contig_size}

    (4) outputs the compressed depth file
    mv depth.txt.gz "${meta.assembler}-${meta.id}-depth.txt.gz"
  • depths for each MAG are produced with get_mag_depths.py in modules/local/mag_depths.nf; those are for reporting (modules/local/mag_depths_plot.nf & modules/local/mag_depths_summary.nf) and are not further used for binning
    get_mag_depths.py --bins ${bins} \
    --depths ${contig_depths} \
    --assembly_name "${meta.assembler}-${meta.id}" \
    --out "${meta.assembler}-${meta.id}-binDepths.tsv"
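
To make step (3) more concrete, here is a hedged Python sketch of how unbinned contigs could be partitioned using the three thresholds passed to split_fasta.py. The actual logic of that script is not shown in this thread, so the function below, its name, and its exact rules are assumptions for illustration only: the longest contigs (up to a maximum count) are kept individually, remaining contigs above the minimum contig size are pooled, and the rest are discarded.

```python
def split_unbinned(contigs, min_length_unbinned, max_unbinned, min_contig_size):
    """Partition (name, sequence) pairs into kept, pooled, and discarded names.

    Hypothetical rules, not taken from split_fasta.py itself.
    """
    # Longest contigs first, so the cap on kept contigs favours the largest ones.
    ranked = sorted(contigs, key=lambda c: len(c[1]), reverse=True)
    kept, pooled, discarded = [], [], []
    for name, seq in ranked:
        if len(seq) >= min_length_unbinned and len(kept) < max_unbinned:
            kept.append(name)       # long enough to keep as its own entry
        elif len(seq) >= min_contig_size:
            pooled.append(name)     # retained, but only as part of a pool
        else:
            discarded.append(name)  # below the minimum contig size
    return kept, pooled, discarded

# Toy contigs with made-up names and lengths.
contigs = [("c1", "A" * 50), ("c2", "A" * 30), ("c3", "A" * 10), ("c4", "A" * 2)]
kept, pooled, discarded = split_unbinned(
    contigs, min_length_unbinned=25, max_unbinned=1, min_contig_size=5
)
print(kept, pooled, discarded)
```

Under rules like these, a 4 Mbp unbinned contig would land in the kept group instead of being ignored downstream.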

I agree with using nf-core modules as much as possible, and I also agree that splitting up the metabat binning into several processes is a good plan. Currently, nf-core/mag uses local modules almost exclusively, because we did the DSL2 conversion before modules were set up in a stable way.

Also, please try to support the function of --binning_map_mode for MaxBin2 as well.

@jfy133
Member Author

jfy133 commented Nov 17, 2021

Thanks for the feedback @d4straub, I'll take everything into consideration 👍

A bit crappy about missing the 4 Mbp contig indeed, so I'll definitely look into how to deal with that.

@jfy133
Member Author

jfy133 commented Nov 25, 2021

Step one will be in #263

@maxibor
Member

maxibor commented Dec 7, 2021

One bin refinement tool to add to the list would be metawrap (the bin_refinement module mostly).
I opened an issue: nf-core/modules#1123

@jfy133
Member Author

jfy133 commented Mar 1, 2022

Update: we discovered that checkM doesn't (and most likely won't for a long time) work nicely with containers, because it requires you to modify a system-level file to point the tool to the location of a file [I only just saw the whole checkM -> BUSCO conversation back in the DSL1 days 🤦‍♀]. Given that the second half of the refinement workflow described in the OP relies heavily on this, as does metawrap, we've decided it's not worth implementing here, as it doesn't make sense to run polyMut/gunc on unrefined, low-quality bins.

I'll look back at adding DAS_Tool anyway as an option, but I will stop there.

@d4straub d4straub added this to the 2.2.0 milestone Mar 22, 2022
@jfy133
Member Author

jfy133 commented Jun 14, 2022

Completed in #291

@jfy133 jfy133 closed this as completed Jun 14, 2022