Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/!\ DO NOT MERGE: This is the FIRST PR for community review #57

Closed
wants to merge 81 commits into from

Conversation

maxulysse
Copy link
Member

Couldn't use the first commit an origin to create a new branch or as a base for the PR, so I used TEMPLATE, but results should be similar.
So please, DO NOT MERGE, if suggestions are made, please DO NOT ACCEPT either, but makes PR towards dev.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/pixelator branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link
Contributor

@FriederikeHanssen FriederikeHanssen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the custom scripts need to have licenses and the authors
Harshil alignment for code readability (i.e. on ? and : in the modules.config)

README.md Outdated Show resolved Hide resolved
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'
input = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/nf-core-pixelator/testdata/micro/test_samplesheet.csv"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will these move to nf-core/test-data once you are read to go public?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's small you could just put it within the repo (tests/data/test_samplesheet.csv or something).


- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
- [cutadapt](http://dx.doi.org/10.14806/ej.17.1.200)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it used internally? I don't see the modules, same for fastp

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we use cutadapt and fastp as subprocesses in the pixelator tool.
Should we remove them here to reduce confusion?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, does that hook into the demux process? I'm guessing it's to speed up by piping it all straight through?

Just wondering if that could be flipped off in the future to let Nextflow handle the job submission in this case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, cutadapt is used in both the demux and adapterqc processes.

Just wondering if that could be flipped off in the future to let Nextflow handle the job submission in this case.

This is something @adamrtalbot was also wondering about.

The main reason to wrap cutadapt is because we want to make sure running a pixelator analysis on a single sample remains straightforward without using the pipeline.

I do not think that unwrapping it from the pixelator single-cell demux command is something we would want to do in the future.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the same thing is done in bases2fastq with Element, for example. But there is a flag to flip it off as well.

ch_multiqc_logo.toList()
)
multiqc_report = MULTIQC.out.report.toList()
// TODO: Add MultiQC after plugins are implemented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it be added for v1.0?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will not make it before 1.0 (but we definitely plan to work on it soon after)

Copy link

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment from me :)

lib/WorkflowPixelator.groovy Outdated Show resolved Hide resolved
Copy link

@edmundmiller edmundmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few style things for me. I don't see anything major!

A thought, could the pixelator report be added to the tower.yml?

.gitignore Outdated Show resolved Hide resolved

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
- [cutadapt](http://dx.doi.org/10.14806/ej.17.1.200)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, does that hook into the demux process? I'm guessing it's to speed up by piping it all straight through?

Just wondering if that could be flipped off in the future to let Nextflow handle the job submission in this case.


withName: 'SAMPLESHEET_CHECK' {
if (params.pixelator_container) {
container = params.pixelator_container

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this container shouldn't need pixelator, just python by itself? Or is there something we're missing here?


withName: 'SAMPLESHEET_CHECK' {
if (params.pixelator_container) {
container = params.pixelator_container

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would just using process.container = '/path/to/container' in their local config for the run work if you just want every container to be pixelator?

ext.args = {
[
"--design ${meta.design}",
(params.trim_front != null)? "--trim-front ${params.trim_front}": '',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think the user input validation would get caught in the schema check at the beginning of the workflow.

What's the ideal arg to be passed? --trim-front 50 for example? And the problem is, you can't get --trim-front 0?

modules/local/samplesheet_check.nf Outdated Show resolved Hide resolved
nextflow.config Outdated Show resolved Hide resolved
nextflow.config Outdated Show resolved Hide resolved
Comment on lines 85 to 93
ch_panel_files_grouped = ch_report_data.map { id, data -> [ data[0], data[1] ] }
ch_amplicon_grouped = ch_report_data.map { id, data -> data[2] ? data[2].flatten() : [] }
ch_preqc_grouped = ch_report_data.map { id, data -> data[3] ? data[3].flatten() : [] }
ch_adapterqc_grouped = ch_report_data.map { id, data -> data[4] ? data[4].flatten() : [] }
ch_demux_grouped = ch_report_data.map { id, data -> data[5] ? data[5].flatten() : [] }
ch_collapse_grouped = ch_report_data.map { id, data -> data[6] ? data[6].flatten() : [] }
ch_graph_grouped = ch_report_data.map { id, data -> data[7] ? data[7].flatten() : [] }
ch_annotate_grouped = ch_report_data.map { id, data -> data[8] ? data[8].flatten() : [] }
ch_analysis_grouped = ch_report_data.map { id, data -> data[9] ? data[9].flatten() : [] }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's a "fancier" way to write this, but I also just appreciate the verbosity and that it's easy to understand what's going on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is quite verbose. Suggestions are welcome 🙂

WorkflowPixelator.initialise(params, log)
// Inject the samplesheet SHA-1 into the params object
ch_input = file(params.input)
params.samplesheet_sha = ch_input.bytes.digest('sha-1')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this, that's interesting.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use this value with our AWS config to set the output path.
This makes sure we do not override result files when doing reruns of previous experiments with modified parameters or samplesheet.

.github/workflows/ci.yml Outdated Show resolved Hide resolved
.pre-commit-config.yaml Outdated Show resolved Hide resolved
Comment on lines 22 to 54
def make_absolute_path(path: str, base: PathLike = None) -> str:
"""If `path` is a relative path without a scheme, resolve it as relative to `base`

Take into account that paths can be references to remote resources eg. [s3://, az://, gs://, file:///, https://, ... ]
"""
url = urllib.parse.urlparse(path)
if url.scheme:
return path

if base is None:
url_props = list(url)
url_props[0] = "file"
path_component = PurePath(url_props[2])
url_props[2] = str(path_component) if path_component.is_absolute() else str(PurePath("/", str(path_component)))
urllib.parse.urlunparse(url_props)
return urllib.parse.urlunparse(url_props)

# If the base url has a scheme we need to keep that
# purepath will remove double slashes and this invalidates the base scheme
base_url = urllib.parse.urlparse(str(base))
scheme = base_url[0] or "file"
resolved_path = PurePath(base_url.path) / path
url = list(base_url)
url[0] = base_url[0] or "file"
url[2] = str(resolved_path)

# Make sure there are three /// (hidden netloc) in urls with file scheme.
# other schemes [s3, gs, az] only use two //
if scheme == "file":
resolved_path = str(resolved_path) if resolved_path.is_absolute() else str(PurePath("/", str(resolved_path)))
return f"{scheme}://{resolved_path}"

return urllib.parse.urlunparse(url)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me nervous. Why is it necessary? You should never point to an absolute path in Nextflow really...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that we found very useful at pixelgen.

We are rewriting relative paths in the samplesheet relative to the parent directory of the samplesheet.
This works with a remote samplesheet as well (http, s3, az, gs, ...).

This allows us to create a directory with the samplesheet and input data files and move that directory anywhere (local or remote) without having to edit paths in the samplesheet. (As long as inputs and samplesheet have a common root dir of course)

@FriederikeHanssen This is also the reason why we are not yet using nf-validation to load the samplesheet.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm it's a bit risky messing around with strings like they're paths on the process.

You could do this in Nextflow after the samplesheet has been parsed, then go through a little logic problem for if file doesn't exist add "full" path then check if the file exists again then if fail if it's not there? It might be a little bit cleaner as well.

If it works for you so far no problem, but something to consider replacing in the near future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the relative path logic to be done in nextflow after the normal validation with nf-validation as you suggested in PixelgenTechnologies@1cc3116.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that solution is way nicer. It pulls on all the native Nextflow-y goodness. The main downside is it's not as obvious for a Python person but I think that's worth the cost.


withName: 'SAMPLESHEET_CHECK' {
if (params.pixelator_container) {
container = params.pixelator_container

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be easier to separate the background data sources from the software and use the same container, but configure alternative inputs.

Does the SAMPLESHEET_CHECK need to use alternative containers? Seems like a Python process and probably doesn't need the extra code?

Comment on lines 35 to 37
if (params.pixelator_container) {
container = params.pixelator_container
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, feels weird on every process including the ones that don't use the Pixelator software. But if you are going to add this to every process you can just do:

process {
    if (params.pixelator_container) {
        container = params.pixelator_container
    }
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels weird on every process including the ones that don't use the Pixelator software.

As well as collecting info about the nextflow environment, COLLECT_METADATA is also reporting on the exact execution environment of the pixelator tool. So this step must be run with the pixelator container / conda env / ... .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah dang. This is why nf-core pipelines create a versions.yml in every process then combine them at the end.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The versions.yml go a long way, but for pixelator we also wanted to have a list of all python packages in the env to debug compatibility issues. I have renamed the process to PIXELATOR_COLLECT_METADATA to make it more clear that this is pixelator specific.

// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'
input = "https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/nf-core-pixelator/testdata/micro/test_samplesheet.csv"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's small you could just put it within the repo (tests/data/test_samplesheet.csv or something).

docs/output.md Show resolved Hide resolved
def nextflowJson = builder.toPrettyString()

"""
echo '${nextflowJson}' > nextflow-metadata.json

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use jq for JSON validation here.

Copy link
Collaborator

@fbdtemme fbdtemme Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering what jq would add here? nextflowJson should already contains valid json since we generate it with the groovy json builder.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I was just wanting to be cautious in case it gets messed up somehow.

tag "$meta.id"
label 'process_medium'

conda "local::pixelator=0.12.0"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work for anyone else? How about on a clean cloud computer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a placeholder to test with a local conda recipe. After the main tool is public in bioconda we will update this. I will add some todo's as reminders.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No release until it's on bioconda then 😉


prefix = task.ext.prefix ?: "${meta.id}"
def args = task.ext.args ?: ''
def panelOpt = panel ?: panel_file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def panelOpt = panel ?: panel_file
def panelOpt = panel ? "--panel $panel_file" : ""

I might not understand what this is doing, but don't you want to add a panel if panel is true?

Also, you could simplify to this and drop the val(panel), where panel_file is an empty list for an optional input:

Suggested change
def panelOpt = panel ?: panel_file
def panelOpt = panel_file ? "--panel $panel_file" : ""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed a bit unclear.

The underlying --panel option in pixelator can take two types of inputs:

  • A string key to a built-in panel (most likely use case).
  • A path to a custom panel file.

In the pipeline we need to split this up in the samplesheet as panel for a known string key and panel_file for a a path. One or the other must be given. So if panel is set we pass that on to --panel if unset we will pass the file from panel_file to --panel. The samplesheet validation will make sure that not both are set (or empty).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Perhaps this so it's automatic based on whether panel or panel_file is populated?

input:
tuple val(meta), path(reads), val(panel), path(panel_file)
def panelOpt = panel      ? "--panel $panel"      : 
               panel_file ? "--panel $panel_file" : 
               ""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Perhaps this so it's automatic based on whether panel or panel_file is populated?

input:
tuple val(meta), path(reads), val(panel), path(panel_file)
def panelOpt = panel      ? "--panel $panel"      : 
               panel_file ? "--panel $panel_file" : 
               ""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this in a043d2b.

Samplesheet validation and channel construction is now done using nf-validation. Relative paths in samplesheets are now resolved in nextflow itself instead of the old check_samplesheet.py. panel keys are now validated as well.
@@ -57,7 +101,6 @@ params {
validationSchemaIgnoreParams = 'genomes'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need to add genomes to validationSchemaIgnoreParams since you're not using igenomes or any params.genomes

Copy link

@adamrtalbot adamrtalbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with all the changes, all lookin' good now!

Copy link
Member

@ewels ewels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ambarrio ambarrio closed this Aug 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants