sniff_format is inconsistent. #1539

Closed
edmundmiller opened this issue May 3, 2022 · 3 comments
Labels: bug (Something isn't working), template (nf-core pipeline/component template)

Comments

@edmundmiller
Contributor

edmundmiller commented May 3, 2022

Description of the bug

I'm getting a "no header" error on this CSV, even though it has a header row. The script is the default check_samplesheet.py from the template.

CI: https://github.com/nf-osi/viralintegration/runs/6267396130?check_suite_focus=true
Samplesheet: https://github.com/nf-core/test-datasets/blob/viralintegration/samplesheet/samplesheet.csv

https://github.com/nf-osi/viralintegration/blob/dev/bin/check_samplesheet.py

It works with this samplesheet: https://github.com/nf-core/test-datasets/blob/rnaseq/samplesheet/v3.4/samplesheet_test.csv
It also works if you remove either of the two samples, but fails when both are present.

Command used and terminal output

git clone https://github.com/nf-osi/viralintegration.git
cd viralintegration

wget https://github.com/nf-core/test-datasets/raw/viralintegration/samplesheet/samplesheet.csv
python3 bin/check_samplesheet.py samplesheet.csv valid.csv

wget https://github.com/nf-core/test-datasets/raw/rnaseq/samplesheet/v3.4/samplesheet_test.csv
python3 bin/check_samplesheet.py samplesheet_test.csv valid.csv

# OR in viralintegration
gh pr checkout 12
nextflow run . -profile test

System information

No response

edmundmiller added the bug label May 3, 2022
@edmundmiller
Contributor Author

I also can't find an example of a pipeline running the sniff_format function.

ewels added the template label May 5, 2022
@bunop
Contributor

bunop commented Oct 20, 2022

Dear all,

I've found a similar problem in a pipeline I wrote starting from the nf-core template. As far as I understand, the problem is caused by this call in check_samplesheet.py:

if not sniffer.has_header(peek):

According to the csv.Sniffer.has_header documentation, "This method is a rough heuristic and may produce both false positives and negatives." This is the simplest example I can produce:

import io
import csv

test = """sample,fastq_1,fastq_2
200-1-5,1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R1_001.fastq.gz,1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R2_001.fastq.gz
201-1-9,1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R1_001.fastq.gz,1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R2_001.fastq.gz
202-1-10,1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R1_001.fastq.gz,1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R2_001.fastq.gz
203-1-12,1_ID2101_203-1-12-H9H05KWZ-C2_S1_L001_R1_001.fastq.gz,1_ID2101_203-1-12-H9H05KWZ-C2_S1_L001_R2_001.fastq.gz"""

# split the data into a list of lines to test different line combinations
handle = io.StringIO(test)
lines = handle.readlines()

sniffer = csv.Sniffer()

# with all lines, the header is not detected
sniffer.has_header("".join(lines))  # False

# however, the header is detected with only the first three lines
sniffer.has_header("".join(lines[:3]))  # True

# adding one more line breaks the detection
sniffer.has_header("".join(lines[:4]))  # False

# however, the 4th line itself is not the problem
sniffer.has_header("".join(lines[:1] + lines[3:]))  # True

The Python documentation describes the heuristic behind this function. As far as I understand, renaming the samples so that their names are numbers solves the problem.
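
Since has_header is only a heuristic, one deterministic alternative is to compare the first row against the column names the pipeline expects. The following is a minimal sketch under that assumption; the helper name first_row_matches and the shortened file names are hypothetical and are not part of check_samplesheet.py:

```python
import csv
import io


def first_row_matches(sample, expected_columns, delimiter=","):
    """Return True if the first CSV row contains every expected column name.

    Hypothetical helper: it does not guess, it only compares names.
    """
    reader = csv.reader(io.StringIO(sample), delimiter=delimiter)
    first_row = next(reader, [])  # empty input yields an empty row
    found = {cell.strip() for cell in first_row}
    return set(expected_columns) <= found


# Shortened, made-up rows with the same shape as the failing samplesheet
sample = (
    "sample,fastq_1,fastq_2\n"
    "200-1-5,a_R1.fastq.gz,a_R2.fastq.gz\n"
    "201-1-9,b_R1.fastq.gz,b_R2.fastq.gz\n"
)
print(first_row_matches(sample, ["sample", "fastq_1", "fastq_2"]))  # True
```

Unlike the sniffer, the result here depends only on the first row, so adding or removing samples cannot change it.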

I understand that this issue is unpredictable and occurs in very few cases, so I would rather not propose a standard for the header format. However, would it be possible to add a parameter to check_samplesheet.py to skip this check when I'm sure the file is correct? Or to provide the header through parameters? This would also mean modifying samplesheet_check.nf to accept additional parameters, for example via modules.config.
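
A rough sketch of what such an opt-out could look like, assuming the script parses its arguments with argparse. The flag name --skip-header-check and the function header_check_passes are hypothetical and only illustrate the idea:

```python
import argparse
import csv


def parse_args(argv=None):
    """Parse command-line arguments, including a hypothetical opt-out flag."""
    parser = argparse.ArgumentParser(description="Validate a samplesheet (sketch).")
    parser.add_argument("file_in", help="Input samplesheet in CSV format.")
    parser.add_argument("file_out", help="Validated output samplesheet.")
    parser.add_argument(
        "--skip-header-check",  # hypothetical flag, not an existing option
        action="store_true",
        help="Trust that the file has a header and skip the sniffer heuristic.",
    )
    return parser.parse_args(argv)


def header_check_passes(peek, skip_header_check=False):
    """Return True if the header check is skipped or the sniffer finds a header."""
    if skip_header_check:
        return True
    return csv.Sniffer().has_header(peek)


if __name__ == "__main__":
    args = parse_args()
    with open(args.file_in, newline="") as handle:
        peek = handle.read(2048)
    if not header_check_passes(peek, args.skip_header_check):
        raise SystemExit("The given sample sheet does not appear to contain a header.")
```

The default behaviour stays unchanged; the check is bypassed only when the flag is passed explicitly.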

@mirpedrol
Member

solved in #2194
