samplesheet check too stringent for header check #152

Open
askol-lurie opened this issue Mar 7, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@askol-lurie

askol-lurie commented Mar 7, 2023

Description of the bug

I'm starting to use v2.0.0 of nf-core/hic. I used the previous version but always submitted one sample at a time. This time I created a samplesheet, and I am running into an issue where the pipeline decides the file has no header, although it does. The has_header() method of csv.Sniffer (from Python's csv module), which check_samplesheet.py relies on, is overly stringent in how it identifies headers. It seems likely to fail for most samplesheets, as it does for mine.

The following sample sheets will fail and succeed, respectively:

sample,fastq_1,fastq_2
RH41_B6,1,2
SMS_A3,1,2

sample,fastq_1,fastq_2
RH41_B6,1,2
SMS_A3,p1,q2

Command used and terminal output

nextflow run /home/ass6094/bin/nextflow_modules/hic_v2.0.0/main.nf \
  --digestion 'qiagen' \
  --input /projects/b1103/HIC_Macquarrie/hic_round2/samplesheet.csv \
  --outdir $outdir \
  --fasta /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa \
  --bwt2_index /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/ \
  --split_fastq --fastq_chunks_size 10000000 \
  --max_memory 64.GB \
  --bin_size 20000,40000,150000,500000,1000000 \
  --bwt2_opts_end2end '--very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14' \
  --bwt2_opts_trimmed '--very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14' \
  -profile singularity,slurmshort \
  -with-report hic_report.html -with-trace \
  -with-timeline hic_timeline.html -with-dag hic_dag.png \
  -bg -w $scratch

Output:

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/hic v2.0.0
------------------------------------------------------
Core Nextflow options
  runName                      : magical_hamilton
  containerEngine              : singularity
  launchDir                    : /projects/b1103/HIC_Macquarrie/hic_round2
  workDir                      : /scratch/ass6094/hic/nextflow
  projectDir                   : /home/ass6094/bin/nextflow_modules/hic_v2.0.0
  userName                     : ass6094
  profile                      : singularity,slurmshort
  configFiles                  : /home/ass6094/bin/nextflow_modules/hic_v2.0.0/nextflow.config

Input/output options
  input                        : /projects/b1103/HIC_Macquarrie/hic_round2/samplesheet.csv
  outdir                       : /projects/b1103/HIC_Macquarrie/hic_round2/NextflowResults/

Reference genome options
  fasta                        : /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
  bwt2_index                   : /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/

Digestion Hi-C
  digestion                    : qiagen

DNAse Hi-C
  min_cis_dist                 : 0

Alignments
  split_fastq                  : true
  fastq_chunks_size            : 10000000
  bwt2_opts_end2end            : --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14
  bwt2_opts_trimmed            : --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14

Valid Pairs Detection
  max_insert_size              : 0
  min_insert_size              : 0
  max_restriction_fragment_size: 0
  min_restriction_fragment_size: 0

Contact maps
  bin_size                     : 20000,40000,150000,500000,1000000
  ice_filter_high_count_perc   : 0
  res_zoomify                  : null

Downstream Analysis
  res_dist_decay               : 250000
  tads_caller                  : insulation
  res_tads                     : 40000

Max job request options
  max_cpus                     : 14
  max_memory                   : 64.GB
  max_time                     : 10d

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/hic for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.2669513

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/hic/blob/master/CITATIONS.md
------------------------------------------------------
WARN: A process with name 'BOWTIE2_ALIGN_TRIMMED' is defined more than once in module script: /home/ass6094/bin/nextflow_modules/hic_v2.0.0/./workflows/../subworkflows/local/./hicpro_mapping.nf -- Make sure to not define the same function as process
[65/1f640c] Submitted process > NFCORE_HIC:HIC:PREPARE_GENOME:GET_RESTRICTION_FRAGMENTS (^GATC)
[11/c26455] Submitted process > NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)
[10/5e7883] Submitted process > NFCORE_HIC:HIC:PREPARE_GENOME:CUSTOM_GETCHROMSIZES (genome.fa)
Error executing process > 'NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)'

Caused by:
  Process `NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      samplesheet.csv \
      samplesheet.valid.csv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  

Command error:
  WARNING: While bind mounting '/projects/b1103/HIC_Macquarrie/hic_round2:/projects/b1103/HIC_Macquarrie/hic_round2': destination is already in the mount point list
  WARNING: While bind mounting '/home/ass6094/bin/nextflow_modules/hic_v2.0.0/bin:/home/ass6094/bin/nextflow_modules/hic_v2.0.0/bin': destination is already in the mount point list
  WARNING: While bind mounting '/scratch/ass6094/hic/nextflow/11/c26455104fe4b102d7953120fa3a65:/scratch/ass6094/hic/nextflow/11/c26455104fe4b102d7953120fa3a65': destination is already in the mount point list
  WARNING: Skipping mount /hpc/software/singularity/3.8.1/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [CRITICAL] The given sample sheet does not appear to contain a header.

Relevant files

No response

System information

nextflow version 22.10.5.5840
Hardware: Slurm HPC
Executor: slurm
Container engine: Singularity
OS: Redhat Linux 7.9
Version of nf-core/hic 2.0.0

askol-lurie added the bug label Mar 7, 2023
askol-lurie changed the title from "samplesheet check too too stringent for header check" to "samplesheet check too stringent for header check" Mar 9, 2023
@nservant
Collaborator

This is definitely linked to the csv.Sniffer.has_header function, which returns False. No idea why.

@nservant
Collaborator

I checked whether the csv package is able to detect the delimiter in both cases, and it is. Both files report ',' as the delimiter ...

Line 60

d = sniffer.sniff(peek)
print(repr(d.delimiter))
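For anyone who wants to reproduce this check outside the pipeline, a minimal self-contained sketch follows. The `peek` string is a hypothetical stand-in for the head of the samplesheet that check_samplesheet.py reads; only `csv.Sniffer.sniff` is taken from the snippet above.

```python
import csv

# Hypothetical stand-in for the first bytes of samplesheet.csv.
peek = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,p1,q2\n"

sniffer = csv.Sniffer()
d = sniffer.sniff(peek)  # returns a Dialect subclass describing the sample
print(repr(d.delimiter))
```

If the delimiter is detected correctly here, the header failure has to come from a later step rather than from dialect detection.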

@nservant
Collaborator

nservant commented Mar 23, 2023

I think I have the explanation for the provided example!
has_header returns False because the two lines do not have the same column types!

In

RH41_B6,1,2
SMS_A3,p1,q2

the 1 and 2 are seen as integers, while p1 and q2 are seen as strings.

python/cpython#87791

@nservant
Collaborator

nservant commented Mar 23, 2023

To continue on that, and still based on the thread at python/cpython#87791:
It seems that has_header automatically detects the type of each column from its content (numbers/letters?).
When two rows have a different column typing pattern, has_header returns False.
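The typing step can be sketched as below. This is a simplified re-implementation written for illustration, assuming the behaviour of the CPython source for csv.Sniffer.has_header: a cell that parses as a number is typed as numeric, any other cell is typed by its string length, and a column whose type changes between rows is dropped from consideration. The helper name `column_signatures` is hypothetical and not part of the csv module.

```python
def column_signatures(rows):
    """Return {column index: signature} for columns that stay consistent.

    A numeric cell gets the signature `complex`; any other cell gets its
    string length. A column whose signature changes between rows is dropped.
    """
    signatures = {}
    dropped = set()
    for row in rows:
        for col, cell in enumerate(row):
            if col in dropped:
                continue
            try:
                complex(cell)
                sig = complex
            except ValueError:
                sig = len(cell)  # non-numeric: fall back to string length
            if col not in signatures:
                signatures[col] = sig
            elif signatures[col] != sig:
                del signatures[col]
                dropped.add(col)
    return signatures


# Data rows only (header excluded), taken from the example above.
mixed = [["RH41_B6", "1", "2"], ["SMS_A3", "p1", "q2"]]
numeric = [["RH41_B6", "1", "2"], ["SMS_A3", "1", "2"]]
print(column_signatures(mixed))
print(column_signatures(numeric))
```

Under this reading, the header vote is taken only over the columns that survive. When every column is dropped, nothing votes for a header, which would explain a False result.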

@nservant
Collaborator

sample,fastq_1,fastq_2
101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz
12-female-liver,/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz

is detected as having no header and crashes,
whereas

sample,fastq_1,fastq_2
101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz
120-male-liver,/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz

works! That's crazy :)
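The two sheets above can be checked directly with the standard library. This is a minimal sketch that calls only `csv.Sniffer.has_header`; the variable names are illustrative. The two sheets differ only in the sample name of the last row, whose length changes from 15 to 14 characters.

```python
import csv

failing = (
    "sample,fastq_1,fastq_2\n"
    "101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz\n"
    "12-female-liver,/data/013649718184/file2_R1.fastq.gz,"
    "/data/013649718184/file2_R2.fastq.gz\n"
)
# Same sheet, with the second sample renamed to match the length of the first.
passing = failing.replace("12-female-liver", "120-male-liver")

sniffer = csv.Sniffer()
results = {
    "12-female-liver": sniffer.has_header(failing),
    "120-male-liver": sniffer.has_header(passing),
}
for name, has_header in results.items():
    print(name, has_header)
```

On the CPython versions this was reasoned against, the first sheet is reported as having no header and the second as having one, matching the behaviour described above.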

@nservant
Collaborator

This will be fixed in the next version:

nf-core/tools#2194
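Until then, one possible workaround is to skip the heuristic and compare the first row against the expected column names. The sketch below is an assumption for illustration only and is not claimed to be what nf-core/tools#2194 implements; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, with the column names taken from the sheets in this thread.

```python
import csv
import io

# Column names as they appear in the samplesheets in this thread.
REQUIRED_COLUMNS = {"sample", "fastq_1", "fastq_2"}


def missing_columns(text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the first row of `text`."""
    reader = csv.DictReader(io.StringIO(text))
    found = set(reader.fieldnames or [])  # fieldnames is None for empty input
    return sorted(required - found)


sheet = "sample,fastq_1,fastq_2\n12-female-liver,a.fastq.gz,b.fastq.gz\n"
print(missing_columns(sheet))
```

An explicit comparison like this depends only on the header row itself, so the content of the data rows cannot change the outcome.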

@maxulysse
Member

Just because of the change 12-female-liver -> 120-male-liver in the sample column?

@nservant
Collaborator

Yes, but this will be fixed in the next nf-core template.
