265 metadata validation required fields #271

Open
wants to merge 15 commits into
base: dev

jessicarowell
Collaborator

Description

  • Added a YAML file containing a dictionary of BioSample packages and their required fields.
  • Integrated this YAML dictionary into metadata validation, which now checks the required and "at least one required" fields of the user's selected BioSample package.
  • Made various minor improvements to the clarity of the metadata validation script.
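
A minimal sketch of how checks for required and "at least one required" fields could work. The dictionary layout, the package name ExamplePackage.1.0, its field lists, and the function names are all assumptions made for illustration; none of them are taken from the YAML file or from validate_metadata.py. A plain dict stands in for the parsed YAML so the example runs without extra dependencies.

```python
from typing import List

# Hypothetical layout: package name -> required fields and "at least one" groups.
# In the PR this information lives in a YAML file; a plain dict stands in here.
FIELDS_KEY = {
    "ExamplePackage.1.0": {
        "required": ["organism", "collection_date", "geo_loc_name"],
        "at_least_one_required": [["strain", "isolate"]],
    }
}


def is_blank(value) -> bool:
    """Treat None, NaN floats, and empty or whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return str(value).strip() == ""


def check_required_fields(sample: dict, package: str,
                          fields_key: dict = FIELDS_KEY) -> List[str]:
    """Return error messages for one sample; an empty list means it passed."""
    if package not in fields_key:
        return [f"Unknown BioSample package: {package}"]
    spec = fields_key[package]
    errors = []
    # Every field listed under "required" must be present and non-blank.
    for field in spec.get("required", []):
        if is_blank(sample.get(field)):
            errors.append(f"Missing required field: {field}")
    # Each group needs at least one non-blank member.
    for group in spec.get("at_least_one_required", []):
        if all(is_blank(sample.get(name)) for name in group):
            errors.append(
                "At least one of these fields is required: " + ", ".join(group)
            )
    return errors
```

Because a missing column and an empty cell are both treated as blank here, dropping a whole column from the metadata file would be reported the same way as leaving its cells empty.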

Checklist

Go Through Checklist Below and Place A ✔️ (X Inside the Box) if Completed

General Checks

  • Have you run appropriate tests (unit/integration/end-to-end) to check logic across run environments (Conda/Docker/Singularity on Scicomp/AWS/NF Tower/Local)?
    singularity on scicomp

    For each relevant configuration:

    • Can the program run completely through without erroring out?
    • Does it produce the expected outputs, given the inputs provided?
  • Have you conducted proper linting procedures?

    • Numpy formatted docstrings for functions
    • Comments explaining lines of code
    • Consistent and intuitive naming conventions for variables, functions, classes, methods, attributes, and scripts
    • Single empty line between class functions, two lines between non-class functions, and two lines between imports and code body
    • Camel case formatting for class names
  • [] Have you updated existing documentation (README.md, etc.) or created new ones within docs?
    None needed

CDC Checks

  • [N/A] Did you check for sensitive data, and remove any?
  • [N/A] If you added or modified HTML, did you check that it was 508 compliant?

Are additional approvals needed for this change? If so, please mention them below:

Are there potential vulnerabilities or licensing issues with any new dependencies introduced? If so, please mention them below:

@RamiyapriyaS
Collaborator

RamiyapriyaS commented Feb 21, 2025

Ran the following test:

  • Removed strain from the metadata file
  • Updated the submission config file to use the following BioSample package: BioSample_package: "OneHealthEnteric.1.0"
  • Ran the pipeline with the updated file

This produced the following error message:

WARN: Access to undefined parameter `enable_conda` -- Initialise it to a default value eg. `params.enable_conda = some_value`
ERROR ~ Error executing process > 'TOSTADAS_WORKFLOW:TOSTADAS:METADATA_VALIDATION'

Caused by:
  Process `TOSTADAS_WORKFLOW:TOSTADAS:METADATA_VALIDATION` terminated with an error exit status (1)


Command executed:

  validate_metadata.py             --meta_path onehealth_biosample_package_template.xlsx             --output_dir .             --custom_fields_file /scicomp/home-pure/rjd0/tostadas/assets/custom_meta_fields/example_custom_fields.json             --validate_custom_fields false             --date_format_flag s                                                     --config_file /scicomp/home-pure/rjd0/tostadas/conf/submission_config.yaml             --biosample_fields_key /scicomp/home-pure/rjd0/tostadas/assets/biosample_fields_key.yaml

Command exit status:
  1

Command output:
  flag: s
  6: False

Command error:
  Traceback (most recent call last):
    File "/scicomp/home-pure/rjd0/tostadas/bin/validate_metadata.py", line 1061, in <module>
  flag: s
  6: False
      metadata_validation_main()
    File "/scicomp/home-pure/rjd0/tostadas/bin/validate_metadata.py", line 67, in metadata_validation_main
      validate_checks.validate_main()
    File "/scicomp/home-pure/rjd0/tostadas/bin/validate_metadata.py", line 334, in validate_main
      self.check_date()
    File "/scicomp/home-pure/rjd0/tostadas/bin/validate_metadata.py", line 582, in check_date
      raise ValueError("Date validation failed. Check 'date_errors.txt' for details.")
  ValueError: Date validation failed. Check 'date_errors.txt' for details.

Work dir:
  /scicomp/scratch/rjd0/nextflow/work/a7/6de6eaee6110a08299f378229478e7

Container:
  /scicomp/home-pure/rjd0/.singularity/staphb-tostadas-latest.img

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

@RamiyapriyaS
Collaborator

RamiyapriyaS commented Feb 21, 2025

Ran the following test:

  • Removed organism from the metadata file
  • Updated the submission config file to use the following BioSample package: "SARS-CoV-2.cl.1.0"
  • Ran the pipeline with the updated file

Metadata validation ran without error. It should have failed because organism is a required field.

Additional information:

Contents of ~/tostadas/test_meta_one_health/validation_outputs/meta_wO_organism/errors/full_error.txt:

General Errors:

	Passed all global checks!

Sample Errors:

	Number of Valid Samples: 0/1

	DRR152972:
		SRA Submission Detected:  Illumina Found	 Nanopore Not found
		Errors:
		assets/sample_fastqs/rsv/DRR152972.R1.fastq.gz does not exist or there are permission problems
		assets/sample_fastqs/rsv/DRR152972.R2.fastq.gz does not exist or there are permission problems

@jessicarowell
Collaborator Author

It is failing when I test it, but the error reported is a missing geo_loc_name rather than the expected "missing organism".

geo_loc_name is created in the script after the main validation. So possibly the validation is "failing to appropriately fail", and the script then fails later because it never reaches the geo_loc_name creation step? I'm working on it.

Error:
Command executed:

validate_metadata.py --meta_path mpxv_test_metadata_update.xlsx --output_dir . --custom_fields_file /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json --validate_custom_fields false --date_format_flag s --config_file /scicomp/home-pure/ick4/02.scratch/submission_config.yaml --biosample_fields_key /scicomp/home-pure/ick4/01.scripts/tostadas/assets/biosample_fields_key.yaml

Command exit status:
1

Command output:
flag: s
Custom file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json
The following fields in the metadata dataframe are not in the JSON custom fields: {'host_disease'}
Custom file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json
WARNING: {'host_disease'} were not found in the custom fields JSON and were not validated. They are included as-is.

Metadata Validation Failed Please Consult : ./mpxv_test_metadata_update/errors/full_error.txt for a Detailed List

Command error:
flag: s
Custom file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json
The following fields in the metadata dataframe are not in the JSON custom fields: {'host_disease'}
Custom file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json
WARNING: {'host_disease'} were not found in the custom fields JSON and were not validated. They are included as-is.

Metadata Validation Failed Please Consult : ./mpxv_test_metadata_update/errors/full_error.txt for a Detailed List

Work dir:
/scicomp/scratch/ick4/4a/f6206febc83fa3c30989b0917e9f04

Command:
nextflow run main.nf -profile conda,test --species virus --output_dir test --submission_config /scicomp/home-pure/ick4/02.scratch/submission_config.yaml --annotation false --submission false --submission_wait_time 1 --meta_path /scicomp/home-pure/ick4/02.scratch/mpxv_test_metadata_update.xlsx

mpxv_test_metadata_update.xlsx

@jessicarowell
Collaborator Author

OK, I fixed the geo_loc_name problem, and now I'm trying to figure out why it returns "files don't exist or have insufficient permissions" errors for the test fastqs under assets. os.path.exists() finds them when I run Python interactively, so I'm not sure what's up. Bottom line: metadata validation is still failing, and that's the only error listed in the errors text file.
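
One possible explanation for os.path.exists() behaving differently in an interactive session is that a relative path is resolved against the current working directory of the process. The sketch below only demonstrates that general behavior with temporary directories; the names project_dir and work_dir and the file name are stand-ins, not the pipeline's real locations.

```python
import os
import tempfile


def exists_from(cwd: str, relative_path: str) -> bool:
    """Report whether relative_path exists when the process runs from cwd."""
    original = os.getcwd()
    try:
        os.chdir(cwd)
        return os.path.exists(relative_path)
    finally:
        # Always restore the working directory, even if the check raises.
        os.chdir(original)


def demo() -> tuple:
    """Create a file under one directory and look for it from two directories."""
    with tempfile.TemporaryDirectory() as project_dir, \
            tempfile.TemporaryDirectory() as work_dir:
        rel = os.path.join("assets", "sample.fastq.gz")
        os.makedirs(os.path.join(project_dir, "assets"))
        open(os.path.join(project_dir, rel), "w").close()
        # Same relative path, different working directories.
        return exists_from(project_dir, rel), exists_from(work_dir, rel)


if __name__ == "__main__":
    print(demo())
```

The same relative path is found from the directory that contains assets and not found from the other one, which matches a check that succeeds from the repository root but fails from a separate task directory.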

@jessicarowell
Collaborator Author

This error is almost resolved. The first issue resulted from a change I had previously made to make metadata validation fail if any one sample fails (as it should). Then I turned to the "file not found" problem for SRA file paths. It happens because the metadata file gives relative paths, but they are relative to projectDir, not workDir. Unfortunately, the metadata validation script is quite convoluted, so resolving the paths in one location didn't fix the problem: the file paths still aren't being resolved everywhere in the script. See the DEBUG statements below:

Command error:
flag: s
Custom file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json
Custom file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/custom_meta_fields/example_custom_fields.json
[DEBUG] Newly resolved illumina path 1: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/LIY15306A2_2022_054_3005007722.R_1.mpx.fastq.gz
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/LIY15306A2_2022_054_3005007722.R_1.mpx.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/LIY15306A2_2022_054_3005007722.R_2.mpx.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/LIY15306A2_2022_054_3005007722.R_1.mpx.fastq.gz, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/LIY15306A2_2022_054_3005007722.R_2.mpx.fastq.gz, Exists: False
[DEBUG] Newly resolved illumina path 1: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-028-7666_S3_L001_R1_001.fastq.gz
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-028-7666_S3_L001_R1_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-028-7666_S3_L001_R2_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-028-7666_S3_L001_R1_001.fastq.gz, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-028-7666_S3_L001_R2_001.fastq.gz, Exists: False
[DEBUG] Newly resolved illumina path 1: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-029-7670_S4_L001_R1_001.fastq.gz
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-029-7670_S4_L001_R1_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-029-7670_S4_L001_R2_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-029-7670_S4_L001_R1_001.fastq.gz, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-029-7670_S4_L001_R2_001.fastq.gz, Exists: False
[DEBUG] Newly resolved illumina path 1: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-034-7690_S9_L001_R1_001.fastq.gz
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-034-7690_S9_L001_R1_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-034-7690_S9_L001_R2_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-034-7690_S9_L001_R1_001.fastq.gz, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-034-7690_S9_L001_R2_001.fastq.gz, Exists: False
[DEBUG] Newly resolved illumina path 1: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-053-7721_S6_L001_R1_001.fastq.gz
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-053-7721_S6_L001_R1_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/assets/sample_fastqs/mpox/2022-053-7721_S6_L001_R2_001.fastq.gz, Exists: True
[DEBUG] Checking file: /scicomp/home-pure/ick4/01.scripts/tostadas/, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-053-7721_S6_L001_R1_001.fastq.gz, Exists: False
[DEBUG] Checking file: assets/sample_fastqs/mpox/2022-053-7721_S6_L001_R2_001.fastq.gz, Exists: False

Metadata Validation Failed Please Consult : ./mpxv_test_metadata_update/errors/full_error.txt for a Detailed List

Work dir:
/scicomp/scratch/ick4/a1/e078b8fa9ba0ee80f0351832ce7d2f

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details
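
The DEBUG output above shows the same files checked twice, once as resolved absolute paths and once as the raw relative paths. One way to avoid that kind of split is to resolve every path in a single helper and have all later checks use only the resolved values. This is a sketch under that assumption; resolve_against and missing_files are hypothetical names, not functions from validate_metadata.py, and treating blank values as "no file given" is a guess about why the bare project directory appears in the log.

```python
from pathlib import Path
from typing import List, Optional


def resolve_against(base_dir: str, raw_path: Optional[str]) -> Optional[Path]:
    """Return an absolute path, anchoring relative paths at base_dir.

    Blank values are returned as None so callers can skip them instead of
    ending up with the base directory itself.
    """
    if raw_path is None or not str(raw_path).strip():
        return None
    path = Path(str(raw_path).strip())
    return path if path.is_absolute() else Path(base_dir) / path


def missing_files(base_dir: str, raw_paths: list) -> List[str]:
    """Resolve every path once and return those that are not existing files."""
    resolved = [resolve_against(base_dir, p) for p in raw_paths]
    return [str(p) for p in resolved if p is not None and not p.is_file()]
```

A caller would pass the project directory as base_dir and the raw path values from one metadata row, then report whatever missing_files returns as that sample's errors.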

@jessicarowell
Collaborator Author

Ready for testing! Everything has been fixed.

2 participants