Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: NGS-bits SampleSimilarity #3500

Merged
merged 75 commits into from
Dec 2, 2024
Merged

feat: NGS-bits SampleSimilarity #3500

merged 75 commits into from
Dec 2, 2024

Conversation

tdayris
Copy link
Contributor

@tdayris tdayris commented Nov 25, 2024

Adds a wrapper for ngs-bits SampleSimilarity. There are a lot of subcommands for ngs-bits. Many of them requires a connection to a mysql database (NGSD), which I don't know how to deploy smoothly through a snakemake wrapper.

I'll wrap some other subcommands soon.

QC

While the contributions guidelines are more extensive, please particularly ensure that:

  • test.py was updated to call any added or updated example rules in a Snakefile
  • input: and output: file paths in the rules can be chosen arbitrarily
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:)
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to
  • the meta.yaml contains a link to the documentation of the respective tool or command under url:
  • conda environments use a minimal amount of channels and packages, in recommended ordering

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced environment.linux-64.pin.txt and environment.yaml files for managing package dependencies in a Conda environment.
    • Added a new metadata file meta.yaml for the NGS-bits SampleSimilarity tool, detailing input/output specifications and documentation links.
    • Added two new VCF files (a.vcf and b.vcf) containing genomic variant data for testing purposes.
    • Created a wrapper.py file to facilitate execution of the SampleSimilarity tool within a Snakemake workflow.
  • Tests

    • Implemented a new test function for the SampleSimilarity wrapper to ensure functionality and correctness.

tdayris and others added 30 commits September 21, 2020 09:16
* perf: update bio/bcftools/index/environment.yaml.

* perf: update bio/bcftools/index/environment.yaml.

* perf: update bio/bcftools/index/environment.yaml.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
* Add autobump action

* fix paths

* dbg

* dbg branch

* add checkout

* dbg

* trigger rerun

* entity regex and add label

* dbg

* Update autobump.yml

* Update autobump.yml
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (3)
bio/ngsbits/samplesimilarity/test/Snakefile (2)

1-6: Document input requirements and validate file formats

The rule could benefit from additional documentation and input validation:

  1. Add a rule docstring explaining the purpose and requirements
  2. Consider adding input validation for VCF files
  3. Clarify if the optional inputs (ref, roi) will be implemented
 rule test_ngsbits_samplesimilarity:
+    """
+    Test wrapper for ngs-bits SampleSimilarity tool.
+    
+    Input:
+        samples: List of VCF files to compare
+        ref: (optional) Path to reference genome index (fasta.fai)
+        roi: (optional) Regions of interest in BED format
+    """
     input:
-        samples=expand("{sample}.vcf", sample=("a", "b")),
+        samples=expand("{sample}.vcf", sample=("a", "b")),
+        # Validate VCF files
+        samples_validated=expand("{sample}.vcf", sample=("a", "b")),

7-10: Consider resource configuration best practices

While the current configuration is suitable for testing, consider:

  1. Adding a temp directory parameter for intermediate files
  2. Making threads configurable for production use
     output:
         "similarity.tsv",
-    threads: 1
+    threads: lambda wildcards, attempt: attempt * 2
+    resources:
+        mem_mb=lambda wildcards, attempt: attempt * 1024
     log:
         "samplesimilarity.log",
+    params:
+        # Add temp directory parameter
+        temp_dir=lambda wildcards: "temp/samplesimilarity"
test_wrappers.py (1)

199-212: Consider adding error case tests

Consider adding additional test cases to verify error handling, such as testing with invalid input files or missing parameters.

Example addition:

def test_ngsbits_samplesimilarity_errors(run):
    with pytest.raises(subprocess.CalledProcessError):
        run(
            "bio/ngsbits/samplesimilarity",
            ["snakemake", "--cores", "1", "--use-conda", "-F", "similarity.tsv"],
            check_log=lambda log: "error" in log.lower()
        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 50980f1 and e387867.

📒 Files selected for processing (8)
  • bio/ngsbits/samplesimilarity/environment.linux-64.pin.txt (1 hunks)
  • bio/ngsbits/samplesimilarity/environment.yaml (1 hunks)
  • bio/ngsbits/samplesimilarity/meta.yaml (1 hunks)
  • bio/ngsbits/samplesimilarity/test/Snakefile (1 hunks)
  • bio/ngsbits/samplesimilarity/test/a.vcf (1 hunks)
  • bio/ngsbits/samplesimilarity/test/b.vcf (1 hunks)
  • bio/ngsbits/samplesimilarity/wrapper.py (1 hunks)
  • test_wrappers.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • bio/ngsbits/samplesimilarity/environment.linux-64.pin.txt
  • bio/ngsbits/samplesimilarity/environment.yaml
  • bio/ngsbits/samplesimilarity/meta.yaml
  • bio/ngsbits/samplesimilarity/test/a.vcf
🧰 Additional context used
📓 Path-based instructions (2)
bio/ngsbits/samplesimilarity/wrapper.py (2)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.


Pattern **/wrapper.py: Do not complain about use of undefined variable called snakemake.

test_wrappers.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.7.0)
bio/ngsbits/samplesimilarity/wrapper.py

11-11: tempfile.TemporaryDirectory imported but unused

Remove unused import: tempfile.TemporaryDirectory

(F401)


13-13: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


26-26: Undefined name snakemake

(F821)

🔇 Additional comments (9)
bio/ngsbits/samplesimilarity/test/Snakefile (1)

1-14: Verify database connectivity requirements

The PR objectives mention MySQL database connectivity challenges with ngs-bits tools. Please clarify:

  1. Does SampleSimilarity require database access?
  2. If yes, how should database credentials be configured?
  3. Are there any fallback modes for running without a database?
bio/ngsbits/samplesimilarity/wrapper.py (2)

13-16: LGTM!

The logging setup follows Snakemake wrapper best practices by capturing both stdout and stderr.

🧰 Tools
🪛 Ruff (0.7.0)

13-13: Undefined name snakemake

(F821)


1-34: Verify database connection requirements

The PR objectives mention challenges with MySQL database connections for ngs-bits tools. Please verify if SampleSimilarity requires database access and if so, how it should be configured.

🧰 Tools
🪛 Ruff (0.7.0)

11-11: tempfile.TemporaryDirectory imported but unused

Remove unused import: tempfile.TemporaryDirectory

(F401)


13-13: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


26-26: Undefined name snakemake

(F821)

bio/ngsbits/samplesimilarity/test/b.vcf (5)

5-11: LGTM! INFO fields are well-defined.

The INFO field definitions follow VCF specifications correctly and include all necessary fields for structural variants with proper type definitions and descriptions.


12-15: LGTM! FORMAT fields are properly defined.

The FORMAT fields include all essential genotype information with correct specifications for types and number of values.


16-16: Verify the presence of corresponding a.vcf file.

The sample column name 'a' suggests this file (b.vcf) is part of a test pair.

✅ Verification successful

Corresponding a.vcf file exists as expected

The file bio/ngsbits/samplesimilarity/test/a.vcf is present in the test directory, confirming that b.vcf is indeed part of a test pair.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for the presence of corresponding a.vcf file
fd "a.vcf" "bio/ngsbits/samplesimilarity/test"

Length of output: 84


17-18: Verify if missing QUAL scores and low coverage are acceptable for testing.

Observations about the variant records:

  1. Both variants have missing quality scores (QUAL=.)
  2. The insertion variant has very low alternate allele support (only 6 reads)
  3. The deletion variant has good coverage (40789,1734 reads)

While this might be acceptable for testing purposes, consider:

  • Adding quality scores for more realistic test data
  • Using variants with better coverage for the insertion
✅ Verification successful

Missing QUAL scores and low coverage are acceptable in test data

Looking at other VCF records in the test files across the repository, missing quality scores (.) appear to be a common pattern in test data. The low coverage for the insertion variant (6 reads) is acceptable in this context since:

  1. This is test data, not production data
  2. The variant is marked as PASS despite low coverage, which is suitable for testing edge cases
  3. Having variants with different coverage levels (low vs high) in test data is actually beneficial for testing robustness
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if other test VCF files in the repository typically include QUAL scores
rg -A 1 "^[^#].*\t\.\t" --type vcf

Length of output: 64


Script:

#!/bin/bash
# Search for VCF files and check their content for quality scores
fd -e vcf -x cat {} | grep -v "^#" | head -n 5

Length of output: 721


1-4: Consider updating VCF version and verifying data currency.

The file uses VCF v4.0 (current version is 4.3) and contains data from 2017. While this might be acceptable for testing purposes, consider:

  • Updating to a more recent VCF version for better compatibility
  • Verifying if more current test data would be beneficial
test_wrappers.py (1)

199-212: LGTM! Test function follows established patterns

The test function follows the consistent pattern used throughout the file, with appropriate use of the run fixture and standard snakemake parameters.

bio/ngsbits/samplesimilarity/test/Snakefile Show resolved Hide resolved
bio/ngsbits/samplesimilarity/wrapper.py Outdated Show resolved Hide resolved
bio/ngsbits/samplesimilarity/wrapper.py Outdated Show resolved Hide resolved
bio/ngsbits/samplesimilarity/wrapper.py Outdated Show resolved Hide resolved
bio/ngsbits/samplesimilarity/wrapper.py Outdated Show resolved Hide resolved
bio/ngsbits/samplesimilarity/wrapper.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
bio/ngsbits/samplesimilarity/wrapper.py (2)

16-24: Consider adding input validation.

While the parameter handling is safe, consider adding validation to ensure the input files exist and are accessible before proceeding with the execution.

 ref = snakemake.input.get("ref")
 if ref:
+    if not os.path.exists(ref):
+        raise ValueError(f"Reference file not found: {ref}")
     extra += f" -ref '{ref}' "

 roi = snakemake.input.get("roi")
 if roi:
+    if not os.path.exists(roi):
+        raise ValueError(f"ROI file not found: {roi}")
     extra += f" -roi '{roi}' "
🧰 Tools
🪛 Ruff (0.8.0)

16-16: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


25-31: Consider validating input samples.

While the mode detection logic is correct, consider adding validation to ensure all input samples exist and are accessible.

 input_files = snakemake.input.get("samples")
+if not input_files:
+    raise ValueError("No input samples provided")
+for input_file in input_files:
+    if not os.path.exists(input_file):
+        raise ValueError(f"Input file not found: {input_file}")
🧰 Tools
🪛 Ruff (0.8.0)

25-25: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b150b26 and 0012ba0.

📒 Files selected for processing (2)
  • bio/ngsbits/samplesimilarity/test/Snakefile (1 hunks)
  • bio/ngsbits/samplesimilarity/wrapper.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • bio/ngsbits/samplesimilarity/test/Snakefile
🧰 Additional context used
📓 Path-based instructions (1)
bio/ngsbits/samplesimilarity/wrapper.py (2)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.


Pattern **/wrapper.py: Do not complain about use of undefined variable called snakemake.

📓 Learnings (1)
bio/ngsbits/samplesimilarity/wrapper.py (1)
Learnt from: tdayris
PR: snakemake/snakemake-wrappers#3500
File: bio/ngsbits/samplesimilarity/wrapper.py:0-0
Timestamp: 2024-11-26T09:23:03.628Z
Learning: In the `SampleSimilarity` wrapper (`wrapper.py`), the `input_files` parameter can contain multiple files. Adding quotes around `{input_files}` in the shell command should be avoided, as it may interfere with multiple inputs.
🪛 Ruff (0.8.0)
bio/ngsbits/samplesimilarity/wrapper.py

12-12: Undefined name snakemake

(F821)


16-16: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


25-25: Undefined name snakemake

(F821)

🔇 Additional comments (3)
bio/ngsbits/samplesimilarity/wrapper.py (3)

1-11: LGTM! Clean and well-documented header section.

The file header includes all necessary metadata and proper documentation.


12-15: LGTM! Proper logging configuration.

The logging setup captures both stdout and stderr, which is essential for debugging and monitoring tool execution.

🧰 Tools
🪛 Ruff (0.8.0)

12-12: Undefined name snakemake

(F821)


33-38: LGTM! Shell command is properly constructed.

The shell command follows the correct format and properly handles spaces in paths where needed. Notably, the input_files parameter is intentionally left unquoted to support multiple inputs, as per the provided learning note.

@tdayris tdayris changed the title feat: Ngsbits sample similarity feat: NGS-bits SampleSimilarity Nov 26, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
bio/ngsbits/samplesimilarity/meta.yaml (3)

1-4: Consider using a versioned documentation URL

The current documentation URL points to the master branch, which may change over time. Consider using a versioned/tagged URL for better stability and reproducibility.


6-6: Improve clarity of samples input specification

The current description combines multiple file formats. Consider splitting this for better clarity:

-  - samples: list of paths to vcf/vcf.gz files, or list of paths to bam/sam/cram files
+  - samples: |
+      list of paths to input files. Supported formats:
+      - VCF/VCF.GZ files
+      - BAM/SAM/CRAM files

9-10: Enhance output specification details

The current output description lacks information about the TSV format and its contents. Consider adding details about:

  • Expected columns in the TSV
  • Format of the similarity metrics
  • Any header information
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8756e2c and 58768d1.

📒 Files selected for processing (5)
  • bio/ngsbits/samplesimilarity/environment.linux-64.pin.txt (1 hunks)
  • bio/ngsbits/samplesimilarity/environment.yaml (1 hunks)
  • bio/ngsbits/samplesimilarity/meta.yaml (1 hunks)
  • bio/ngsbits/samplesimilarity/wrapper.py (1 hunks)
  • test_wrappers.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • bio/ngsbits/samplesimilarity/environment.linux-64.pin.txt
  • bio/ngsbits/samplesimilarity/environment.yaml
  • test_wrappers.py
🧰 Additional context used
📓 Path-based instructions (1)
bio/ngsbits/samplesimilarity/wrapper.py (2)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.


Pattern **/wrapper.py: Do not complain about use of undefined variable called snakemake.

📓 Learnings (1)
bio/ngsbits/samplesimilarity/wrapper.py (1)
Learnt from: tdayris
PR: snakemake/snakemake-wrappers#3500
File: bio/ngsbits/samplesimilarity/wrapper.py:0-0
Timestamp: 2024-11-26T09:23:03.628Z
Learning: In the `SampleSimilarity` wrapper (`wrapper.py`), the `input_files` parameter can contain multiple files. Adding quotes around `{input_files}` in the shell command should be avoided, as it may interfere with multiple inputs.
🪛 Ruff (0.8.0)
bio/ngsbits/samplesimilarity/wrapper.py

12-12: Undefined name snakemake

(F821)


16-16: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


25-25: Undefined name snakemake

(F821)

🔇 Additional comments (6)
bio/ngsbits/samplesimilarity/meta.yaml (1)

8-8: Rename 'regions' parameter for consistency

As suggested in the previous review, consider renaming 'regions' to maintain consistency with other wrappers.

bio/ngsbits/samplesimilarity/wrapper.py (5)

1-11: LGTM! Well-structured file header

The file header contains all necessary information including proper encoding, accurate docstring, complete author information, and appropriate imports.


12-15: LGTM! Proper logging configuration

The logging setup correctly captures both stdout and stderr using Snakemake's built-in formatter.

🧰 Tools
🪛 Ruff (0.8.0)

12-12: Undefined name snakemake

(F821)


16-24: LGTM! Robust parameter handling

The code properly handles optional parameters and uses the recommended :q modifier for shell-safe quoting of paths.

🧰 Tools
🪛 Ruff (0.8.0)

16-16: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


33-38: LGTM! Well-structured command construction

The shell command is properly formatted with:

  • Clear separation of arguments
  • Correct handling of input files (no quotes as they can be multiple)
  • Proper quoting of the output path

25-31: Verify handling of mixed file types

While the file type detection logic is correct, we should verify that the tool properly handles cases where input files might have mixed extensions.

✅ Verification successful

Based on the search results, I can now generate the final response:

Mixed file types are handled correctly by design

The current implementation is correct and safe because:

  • The code uses all() to ensure consistent file types, which means it will only set a specific mode if all input files match that type's extension.
  • If files have mixed extensions, none of the conditions will be true, and it will default to "gsvar" mode.
  • The test configuration in bio/ngsbits/samplesimilarity/test/Snakefile shows the tool is tested with consistent file types (VCF files).
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are any existing test cases with mixed file types
# and verify tool's behavior documentation for such cases

# Search for test files and their usage
echo "Searching for test files and their configurations..."
rg -l "test_ngsbits_samplesimilarity" test/

# Search for documentation about mixed file types
echo "Searching for documentation about mixed file types..."
fd -g "*.md" -g "*.rst" | xargs rg -i "samplesimilarity.*mixed|mixed.*samplesimilarity"

Length of output: 504


Script:

#!/bin/bash
# Let's try a different approach to find test cases and documentation

# Search for test files related to samplesimilarity in the entire repository
echo "Searching for test files..."
fd -t f "test.*samplesimilarity"

# Look for any documentation files
echo "Looking for documentation files..."
fd -e md -e rst

# Search for any files containing samplesimilarity configurations or tests
echo "Searching for samplesimilarity usage patterns..."
rg -A 5 "samplesimilarity" --type python

# Search specifically for input file handling in the tool
echo "Searching for input file handling..."
rg -A 5 "input.*\.(vcf|bam|gsvar)" --type python

Length of output: 859


Script:

#!/bin/bash
# Let's try without file type filtering since it's not supported

# Search for samplesimilarity usage and tests
echo "Searching for samplesimilarity patterns..."
rg -A 5 "samplesimilarity"

# Search for input file handling in Python files
echo "Searching for input file handling..."
fd -e py | xargs rg -A 5 "input.*\.(vcf|bam|gsvar)"

# Look for test configurations or snakefiles
echo "Searching for test configurations..."
fd "Snakefile|config.yaml" | xargs rg "samplesimilarity"

# Search for error handling related to input files
echo "Searching for error handling..."
fd -e py | xargs rg -A 5 "raise.*Error.*input|error.*input"

Length of output: 45111

🧰 Tools
🪛 Ruff (0.8.0)

25-25: Undefined name snakemake

(F821)

bio/ngsbits/samplesimilarity/meta.yaml Show resolved Hide resolved
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
bio/ngsbits/samplesimilarity/meta.yaml (3)

7-8: Clarify input format requirements

The input specification could be clearer about:

  1. VCF and BAM/SAM/CRAM being mutually exclusive input formats
  2. The relationship between input format and required parameters
-  - samples: list of paths to vcf/vcf.gz files, or list of paths to bam/sam/cram files
-  - ref: Optional path to reference genome index file (FAI). Required for CRAM input.
+  - samples: |
+      List of input files. Must be either:
+      - VCF/VCF.GZ files, or
+      - BAM/SAM/CRAM files (requires -mode bam)
+  - ref: |
+      Path to reference genome index file (FAI).
+      Required when using CRAM input files.

10-11: Document TSV output format

Consider adding details about the structure and content of the output TSV file to help users interpret the results.

 output:
-  - Path to output TSV results
+  - |
+    Path to output TSV results containing similarity metrics:
+    - Column 1: Sample1 name
+    - Column 2: Sample2 name
+    - Column 3+: Similarity metrics

1-13: Consider template for future NGS-bits wrappers

Since this is the first of potentially many NGS-bits wrappers, consider:

  1. Creating a common structure for database connection parameters
  2. Establishing consistent parameter naming across all NGS-bits wrappers
  3. Documenting shared requirements and dependencies

This will help maintain consistency as more NGS-bits tools are wrapped.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ce7a42b and 76e6bcc.

📒 Files selected for processing (1)
  • bio/ngsbits/samplesimilarity/meta.yaml (1 hunks)
🔇 Additional comments (3)
bio/ngsbits/samplesimilarity/meta.yaml (3)

1-5: LGTM! Basic metadata is well-structured.

The name, description, URL, and author information are clear and accurate.


9-9: Consider renaming 'regions' parameter for consistency

For consistency with other wrappers in the collection, consider renaming the regions parameter.


12-13: 🛠️ Refactor suggestion

Enhance parameters documentation

The parameters section needs more comprehensive documentation:

  1. Available modes and their purposes
  2. Common parameters
  3. Database connection parameters (mentioned in PR objectives)
 params:
-  - extra: Optional parameters besides IO, or `-mode`.
+  - mode: |
+      Analysis mode (required):
+      - vcf: Compare variants from VCF files
+      - bam: Compare coverage from BAM/SAM/CRAM files
+  - db_connection: |
+      Optional database connection parameters for NGSD:
+      - host: MySQL host
+      - user: MySQL user
+      - password: MySQL password
+      - database: NGSD database name
+  - extra: |
+      Additional optional parameters:
+      - min_overlap: Minimum overlap between samples (default: 10)
+      - min_af: Minimum allele frequency (default: 0.2)
+      See tool documentation for complete list.

Likely invalid or redundant comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
bio/ngsbits/samplesimilarity/wrapper.py (1)

21-24: Consider aligning variable name with input key for clarity.

While the implementation is correct, consider renaming the variable to match the input key for better readability:

-roi = snakemake.input.get("regions")
-if roi:
-    extra += f" -roi {roi:q}"
+regions = snakemake.input.get("regions")
+if regions:
+    extra += f" -roi {regions:q}"
🧰 Tools
🪛 Ruff (0.8.0)

21-21: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 76e6bcc and 53f38af.

📒 Files selected for processing (1)
  • bio/ngsbits/samplesimilarity/wrapper.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
bio/ngsbits/samplesimilarity/wrapper.py (2)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.


Pattern **/wrapper.py: Do not complain about use of undefined variable called snakemake.

📓 Learnings (1)
bio/ngsbits/samplesimilarity/wrapper.py (1)
Learnt from: tdayris
PR: snakemake/snakemake-wrappers#3500
File: bio/ngsbits/samplesimilarity/wrapper.py:0-0
Timestamp: 2024-11-26T09:23:03.628Z
Learning: In the `SampleSimilarity` wrapper (`wrapper.py`), the `input_files` parameter can contain multiple files. Adding quotes around `{input_files}` in the shell command should be avoided, as it may interfere with multiple inputs.
🪛 Ruff (0.8.0)
bio/ngsbits/samplesimilarity/wrapper.py

12-12: Undefined name snakemake

(F821)


16-16: Undefined name snakemake

(F821)


17-17: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


25-25: Undefined name snakemake

(F821)

🔇 Additional comments (5)
bio/ngsbits/samplesimilarity/wrapper.py (5)

1-11: LGTM! Clean and well-documented header section.

The file header is properly structured with all necessary metadata and minimal imports.


12-16: LGTM! Proper logging and parameter initialization.

The logging configuration captures both stdout and stderr, which is essential for debugging and monitoring tool execution.

🧰 Tools
🪛 Ruff (0.8.0)

12-12: Undefined name snakemake

(F821)


16-16: Undefined name snakemake

(F821)


17-20: LGTM! Proper reference file handling with shell quoting.

The implementation correctly handles reference files with proper shell quoting using the :q modifier.

🧰 Tools
🪛 Ruff (0.8.0)

17-17: Undefined name snakemake

(F821)


33-39: LGTM! Clean shell command construction.

The shell command is properly constructed with:

  • Correct spacing for readability
  • Proper handling of multiple input files (unquoted as required)
  • Shell quoting for output path

25-31: Verify handling of mixed input file types.

The implementation assumes all input files are of the same type. Let's verify this assumption is documented and handled appropriately.

✅ Verification successful

Based on the gathered information, I can now generate the final response:

Mixed input file types are handled appropriately through default mode

The implementation correctly handles file type consistency by checking all input files against specific extensions. If files are neither all VCF nor all BAM/SAM/CRAM, it defaults to GSVAR mode, which is the expected behavior. The test files demonstrate proper usage with consistent file types (VCF files), and there's no indication in the codebase that mixed file types should be supported.

  • The wrapper's behavior aligns with the tool's design, where input files should be of the same format for meaningful sample similarity comparison
  • Test files in bio/ngsbits/samplesimilarity/test/ demonstrate the expected usage with consistent file types (a.vcf and b.vcf)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases or documentation about mixed file types
rg -l "test.*samplesimilarity.*mixed" .
rg -l "samplesimilarity.*different.*format" .

# Check if there are any test files with mixed formats
fd -e vcf -e bam -e sam -e cram -e gsvar . test/

Length of output: 233


Script:

#!/bin/bash
# Search for test files and documentation in the entire repository
fd "test" .

# Look for any documentation about input formats
rg -i "samplesimilarity.*input.*format" .
rg -i "samplesimilarity.*file.*type" .

# Check the tool's documentation
fd "README|MANUAL|DOC" . -t f -i | grep -i "samplesimilarity" -B1 -A1

# Look for error handling related to input formats
rg "raise.*Error.*format" .
rg "sys\.exit.*format" .

Length of output: 18349


Script:

#!/bin/bash
# Look for test files and implementation details in the samplesimilarity directory
fd . bio/ngsbits/samplesimilarity/

# Check the content of test files if they exist
fd . bio/ngsbits/samplesimilarity/test/ -t f -x cat {}

# Look for error handling in the wrapper
rg "raise|sys\.exit" bio/ngsbits/samplesimilarity/wrapper.py

# Look for any documentation about input formats in the wrapper
rg -A 5 -B 5 "input.*files.*=.*snakemake" bio/ngsbits/samplesimilarity/wrapper.py

Length of output: 4415

🧰 Tools
🪛 Ruff (0.8.0)

25-25: Undefined name snakemake

(F821)

@fgvieira fgvieira merged commit 710597c into snakemake:master Dec 2, 2024
7 checks passed
johanneskoester pushed a commit that referenced this pull request Dec 6, 2024
🤖 I have created a release \*beep\* \*boop\*
---
##
[5.4.0](https://www.github.com/snakemake/snakemake-wrappers/compare/v5.3.0...v5.4.0)
(2024-12-06)


### Features

* NGS-bits SampleAncestry
([#3502](https://www.github.com/snakemake/snakemake-wrappers/issues/3502))
([8600d44](https://www.github.com/snakemake/snakemake-wrappers/commit/8600d44e79ae4dafa181d3b06eed6e3db0c7a2df))
* NGS-bits SampleSimilarity
([#3500](https://www.github.com/snakemake/snakemake-wrappers/issues/3500))
([710597c](https://www.github.com/snakemake/snakemake-wrappers/commit/710597cdc4e7f518d1fda2ec246bb6a7e0e29ba9))
* NGSCheckMate make pattern
([#3499](https://www.github.com/snakemake/snakemake-wrappers/issues/3499))
([3b96cc1](https://www.github.com/snakemake/snakemake-wrappers/commit/3b96cc18b5a7ce8643fc3f8a492333d8f339e4c5))
* Sex.DetERRmine
([#3497](https://www.github.com/snakemake/snakemake-wrappers/issues/3497))
([3919f2e](https://www.github.com/snakemake/snakemake-wrappers/commit/3919f2e4b6fae381cf92c921e6f17086819de345))


### Performance Improvements

* autobump bio/bbtools
([#3507](https://www.github.com/snakemake/snakemake-wrappers/issues/3507))
([19d027d](https://www.github.com/snakemake/snakemake-wrappers/commit/19d027d176bfa5da7ec2d75b9222547b0fd2b919))
* autobump bio/busco
([#3506](https://www.github.com/snakemake/snakemake-wrappers/issues/3506))
([aad4b56](https://www.github.com/snakemake/snakemake-wrappers/commit/aad4b56df0ca427a0a12dba67bcf0edef51d545b))
* autobump bio/busco
([#3519](https://www.github.com/snakemake/snakemake-wrappers/issues/3519))
([6af2e11](https://www.github.com/snakemake/snakemake-wrappers/commit/6af2e11535a6407a0a60eaa18f7e15548e4a4a01))
* autobump bio/encode_fastq_downloader
([#3521](https://www.github.com/snakemake/snakemake-wrappers/issues/3521))
([cbf06d2](https://www.github.com/snakemake/snakemake-wrappers/commit/cbf06d227d2cece579b1bdff413d09036e2d976f))
* autobump bio/freebayes
([#3509](https://www.github.com/snakemake/snakemake-wrappers/issues/3509))
([12b8b3c](https://www.github.com/snakemake/snakemake-wrappers/commit/12b8b3ce9d5be65a2165b4ca3e0403935b950237))
* autobump bio/gatk3/baserecalibrator
([#3523](https://www.github.com/snakemake/snakemake-wrappers/issues/3523))
([7a7518e](https://www.github.com/snakemake/snakemake-wrappers/commit/7a7518e63e0c6eac8d7cb935808b8692e4e688ff))
* autobump bio/gatk3/indelrealigner
([#3525](https://www.github.com/snakemake/snakemake-wrappers/issues/3525))
([a0d913c](https://www.github.com/snakemake/snakemake-wrappers/commit/a0d913ce81ceb45aeff0eeee8e1e92e63bda786c))
* autobump bio/gatk3/printreads
([#3524](https://www.github.com/snakemake/snakemake-wrappers/issues/3524))
([67af9a6](https://www.github.com/snakemake/snakemake-wrappers/commit/67af9a6899872a5e2e8cabc58572aa31d51a43cc))
* autobump bio/gatk3/realignertargetcreator
([#3522](https://www.github.com/snakemake/snakemake-wrappers/issues/3522))
([5f8ffe7](https://www.github.com/snakemake/snakemake-wrappers/commit/5f8ffe7349ab24d55d45e1811519b2b9e9985068))
* autobump bio/hifiasm
([#3510](https://www.github.com/snakemake/snakemake-wrappers/issues/3510))
([2b1b9f2](https://www.github.com/snakemake/snakemake-wrappers/commit/2b1b9f265231a3f3bc121dd9d4f34111b15d4486))
* autobump bio/mapdamage2
([#3526](https://www.github.com/snakemake/snakemake-wrappers/issues/3526))
([92da252](https://www.github.com/snakemake/snakemake-wrappers/commit/92da252bcafd05a0187b635b1938593aa4268c3b))
* autobump bio/mosdepth
([#3511](https://www.github.com/snakemake/snakemake-wrappers/issues/3511))
([762b273](https://www.github.com/snakemake/snakemake-wrappers/commit/762b273800120519ffd2bc2f670ae93be6187cac))
* autobump bio/mtnucratio
([#3512](https://www.github.com/snakemake/snakemake-wrappers/issues/3512))
([7f6a3b0](https://www.github.com/snakemake/snakemake-wrappers/commit/7f6a3b07cc2bae16fb30c097e783b70e87ad58f1))
* autobump bio/ngsbits/sampleancestry
([#3527](https://www.github.com/snakemake/snakemake-wrappers/issues/3527))
([2abf38c](https://www.github.com/snakemake/snakemake-wrappers/commit/2abf38c30e541dd45563ee7ec3959ccf43802fab))
* autobump bio/ngsbits/samplesimilarity
([#3529](https://www.github.com/snakemake/snakemake-wrappers/issues/3529))
([c91ce10](https://www.github.com/snakemake/snakemake-wrappers/commit/c91ce1075792f231d833a89298e88353c96973b1))
* autobump bio/ngscheckmate/makesnvpattern
([#3528](https://www.github.com/snakemake/snakemake-wrappers/issues/3528))
([ff9a81d](https://www.github.com/snakemake/snakemake-wrappers/commit/ff9a81d923b8efc807f0054400d03c6277777cb8))
* autobump bio/reference/ensembl-mysql-table
([#3513](https://www.github.com/snakemake/snakemake-wrappers/issues/3513))
([6b5c545](https://www.github.com/snakemake/snakemake-wrappers/commit/6b5c5454e86cfd393e5fa55b86566e60ef43dd5c))
* autobump bio/sexdeterrmine
([#3514](https://www.github.com/snakemake/snakemake-wrappers/issues/3514))
([2b18309](https://www.github.com/snakemake/snakemake-wrappers/commit/2b183092fd31225462490d43df5916e678ea5f83))
* autobump bio/spades/metaspades
([#3530](https://www.github.com/snakemake/snakemake-wrappers/issues/3530))
([070b9b6](https://www.github.com/snakemake/snakemake-wrappers/commit/070b9b62af3cc79f7a4fe1500963892631f8d752))
* autobump bio/varlociraptor/call-variants
([#3533](https://www.github.com/snakemake/snakemake-wrappers/issues/3533))
([8c563f1](https://www.github.com/snakemake/snakemake-wrappers/commit/8c563f18f02f3f06760f2f25ad85383f9d776b2d))
* autobump bio/varlociraptor/control-fdr
([#3532](https://www.github.com/snakemake/snakemake-wrappers/issues/3532))
([ce7d1b0](https://www.github.com/snakemake/snakemake-wrappers/commit/ce7d1b00bdf6a856ffe5ed6bac7258ca970cde4a))
* autobump bio/varlociraptor/estimate-alignment-properties
([#3531](https://www.github.com/snakemake/snakemake-wrappers/issues/3531))
([0b5ac04](https://www.github.com/snakemake/snakemake-wrappers/commit/0b5ac04792bcd54984ea6b0e6af41efa33fba126))
* autobump bio/varlociraptor/preprocess-variants
([#3534](https://www.github.com/snakemake/snakemake-wrappers/issues/3534))
([56a8933](https://www.github.com/snakemake/snakemake-wrappers/commit/56a8933de936e20e0068bd1d8cb6bbea3826f655))
* autobump bio/vep/annotate
([#3515](https://www.github.com/snakemake/snakemake-wrappers/issues/3515))
([2609900](https://www.github.com/snakemake/snakemake-wrappers/commit/26099008485dbf5a8054f3284a1892dd1245ac8a))
* autobump bio/vep/cache
([#3516](https://www.github.com/snakemake/snakemake-wrappers/issues/3516))
([f46427c](https://www.github.com/snakemake/snakemake-wrappers/commit/f46427c7951ad1a2ccd142b74698aac3009b2c66))
* autobump bio/vep/plugins
([#3535](https://www.github.com/snakemake/snakemake-wrappers/issues/3535))
([9a6ccc3](https://www.github.com/snakemake/snakemake-wrappers/commit/9a6ccc34ce5db38c419076d7f707321bf69357dc))
* autobump bio/vg/giraffe
([#3517](https://www.github.com/snakemake/snakemake-wrappers/issues/3517))
([6fffdd6](https://www.github.com/snakemake/snakemake-wrappers/commit/6fffdd6caec48f099b6140dd113a295f7202fa63))
* autobump utils/csvtk
([#3508](https://www.github.com/snakemake/snakemake-wrappers/issues/3508))
([41e8545](https://www.github.com/snakemake/snakemake-wrappers/commit/41e8545b5663c91d6b13b5acd2c126a6b42a9a92))
* autobump utils/csvtk
([#3520](https://www.github.com/snakemake/snakemake-wrappers/issues/3520))
([d4a52e5](https://www.github.com/snakemake/snakemake-wrappers/commit/d4a52e5e900aa16cad66ebbea1b22fb27d3709ea))
---


This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants