Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option to allow mapping to pangenome #274

Merged
merged 64 commits into from
Dec 5, 2024

Conversation

huzuner
Copy link
Contributor

@huzuner huzuner commented Nov 16, 2023

This PR allows an optional pangenome alignment with vg giraffe.
The following additions/changes are made:

  • A new option pangenome is added to the config.
  • A rule to download the vg pangenome index is added into ref.smk.
  • Six other rules for aligning reads to the pangenome and preprocessing the resulting bam were added, into mapping.smk.
  • All input functions are updated to get pangenome aligned bam upon activation of pangenome in the config file.

Summary by CodeRabbit

  • New Features

    • Introduced pangenome configuration options for genomic analysis, allowing for future enhancements in variant calling.
    • Added new workflow rules for pangenome data processing, including haplotype downloading, chromosome renaming, and improved read mapping with the VG tool.
    • Expanded event types in variant calling processes.
    • Implemented a script for renaming contigs in VCF files based on user-defined expressions.
  • Bug Fixes

    • Improved flexibility in aligner selection for RNA samples.
  • Chores

    • Updated dependency versions for varlociraptor and bcftools to ensure compatibility.
    • Created a new Conda environment configuration for pysam.

@huzuner huzuner changed the title add option to allow mapping to pangenome feat: add option to allow mapping to pangenome Nov 17, 2023
@huzuner
Copy link
Contributor Author

huzuner commented Feb 15, 2024

Bug with vg: alignments with deletions on ends: vgteam/vg#4204. They fixed the issue and the new release is coming on Feb. 26.

config/config.yaml Outdated Show resolved Hide resolved
config/config.yaml Outdated Show resolved Hide resolved
workflow/rules/common.smk Outdated Show resolved Hide resolved
workflow/rules/common.smk Outdated Show resolved Hide resolved
workflow/rules/filtering.smk Outdated Show resolved Hide resolved
workflow/rules/mapping.smk Outdated Show resolved Hide resolved
workflow/rules/mapping.smk Outdated Show resolved Hide resolved
workflow/rules/mapping.smk Outdated Show resolved Hide resolved
workflow/rules/ref.smk Outdated Show resolved Hide resolved
config/config.yaml Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (13)
.test/config-sra/config.yaml (2)

15-17: LGTM! Consider adding documentation for the pangenome configuration.

The structure and placement of the pangenome configuration is correct. However, consider adding comments to document:

  • The expected format and requirements for the VCF file
  • The current limitation to hprc-v1.0 pangenome version
  • Any implications of enabling this feature (e.g., UMI annotation requirements)

Add documentation like this:

  pangenome:
+   # Enable pangenome-based alignment using vg giraffe (currently supports hprc-v1.0 only)
    activate: false
+   # Path to pangenome VCF file (must match the reference genome build)
    vcf: ""

18-18: Remove trailing whitespace.

Line 18 contains trailing spaces which should be removed.

-    
+
🧰 Tools
🪛 yamllint (1.35.1)

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config_primers/config.yaml (1)

16-18: LGTM! Consider adding documentation.

The new pangenome configuration section is well-structured and safely disabled by default. However, it would benefit from documentation comments explaining:

  • The purpose and impact of enabling pangenome alignment
  • The expected format and requirements for the VCF file
  • Any limitations or considerations when using this feature with other workflow components

Consider adding documentation comments like this:

 ref:
   # Number of chromosomes to consider for calling.
   # The first n entries of the FASTA will be considered.
   n_chromosomes: 17
   # Ensembl species name
   species: saccharomyces_cerevisiae
   # Ensembl release (make sure to take one where snpeff data is available, check 'snpEff databases' output)
   release: 98
   snpeff_release: 86
   # Genome build
   build: R64-1-1
+  # Pangenome alignment configuration
+  # activate: Enable alignment against a pangenome using vg giraffe
+  # vcf: Path to the VCF file containing population variants for the pangenome
   pangenome:
     activate: false
     vcf: ""
.test/config-target-regions/config.yaml (1)

17-20: Minor formatting improvements needed.

Please address the following:

  1. Remove the trailing space on line 20
  2. Consider removing the empty line after the vcf entry
  3. Consider adding comments to document the purpose and expected format of the vcf field

Example:

  pangenome:
    activate: false  # Enable pangenome alignment using vg giraffe
    vcf: ""         # Path to pangenome haplotypes in VCF format
🧰 Tools
🪛 yamllint (1.35.1)

[error] 20-20: trailing spaces

(trailing-spaces)

.test/config-target-regions/config_multiple_beds.yaml (2)

22-22: Remove trailing whitespace.

There is unnecessary whitespace at the end of line 22.

-    vcf: ""
-    
+    vcf: ""
🧰 Tools
🪛 yamllint (1.35.1)

[error] 22-22: trailing spaces

(trailing-spaces)


19-22: Consider adding documentation comments for known limitations.

Based on the PR discussion, it would be helpful to document known limitations and requirements when using pangenome alignment:

  1. Potential race conditions during parallel mapping
  2. Current support limited to hprc-v1.0
  3. Requirements for UMI annotation and read sorting
   pangenome:
+    # Enable pangenome alignment using vg giraffe
+    # Note: 
+    # - Currently supports only hprc-v1.0
+    # - Parallel mapping may cause race conditions
+    # - UMI annotation requires queryname sorting
     activate: false
+    # Path to VCF file containing pangenome haplotypes
     vcf: ""
🧰 Tools
🪛 yamllint (1.35.1)

[error] 22-22: trailing spaces

(trailing-spaces)

.test/config-no-candidate-filtering/config.yaml (2)

15-17: Consider adding documentation and validation for pangenome configuration

The new pangenome configuration structure looks good, but given the complexity and potential issues mentioned in the PR discussion, consider:

  1. Adding a comment documenting:
    • Supported pangenome versions (e.g., hprc-v1.0)
    • Known limitations with UMI annotation and primer assignment
    • Requirements for chromosome naming consistency
  2. Adding schema validation for the VCF path when activated

Add documentation above the section:

  build: R64-1-1
+  # Pangenome alignment configuration
+  # Currently supports hprc-v1.0
+  # Note: UMI annotation and primer assignment require additional post-processing
+  # Ensure chromosome names in VCF match the reference build
  pangenome:
    activate: false
    vcf: ""

18-18: Remove trailing whitespace

There are trailing spaces on line 18.

-    vcf: ""
-    
+    vcf: ""
+
🧰 Tools
🪛 yamllint (1.35.1)

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config-chm-eval/config.yaml (2)

15-17: LGTM! Safe default configuration for pangenome feature.

The new pangenome configuration is well-structured with safe defaults:

  • Disabled by default (activate: false)
  • Empty VCF path prevents accidental usage

Fix the trailing spaces after false on line 16.

  pangenome:
-    activate: false  
+    activate: false
    vcf: ""
🧰 Tools
🪛 yamllint (1.35.1)

[error] 16-16: trailing spaces

(trailing-spaces)


15-17: Consider documenting known limitations in configuration comments.

Based on the PR discussion, there are several important considerations that should be documented in the configuration:

  1. Race conditions: vg giraffe creates auxiliary files during execution which can cause issues with parallel sample processing
  2. UMI annotation requirements: The tool needs specific BAM sorting (queryname) for UMI processing
  3. Version compatibility: Currently only supports hprc-v1.0 due to chromosome annotation differences

Consider adding these as YAML comments to help users avoid potential issues.

Example addition:

  pangenome:
+    # Note: Known limitations
+    # - Parallel processing of multiple samples may cause race conditions
+    # - UMI annotation requires queryname-sorted BAM files
+    # - Currently only compatible with hprc-v1.0 pangenome
    activate: false
    vcf: ""
🧰 Tools
🪛 yamllint (1.35.1)

[error] 16-16: trailing spaces

(trailing-spaces)

.test/config-giab/config.yaml (1)

16-18: Add documentation and validation for pangenome configuration.

While the configuration structure is good, consider the following improvements:

  1. Document the required VCF format and version compatibility (e.g., hprc-v1.0)
  2. Add validation to ensure vcf path is provided when activate is true
  3. Consider adding version constraints or supported pangenome builds to prevent issues with chromosome annotations
   pangenome:
     activate: false
-    vcf: ""
+    vcf: ""  # Path to pangenome VCF file (required when activate: true)
+    # Supported versions: hprc-v1.0
+    # Format requirements:
+    # - Must contain chromosome annotations compatible with GRCh38
+    # - Must be indexed (.vcf.gz and .vcf.gz.tbi)
workflow/rules/ref.smk (2)

177-189: Add VCF index creation for better compatibility.

Consider creating an index for the renamed VCF file to ensure compatibility with tools that expect indexed VCFs.

 rule rename_haplotype_chroms:
     input:
         vcf="resources/{pangenome}.vcf.gz",
         tsv="resources/chrom_renames.tsv",
     output:
         temp("resources/{pangenome}.renamed.vcf.gz"),
+        temp("resources/{pangenome}.renamed.vcf.gz.tbi"),
     log:
         "logs/pangenome/{pangenome}_renamed.log",
     conda:
         "../envs/bcftools.yaml"
     shell:
-        "bcftools annotate --rename-chrs {input.tsv} {input.vcf} -Oz -o {output} 2> {log}"
+        """
+        bcftools annotate --rename-chrs {input.tsv} {input.vcf} -Oz -o {output[0]} && \
+        bcftools index -t {output[0]} 2>> {log}
+        """

191-206: Address architectural concerns with vg tool integration.

Several issues mentioned in the PR discussion need attention:

  1. Differences in chromosome annotations between versions
  2. Potential failures in varlociraptor preprocess due to CIGARs with leading deletions
  3. Limited support for versions other than hprc-v1.0

Consider:

  1. Adding version validation in the config
  2. Implementing CIGAR string validation/cleanup
  3. Adding explicit support for different pangenome versions

Example config validation:

def validate_pangenome_config(config):
    if config["ref"]["pangenome"]["activate"]:
        version = config["ref"]["pangenome"].get("version")
        if version != "hprc-v1.0":
            raise ValueError(f"Pangenome version {version} is not fully supported yet")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 07e6d60 and 1167fc8.

📒 Files selected for processing (10)
  • .test/config-chm-eval/config.yaml (2 hunks)
  • .test/config-giab/config.yaml (2 hunks)
  • .test/config-no-candidate-filtering/config.yaml (1 hunks)
  • .test/config-simple/config.yaml (1 hunks)
  • .test/config-sra/config.yaml (1 hunks)
  • .test/config-target-regions/config.yaml (1 hunks)
  • .test/config-target-regions/config_multiple_beds.yaml (1 hunks)
  • .test/config_primers/config.yaml (1 hunks)
  • workflow/rules/common.smk (6 hunks)
  • workflow/rules/ref.smk (1 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
.test/config-chm-eval/config.yaml

[error] 16-16: trailing spaces

(trailing-spaces)

.test/config-no-candidate-filtering/config.yaml

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config-simple/config.yaml

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config-sra/config.yaml

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config-target-regions/config.yaml

[error] 20-20: trailing spaces

(trailing-spaces)

.test/config-target-regions/config_multiple_beds.yaml

[error] 22-22: trailing spaces

(trailing-spaces)

🔇 Additional comments (18)
.test/config_primers/config.yaml (1)

16-18: Verify compatibility with primer trimming and UMI annotation.

The configuration enables both pangenome alignment and primer trimming. Based on the PR comments, there are known issues with UMI annotation when using vg giraffe:

  1. The reads need to be name sorted
  2. The MC entry is missing from BAM records
  3. Race conditions may occur during parallel mapping

Consider documenting these limitations in the configuration file or implementing safeguards to prevent invalid combinations.

Let's verify if there are any other configuration files with similar settings:

Also applies to: 21-26

.test/config-target-regions/config.yaml (1)

17-19: LGTM! The pangenome configuration structure looks good.

The placement under the ref section and the default values (activate: false, empty VCF) are appropriate for a test configuration.

.test/config-target-regions/config_multiple_beds.yaml (1)

19-21: LGTM! The pangenome configuration structure is well-organized.

The placement under the ref section and the default values are appropriate for a test configuration.

.test/config-simple/config.yaml (2)

15-17: LGTM: Pangenome configuration looks good.

The new pangenome configuration section is well-structured and follows the expected pattern with sensible defaults:

  • activate: false ensures opt-in behavior
  • Empty VCF string allows for configuration flexibility

18-18: Remove trailing space.

There's a trailing space on this line, which should be removed to adhere to best practices and avoid potential issues with version control or YAML parsers.

Apply this change to remove the trailing space:

-    
+
🧰 Tools
🪛 yamllint (1.35.1)

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config-chm-eval/config.yaml (1)

187-187: LGTM! Simple formatting improvement.

The added newline improves YAML file formatting consistency.

.test/config-giab/config.yaml (1)

160-160: LGTM!

The added newline improves readability by properly separating configuration sections.

workflow/rules/ref.smk (2)

151-159: Enhance download robustness and add integrity checks.

The current implementation could be improved in several ways:

  1. Add retry mechanism for network resilience
  2. Verify file integrity after download
  3. Add caching to prevent repeated downloads

162-174: Fix rule name typo and improve chromosome name handling.

The current implementation makes assumptions about chromosome naming patterns and could be more robust.

workflow/rules/common.smk (9)

33-33: Variable pangenome_name correctly defined

The variable pangenome_name is appropriately constructed using species and build.


425-428: Function get_markduplicates_input correctly uses get_aligner

The function appropriately uses get_aligner to determine the aligner, improving consistency and reducing code duplication.


436-446: Function get_recalibrate_quality_input logic is correctly implemented

The function correctly determines the input path based on the sample datatype and the activation of calc_consensus_reads.


448-453: Function get_consensus_input correctly handles primer trimming

The function accurately returns the input path based on whether primer trimming is activated.


456-464: Function get_trimming_input correctly handles duplicate removal and aligns with get_aligner

The function appropriately determines the input path based on the activation of duplicate removal and uses get_aligner to specify the aligner.


635-639: Function get_vg_read_group correctly formats read group options

The function accurately constructs the read group string for the vg aligner, ensuring that all placeholders are correctly filled.


654-658: Function get_add_readgroup_input correctly handles UMI samples for vg aligner

The function properly returns the input BAM file path based on whether the sample has UMIs when using the vg aligner.


711-723: Function get_annotated_bcf accurately constructs the BCF file path

The function correctly composes the file path for the annotated BCF, handling optional index extension and selection annotations.


264-270: ⚠️ Potential issue

Potential logical error in get_aligner function

The return "bwa" statement at line 270 is not under an else clause, which means it will always execute, overriding previous return statements. This could cause the function to always return "bwa" regardless of the conditions.

Apply this diff to fix the indentation and logic:

 def get_aligner(wildcards):
     if get_sample_datatype(wildcards.sample) == "rna":
         return "star"
     elif is_activated("ref/pangenome"):
         return "vg"
-    return "bwa"
+    else:
+        return "bwa"

Likely invalid or redundant comment.

workflow/rules/ref.smk Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)

655-660: Consider making the function more generic.

While the function works correctly, it's currently hardcoded for the vg aligner. Consider making it more generic to handle other aligners in the future.

 def get_add_readgroup_input(wildcards):
+    aligner = get_aligner(wildcards)
     if sample_has_umis(wildcards.sample):
-        return "results/mapped/vg/{sample}.annotated.bam"
+        return f"results/mapped/{aligner}/{{sample}}.annotated.bam"
     else:
-        return "results/mapped/vg/{sample}.mate_fixed.bam"
+        return f"results/mapped/{aligner}/{{sample}}.mate_fixed.bam"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1167fc8 and 632e136.

📒 Files selected for processing (4)
  • workflow/envs/varlociraptor.yaml (1 hunks)
  • workflow/rules/common.smk (6 hunks)
  • workflow/rules/mapping.smk (3 hunks)
  • workflow/rules/ref.smk (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/rules/ref.smk
🔇 Additional comments (12)
workflow/envs/varlociraptor.yaml (1)

6-6: Verify if varlociraptor 8.4.12 addresses alignment edge cases.

The version bump follows semantic versioning principles and maintains compatibility. However, given the PR objectives mention issues with alignments containing deletions, let's verify if this version addresses those concerns.

✅ Verification successful

Version bump addresses alignment edge cases with leading deletions

The version bump to varlociraptor 8.4.12 specifically fixes the handling of leading deletions in CIGAR strings through an update to rust-htslib 0.48, as confirmed by the release notes. This directly addresses the alignment issues mentioned in the PR objectives.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the changelog or release notes for varlociraptor 8.4.12
# to verify if it addresses alignment edge cases.

# Check the latest releases and their descriptions
gh api graphql -f query='
{
  repository(owner: "varlociraptor", name: "varlociraptor") {
    releases(first: 5, orderBy: {field: CREATED_AT, direction: DESC}) {
      nodes {
        tagName
        description
        createdAt
      }
    }
  }
}'

# Check for related issues or PRs
gh api graphql -f query='
{
  search(query: "repo:varlociraptor/varlociraptor is:closed label:bug mentions:alignment", type: ISSUE, first: 5) {
    nodes {
      ... on Issue {
        title
        url
        closedAt
      }
      ... on PullRequest {
        title
        url
        closedAt
      }
    }
  }
}'

Length of output: 3023

workflow/rules/mapping.smk (5)

Line range hint 1-17: Make thread count configurable

The thread count is hardcoded to 8. Consider making it configurable via the config file for better resource management.


39-51: Make thread count configurable

The thread count is hardcoded to 8. Consider making it configurable via the config file.


Line range hint 97-106: LGTM! Good aligner abstraction

The rule has been properly modified to handle different aligners using wildcards, making it flexible and maintainable.


54-69: Verify read group parameters

The rule uses get_vg_read_group function for read group parameters. Ensure that all required read group fields (ID, LB, PL, PU, SM) are properly set to avoid issues with downstream tools like GATK.

✅ Verification successful

Read group parameters are properly configured

The get_vg_read_group function correctly sets all required read group fields:

  • RGID (unique identifier): set to sample name
  • RGLB (library): set to "lib1"
  • RGPL (platform): dynamically set from sample's platform column
  • RGPU (platform unit): set to sample name
  • RGSM (sample): set to sample name
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if get_vg_read_group function exists and provides all required fields

# Test: Search for the function definition
rg -A 5 "def get_vg_read_group"

# Test: Check if all required read group fields are present
rg -A 10 "RGID|RGLB|RGPL|RGPU|RGSM" "workflow/rules/common.smk"

Length of output: 915


18-20: Verify potential race conditions in VG index creation

The comment warns about race conditions during index creation. This needs verification to ensure proper handling.

✅ Verification successful

Let me try another approach to verify this by checking the actual workflow files and vg wrapper for more context about the indexing behavior.


Let me try one more time without specifying the file type for ripgrep.


Let me check for the vg autoindex rule that's referenced in the map_reads_vg input section to understand the complete picture.


Based on the available information, I can now provide a final response about the race condition concern:

Race condition warning is valid and properly handled

The workflow correctly addresses potential race conditions by:

  1. Separating index creation into a prerequisite rule (vg_autoindex)
  2. Making the index a required input for the map_reads_vg rule
  3. Using Snakemake's DAG-based execution which ensures the index is created before any mapping starts

This design prevents concurrent index creation when multiple samples are processed in parallel, as the index will be fully created before any mapping jobs begin.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if vg creates any additional files during execution
# that might cause race conditions

# Test: Search for vg giraffe documentation or issues about race conditions
gh api graphql -f query='
{
  search(query: "repo:vgteam/vg is:issue race condition giraffe", type: ISSUE, first: 5) {
    nodes {
      ... on Issue {
        title
        url
        state
      }
    }
  }
}
'

Length of output: 248


Script:

#!/bin/bash
# Check the vg wrapper and workflow files for index creation patterns
rg -A 5 "vg index|vg giraffe" --type snakemake

# Also check for any distance/minimizer index related patterns
rg -A 5 "distance.*index|minimizer.*index" --type snakemake

# Look at the full rule implementation
ast-grep --pattern 'rule map_reads_vg:
  $$$'

Length of output: 226


Script:

#!/bin/bash
# Check the workflow files for index creation and related rules
rg -A 10 "rule.*vg" workflow/

# Look for any index creation or distance/minimizer patterns
rg -A 5 "distance.*index|minimizer.*index" workflow/

# Get the full content of the mapping.smk file to see the complete rule
cat workflow/rules/mapping.smk

Length of output: 7391


Script:

#!/bin/bash
# Search for vg_autoindex rule definition
rg -A 10 "rule vg_autoindex" workflow/

# Also check any other files that might contain the vg index configuration
fd "vg" workflow/ -x cat {}

Length of output: 70

workflow/rules/common.smk (6)

33-34: LGTM: Consistent naming convention for pangenome variables.

The new pangenome variables follow the established naming pattern and are properly constructed.


265-271: LGTM: Well-structured aligner selection function.

The function properly encapsulates the aligner selection logic with clear precedence:

  1. RNA samples → star
  2. DNA samples with pangenome → vg
  3. Default → bwa

426-433: LGTM: Proper integration with get_aligner function.

The function correctly handles UMI annotation for different aligners and uses the new get_aligner function.


437-447: LGTM: Clear logic for quality recalibration input selection.

The function properly handles:

  • File extension selection (bam/bai)
  • RNA vs DNA sample processing
  • Configuration-based processing paths

449-465: LGTM: Consistent updates to input selection functions.

Both functions are properly updated to:

  • Handle BAM index files
  • Use the new aligner selection logic
  • Maintain consistent code structure

636-640: LGTM: Proper read group formatting for vg aligner.

The function correctly formats read group information according to vg requirements while maintaining consistency with existing read group functions.

workflow/rules/mapping.smk Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
.test/config-no-candidate-filtering/config.yaml (2)

18-18: Remove trailing whitespace.

There's a trailing space on line 18.

-    vcf: "" 
+    vcf: ""
🧰 Tools
🪛 yamllint (1.35.1)

[error] 18-18: trailing spaces

(trailing-spaces)


15-18: Consider adding documentation about known limitations.

Based on the PR discussion, there are known limitations with pangenome alignment:

  • Race conditions during parallel mapping
  • UMI annotation requiring specific BAM sorting
  • Issues with certain chromosome annotations
  • Potential problems with CIGARs containing leading deletions

Consider adding these as comments in the configuration to help users avoid potential issues.

   pangenome:
+    # Known limitations:
+    # - Parallel mapping may cause race conditions
+    # - UMI annotation requires specific BAM sorting
+    # - Only supports hprc-v1.0 due to chromosome annotation differences
+    # - CIGARs with leading deletions may cause preprocessing failures
     activate: false
     vcf: ""
🧰 Tools
🪛 yamllint (1.35.1)

[error] 18-18: trailing spaces

(trailing-spaces)

config/config.yaml (1)

28-36: Consider adding version compatibility checks and addressing potential race conditions

While the configuration structure looks good, there are a few important considerations:

  1. The PR comments indicate potential race conditions when vg giraffe creates auxiliary files during parallel execution.
  2. The configuration might benefit from version compatibility checks since there are known differences between v1.0 and v1.1 chromosome annotations.

Consider adding:

  1. A configuration option to specify the working directory for VG auxiliary files to prevent race conditions
  2. Version compatibility checks in the workflow to ensure the pangenome version matches the reference genome build
  3. A validation step to verify chromosome annotations compatibility

Remove the trailing space on line 28.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 28-28: trailing spaces

(trailing-spaces)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 632e136 and 9e3e718.

📒 Files selected for processing (6)
  • .test/config-no-candidate-filtering/config.yaml (1 hunks)
  • .test/config-simple/config.yaml (1 hunks)
  • .test/config-sra/config.yaml (1 hunks)
  • .test/config-target-regions/config.yaml (1 hunks)
  • .test/config-target-regions/config_multiple_beds.yaml (1 hunks)
  • config/config.yaml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • .test/config-simple/config.yaml
  • .test/config-sra/config.yaml
  • .test/config-target-regions/config.yaml
  • .test/config-target-regions/config_multiple_beds.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
.test/config-no-candidate-filtering/config.yaml

[error] 18-18: trailing spaces

(trailing-spaces)

config/config.yaml

[error] 28-28: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)
.test/config-no-candidate-filtering/config.yaml (1)

15-17: LGTM! The pangenome configuration structure is well-designed.

The new configuration follows the established pattern of other toggleable features in the file, with appropriate default values for testing.

config/config.yaml (1)

169-169: Document the purpose and implications of the 'present' event type

The addition of the 'present' event type to varlociraptor events needs documentation to explain:

  1. Its specific purpose in the context of pangenome alignment
  2. How it differs from or complements the existing somatic events
  3. Any specific considerations when using it with pangenome data

Let's check if this event type is documented in the codebase:

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (6)
.test/config-no-candidate-filtering/config.yaml (1)

18-18: Remove trailing whitespace

There is a trailing space on line 18 that should be removed for consistency.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 18-18: trailing spaces

(trailing-spaces)

.test/config-chm-eval/config.yaml (1)

16-16: Remove trailing whitespace

There is a trailing space after activate: false that should be removed for consistency.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 16-16: trailing spaces

(trailing-spaces)

config/config.yaml (2)

28-28: Remove trailing whitespace

There is a trailing space at the end of line 28.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 28-28: trailing spaces

(trailing-spaces)


170-170: Document the 'present' event type

The newly added 'present' event type in varlociraptor events lacks documentation explaining its purpose and impact on variant calling.

Add a comment explaining when this event type should be used and its implications.

workflow/rules/common.smk (2)

33-34: Consider adding validation for pangenome configuration.

While the naming follows the established pattern, consider adding validation for the pangenome configuration similar to how the genome configuration is validated using the schema.


644-648: Consider parameterizing the aligner name.

The function hardcodes "vg" as the aligner. Consider making it more flexible by using the get_aligner function:

 def get_add_readgroup_input(wildcards):
+    aligner = get_aligner(wildcards)
     if sample_has_umis(wildcards.sample):
-        return "results/mapped/vg/{sample}.annotated.bam"
+        return f"results/mapped/{aligner}/{{sample}}.annotated.bam"
     else:
-        return "results/mapped/vg/{sample}.mate_fixed.bam"
+        return f"results/mapped/{aligner}/{{sample}}.mate_fixed.bam"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9e3e718 and 3fc3448.

📒 Files selected for processing (10)
  • .test/config-chm-eval/config.yaml (2 hunks)
  • .test/config-giab/config.yaml (2 hunks)
  • .test/config-no-candidate-filtering/config.yaml (1 hunks)
  • .test/config-simple/config.yaml (1 hunks)
  • .test/config-sra/config.yaml (1 hunks)
  • .test/config-target-regions/config.yaml (1 hunks)
  • .test/config-target-regions/config_multiple_beds.yaml (1 hunks)
  • .test/config_primers/config.yaml (1 hunks)
  • config/config.yaml (2 hunks)
  • workflow/rules/common.smk (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • .test/config-simple/config.yaml
  • .test/config-sra/config.yaml
  • .test/config-target-regions/config_multiple_beds.yaml
  • .test/config_primers/config.yaml
  • .test/config-giab/config.yaml
  • .test/config-target-regions/config.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
config/config.yaml

[error] 28-28: trailing spaces

(trailing-spaces)

.test/config-chm-eval/config.yaml

[error] 16-16: trailing spaces

(trailing-spaces)

.test/config-no-candidate-filtering/config.yaml

[error] 18-18: trailing spaces

(trailing-spaces)

🔇 Additional comments (5)
.test/config-no-candidate-filtering/config.yaml (1)

15-17: LGTM: Pangenome configuration structure is correct

The configuration structure aligns with the PR objectives, with appropriate default values for testing.

config/config.yaml (1)

28-36: LGTM: Well-documented pangenome configuration

The pangenome configuration is well-documented with:

  • Clear explanation of the feature's purpose
  • Important limitations (homo_sapiens only)
  • Reference to graph resources documentation
  • Secure HTTPS URL for the VCF file
  • Warning about reference genome compatibility
🧰 Tools
🪛 yamllint (1.35.1)

[error] 28-28: trailing spaces

(trailing-spaces)

workflow/rules/common.smk (3)

268-269: LGTM! Clean implementation of pangenome aligner selection.

The aligner selection logic correctly handles the pangenome case while maintaining backward compatibility.


625-629: LGTM! Well-structured read group handling for VG.

The implementation follows the established pattern and includes all necessary read group fields according to SAM/BAM specifications.


701-713: LGTM! Clean implementation of BCF annotation handling.

The function properly handles index generation and uses consistent string formatting.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
workflow/rules/mapping.smk (2)

Line range hint 1-17: Make thread count configurable

The thread count is hardcoded to 8. Consider making it configurable via the config file for better resource management across different environments.

-    threads: 8
+    threads: config["resources"]["bwa_threads"]

47-48: Configure thread count for fix_mate rule

The thread count is hardcoded to 8. Consider making it configurable.

-    threads: 8
+    threads: config["resources"]["samtools_fixmate_threads"]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9ff8128 and 4c3c9d7.

📒 Files selected for processing (2)
  • workflow/envs/varlociraptor.yaml (1 hunks)
  • workflow/rules/mapping.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/envs/varlociraptor.yaml
🔇 Additional comments (4)
workflow/rules/mapping.smk (4)

64-67: LGTM! Good resource management

The memory resource is properly configured using mem_mb parameter, which is good practice for resource management.


Line range hint 97-106: LGTM! Well-structured UMI annotation

The rule is properly adapted to handle different aligners and uses pipes for efficient processing.


18-20: Verify race condition prevention

The comment warns about potential race conditions when creating indices. Let's verify if this is properly handled.

✅ Verification successful

Race condition prevention is properly implemented

The implementation correctly prevents race conditions for index creation:

  • The map_reads_vg rule explicitly depends on vg_autoindex rule's output through its input declaration
  • The vg_autoindex rule creates all required indices (.gbz, .dist, .min) as a single atomic operation
  • Snakemake's DAG ensures the index files are created before any mapping starts

This means the comment about race conditions is accurate and the implementation properly addresses the concern.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if multiple vg giraffe processes might run simultaneously
rg -A 2 'rule.*map_reads_vg' . | grep -A 2 'input.*vg_autoindex'

Length of output: 66


Script:

#!/bin/bash
# Let's check the complete map_reads_vg rule implementation
rg -A 20 'rule map_reads_vg' .

Length of output: 1138


Script:

#!/bin/bash
# Let's check the vg_autoindex rule to understand the dependency chain
rg -A 20 'rule vg_autoindex' .

Length of output: 675


30-34: 🛠️ Refactor suggestion

Make sorting parameters configurable

The sorting parameters and thread count are hardcoded. Consider making these configurable via the config file.

    params:
-       extra="",
-       sorting="fgbio",
-       sort_order="queryname",
+       extra=config["params"]["vg"]["giraffe"]["extra"],
+       sorting=config["params"]["vg"]["giraffe"]["sorting"],
+       sort_order=config["params"]["vg"]["giraffe"]["sort_order"],
-    threads: 8
+    threads: config["resources"]["vg_giraffe_threads"]

Likely invalid or redundant comment.

config/config.yaml Outdated Show resolved Hide resolved
workflow/rules/ref.smk Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (2)
config/config.yaml (1)

28-42: Fix YAML formatting and enhance documentation.

The pangenome configuration section has formatting issues and could benefit from improved documentation.

-  pangenome: 
+  pangenome:
     # if active, reads will be aligned to given pangenome instead of to the linear reference genome
     # Important: this is only supported for homo_sapiens so far
     activate: true
     # URL to pangenome haplotypes (vcf-file)
     # Graph resources v1.1: https://github.com/human-pangenomics/hpp_pangenome_resources/
     # Graph resources v1.0: https://github.com/human-pangenomics/hpp_pangenome_resources/blob/main/hprc-v1.0-mc.md
     # Important: ensure that the haplotype vcf is built against the same reference genome as specified above under build
     vcf: https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.raw.vcf.gz
-    # optional: In case contigs of the hoplotype vcf need to be renamed
+    # optional: In case contigs of the haplotype vcf need to be renamed
     # a list of python expressions can be defined.
     # Contigs can be accessed by `contig`.
     # Each expression will be successively saved into `renamed_contig`
     rename_expressions:
+      # Example: Convert chrM to MT and remove 'chr' prefix from other contigs
       - "MT" if contig == "chrM" else contig[3:]
🧰 Tools
🪛 yamllint (1.35.1)

[error] 28-28: trailing spaces

(trailing-spaces)


[error] 42-42: syntax error: expected , but found ''

(syntax)

workflow/envs/pysam.yaml (1)

6-6: Add newline at end of file.

Add a newline character at the end of the file to follow YAML best practices.

   - python =3.12
-  - pysam =0.22
+  - pysam =0.22
+
🧰 Tools
🪛 yamllint (1.35.1)

[error] 6-6: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4c3c9d7 and 322811f.

📒 Files selected for processing (5)
  • config/config.yaml (2 hunks)
  • workflow/envs/pysam.yaml (1 hunks)
  • workflow/rules/common.smk (7 hunks)
  • workflow/rules/ref.smk (1 hunks)
  • workflow/scripts/rename_contigs.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
workflow/scripts/rename_contigs.py

3-3: Undefined name sys

(F821)


3-3: Use a context manager for opening files

(SIM115)


3-3: Undefined name snakemake

(F821)


5-5: Undefined name snakemake

(F821)


7-7: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


13-13: Undefined name snakemake

(F821)

🪛 yamllint (1.35.1)
config/config.yaml

[error] 28-28: trailing spaces

(trailing-spaces)


[error] 42-42: syntax error: expected , but found ''

(syntax)

workflow/envs/pysam.yaml

[error] 6-6: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (10)
workflow/rules/ref.smk (2)

191-202: ⚠️ Potential issue

Prevent race conditions in parallel execution.

Based on the PR discussion, vg giraffe creates several files during execution which can lead to race conditions.

Consider implementing a locking mechanism:

 rule vg_autoindex:
     input:
         ref=genome,
         vcf=get_vg_autoindex_vcf(),
     output:
         multiext(pangenome_prefix, ".giraffe.gbz", ".dist", ".min"),
+        lock=temp(f"{pangenome_prefix}.lock"),
     log:
         "logs/vg/autoindex/pangenome.log",
     cache: "omit-software"
     threads: max(workflow.cores, 1)
     shell:
-        "vg autoindex -w giraffe -p {params.prefix} -r {input.ref} -v {input.vcf} -t {threads} 2> {log}"
+        """
+        (
+            while ! mkdir {output.lock} 2>/dev/null; do sleep 1; done
+            trap 'rm -rf {output.lock}' EXIT
+            
+            vg autoindex -w giraffe -p {params.prefix} \
+                -r {input.ref} -v {input.vcf} -t {threads} 2> {log}
+        )
+        """

151-159: 🛠️ Refactor suggestion

Enhance download robustness and add integrity checks.

The current implementation could be improved with retry mechanism and integrity checks.

 rule get_pangenome_haplotypes:
     output:
         temp(f"{pangenome_prefix}.vcf.gz"),
     params:
         url=config["ref"]["pangenome"]["vcf"],
     log:
         "logs/pangenome/haplotypes.log",
+    cache: "omit-software"
     shell:
-        "curl -o {output} {params.url} 2> {log}"
+        """
+        (wget --tries=3 --retry-connrefused --waitretry=5 \
+         -O {output}.tmp {params.url} && \
+         mv {output}.tmp {output} || \
+         (rm -f {output}.tmp && exit 1)) 2> {log}
+        """

Likely invalid or redundant comment.

config/config.yaml (1)

176-176: Verify compatibility with the 'present' event type.

The addition of the 'present' event type needs verification for compatibility with varlociraptor.

✅ Verification successful

Based on the examination of the codebase, particularly the config.yaml file and related files, I can now provide a final response:

The 'present' event type is a valid and supported varlociraptor event.

The config.yaml file shows that 'present' is explicitly used as a varlociraptor event in the FDR control configuration section. This is demonstrated in the following context:

varlociraptor:
  # Add varlociraptor events to aggregated over.
  # The probability for the union of these events is used for controlling
  # the FDR.
  - present
  - somatic_tumor_high
  - somatic_tumor_medium

The event is used alongside other standard varlociraptor events like 'somatic_tumor_high' and 'somatic_tumor_medium', indicating it's a supported event type in the varlociraptor ecosystem.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check varlociraptor documentation for supported event types

# Search for documentation about supported event types
rg -A 5 "supported event types|event types.*supported" 

# Check if 'present' event type is used in existing rules
rg -A 5 "event.*present" 

Length of output: 83


Script:

#!/bin/bash
# Let's try different search patterns and look into configuration files

# Search for event type definitions or documentation
rg -i "event.*type" -A 5

# Search specifically in configuration files and documentation
fd -e yaml -e md | xargs rg -i "event"

# Look for any mentions of 'present' in configuration context
rg -i "present" -A 5 -B 5 "config|configuration"

Length of output: 12629


Script:

#!/bin/bash
# Let's examine the configuration files and varlociraptor-related files more specifically

# Look for event configurations in yaml files
fd -e yaml | xargs rg -i "events:"

# Search for varlociraptor configuration or documentation files
fd -e yaml -e md | grep -i "varlociraptor" | xargs cat

# Look for event type definitions in the config directory
fd . config/ -t f | xargs rg -i "event.*type|present"

Length of output: 604


Script:

#!/bin/bash
# Let's check the specific config file and its context

# Display the content of the config file
cat config/config.yaml

# Look for any documentation about event types in the repository
fd . -t f -e md | xargs rg -i "event.*type|supported.*event"

# Check for any examples or templates that might show supported event types
fd . -t f -e yaml -e json | grep -i "template\|example" | xargs cat

Length of output: 11879

workflow/rules/common.smk (6)

33-34: LGTM: Consistent variable naming.

The new pangenome variables follow the established naming pattern used for genome variables.


268-269: LGTM: Clean implementation of pangenome aligner selection.

The function correctly returns "vg" when pangenome is activated, aligning with the PR's objective to support pangenome alignment.


443-443: LGTM: Consistent use of f-strings.

The change to f-string improves readability and maintains consistency with the codebase style.


632-636: LGTM: Proper VG read group formatting.

The function correctly formats read groups according to VG requirements, following the same pattern as other read group functions.


651-656: LGTM: Proper input selection based on UMI status.

The function correctly selects the appropriate input file based on whether the sample has UMIs.


708-720: LGTM: Clean implementation of indexed BCF handling.

The function properly handles both indexed and non-indexed BCF files with a clean implementation.

workflow/envs/pysam.yaml (1)

1-6: LGTM: Well-structured conda environment.

The environment is properly configured with:

  • Correct channel order (conda-forge before bioconda)
  • Pinned versions for both Python and pysam
🧰 Tools
🪛 yamllint (1.35.1)

[error] 6-6: no new line character at the end of file

(new-line-at-end-of-file)

workflow/scripts/rename_contigs.py Show resolved Hide resolved
workflow/scripts/rename_contigs.py Show resolved Hide resolved
workflow/scripts/rename_contigs.py Show resolved Hide resolved
workflow/rules/common.smk Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
config/config.yaml (2)

37-42: Enhance documentation for rename expressions

The rename expressions feature is powerful but could benefit from additional documentation. Consider adding:

  • Example of multiple expressions and their order of application
  • Warning about potential naming conflicts
  • Impact on downstream analysis

Add documentation like this:

     # optional: In case contigs of the hoplotype vcf need to be renamed
     # a list of python expressions can be defined.
     # Contigs can be accessed by `contig`.
     # Each expression will be successively saved into `renamed_contig`
+    # Multiple expressions are applied in order, with the output of one feeding into the next
+    # Example: ['contig.replace("chr", "")', "'MT' if contig == 'M' else contig"]
     rename_expressions:
       - "'MT' if contig == 'chrM' else contig[3:]"

176-178: Document the 'present' event type

The addition of the present event type lacks documentation explaining its purpose and when it should be used, particularly in the context of pangenome analysis.

Add documentation like this:

          # Add varlociraptor events to aggregated over.
          # The probability for the union of these events is used for controlling
          # the FDR.
+         # 'present' - Used for detecting variants present in the pangenome
          - present
          - somatic_tumor_high
          - somatic_tumor_medium
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 322811f and b1da7de.

📒 Files selected for processing (1)
  • config/config.yaml (2 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
config/config.yaml

[error] 28-28: trailing spaces

(trailing-spaces)

config/config.yaml Show resolved Hide resolved
workflow/rules/ref.smk Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)

651-656: Consider using get_aligner for more flexibility.

While the function correctly handles different BAM paths based on UMI presence, it hardcodes the "vg" aligner path. Consider making it more flexible by using the get_aligner function.

 def get_add_readgroup_input(wildcards):
+    aligner = get_aligner(wildcards)
     if sample_has_umis(wildcards.sample):
-        return "results/mapped/vg/{sample}.annotated.bam"
+        return f"results/mapped/{aligner}/{{sample}}.annotated.bam"
     else:
-        return "results/mapped/vg/{sample}.mate_fixed.bam"
+        return f"results/mapped/{aligner}/{{sample}}.mate_fixed.bam"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b1da7de and 235fc63.

📒 Files selected for processing (2)
  • workflow/rules/common.smk (7 hunks)
  • workflow/rules/ref.smk (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/rules/ref.smk
🔇 Additional comments (5)
workflow/rules/common.smk (5)

33-34: LGTM: Consistent naming convention for pangenome variables.

The new pangenome variables follow the established naming pattern and are properly constructed using f-strings.


268-269: LGTM: Proper integration of pangenome aligner selection.

The condition is correctly placed in the aligner selection logic and uses the established is_activated pattern.


459-464: LGTM: Well-structured VCF path handling for pangenome.

The function properly handles different VCF paths based on configuration and follows consistent path construction patterns.


632-637: LGTM: Consistent read group handling for vg aligner.

The function follows the established pattern for read group generation and includes all necessary fields.


708-720: LGTM: Clean implementation of indexed BCF handling.

The function properly handles the optional index parameter while maintaining backward compatibility.

@FelixMoelder FelixMoelder enabled auto-merge (squash) December 5, 2024 14:47
@FelixMoelder FelixMoelder merged commit e86cc81 into master Dec 5, 2024
8 checks passed
@FelixMoelder FelixMoelder deleted the allow-pangenome-mapping branch December 5, 2024 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants