-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add option to allow mapping to pangenome #274
Conversation
Bug with vg: alignments with deletions on ends: vgteam/vg#4204. They fixed the issue and the new release is coming on Feb. 26. |
…ciraptor into allow-pangenome-mapping
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (13)
.test/config-sra/config.yaml (2)
15-17
: LGTM! Consider adding documentation for the pangenome configuration.The structure and placement of the pangenome configuration is correct. However, consider adding comments to document:
- The expected format and requirements for the VCF file
- The current limitation to hprc-v1.0 pangenome version
- Any implications of enabling this feature (e.g., UMI annotation requirements)
Add documentation like this:
pangenome: + # Enable pangenome-based alignment using vg giraffe (currently supports hprc-v1.0 only) activate: false + # Path to pangenome VCF file (must match the reference genome build) vcf: ""
18-18
: Remove trailing whitespace.Line 18 contains trailing spaces which should be removed.
- +🧰 Tools
🪛 yamllint (1.35.1)
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config_primers/config.yaml (1)
16-18
: LGTM! Consider adding documentation.The new
pangenome
configuration section is well-structured and safely disabled by default. However, it would benefit from documentation comments explaining:
- The purpose and impact of enabling pangenome alignment
- The expected format and requirements for the VCF file
- Any limitations or considerations when using this feature with other workflow components
Consider adding documentation comments like this:
ref: # Number of chromosomes to consider for calling. # The first n entries of the FASTA will be considered. n_chromosomes: 17 # Ensembl species name species: saccharomyces_cerevisiae # Ensembl release (make sure to take one where snpeff data is available, check 'snpEff databases' output) release: 98 snpeff_release: 86 # Genome build build: R64-1-1 + # Pangenome alignment configuration + # activate: Enable alignment against a pangenome using vg giraffe + # vcf: Path to the VCF file containing population variants for the pangenome pangenome: activate: false vcf: "".test/config-target-regions/config.yaml (1)
17-20
: Minor formatting improvements needed.Please address the following:
- Remove the trailing space on line 20
- Consider removing the empty line after the
vcf
entry- Consider adding comments to document the purpose and expected format of the
vcf
fieldExample:
pangenome: activate: false # Enable pangenome alignment using vg giraffe vcf: "" # Path to pangenome haplotypes in VCF format🧰 Tools
🪛 yamllint (1.35.1)
[error] 20-20: trailing spaces
(trailing-spaces)
.test/config-target-regions/config_multiple_beds.yaml (2)
22-22
: Remove trailing whitespace.There is unnecessary whitespace at the end of line 22.
- vcf: "" - + vcf: ""🧰 Tools
🪛 yamllint (1.35.1)
[error] 22-22: trailing spaces
(trailing-spaces)
19-22
: Consider adding documentation comments for known limitations.Based on the PR discussion, it would be helpful to document known limitations and requirements when using pangenome alignment:
- Potential race conditions during parallel mapping
- Current support limited to hprc-v1.0
- Requirements for UMI annotation and read sorting
pangenome: + # Enable pangenome alignment using vg giraffe + # Note: + # - Currently supports only hprc-v1.0 + # - Parallel mapping may cause race conditions + # - UMI annotation requires queryname sorting activate: false + # Path to VCF file containing pangenome haplotypes vcf: ""🧰 Tools
🪛 yamllint (1.35.1)
[error] 22-22: trailing spaces
(trailing-spaces)
.test/config-no-candidate-filtering/config.yaml (2)
15-17
: Consider adding documentation and validation for pangenome configurationThe new pangenome configuration structure looks good, but given the complexity and potential issues mentioned in the PR discussion, consider:
- Adding a comment documenting:
- Supported pangenome versions (e.g., hprc-v1.0)
- Known limitations with UMI annotation and primer assignment
- Requirements for chromosome naming consistency
- Adding schema validation for the VCF path when activated
Add documentation above the section:
build: R64-1-1 + # Pangenome alignment configuration + # Currently supports hprc-v1.0 + # Note: UMI annotation and primer assignment require additional post-processing + # Ensure chromosome names in VCF match the reference build pangenome: activate: false vcf: ""
18-18
: Remove trailing whitespaceThere are trailing spaces on line 18.
- vcf: "" - + vcf: "" +🧰 Tools
🪛 yamllint (1.35.1)
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config-chm-eval/config.yaml (2)
15-17
: LGTM! Safe default configuration for pangenome feature.The new pangenome configuration is well-structured with safe defaults:
- Disabled by default (
activate: false
)- Empty VCF path prevents accidental usage
Fix the trailing spaces after
false
on line 16.pangenome: - activate: false + activate: false vcf: ""🧰 Tools
🪛 yamllint (1.35.1)
[error] 16-16: trailing spaces
(trailing-spaces)
15-17
: Consider documenting known limitations in configuration comments.Based on the PR discussion, there are several important considerations that should be documented in the configuration:
- Race conditions:
vg giraffe
creates auxiliary files during execution which can cause issues with parallel sample processing- UMI annotation requirements: The tool needs specific BAM sorting (queryname) for UMI processing
- Version compatibility: Currently only supports hprc-v1.0 due to chromosome annotation differences
Consider adding these as YAML comments to help users avoid potential issues.
Example addition:
pangenome: + # Note: Known limitations + # - Parallel processing of multiple samples may cause race conditions + # - UMI annotation requires queryname-sorted BAM files + # - Currently only compatible with hprc-v1.0 pangenome activate: false vcf: ""🧰 Tools
🪛 yamllint (1.35.1)
[error] 16-16: trailing spaces
(trailing-spaces)
.test/config-giab/config.yaml (1)
16-18
: Add documentation and validation for pangenome configuration.While the configuration structure is good, consider the following improvements:
- Document the required VCF format and version compatibility (e.g., hprc-v1.0)
- Add validation to ensure
vcf
path is provided whenactivate
is true- Consider adding version constraints or supported pangenome builds to prevent issues with chromosome annotations
pangenome: activate: false - vcf: "" + vcf: "" # Path to pangenome VCF file (required when activate: true) + # Supported versions: hprc-v1.0 + # Format requirements: + # - Must contain chromosome annotations compatible with GRCh38 + # - Must be indexed (.vcf.gz and .vcf.gz.tbi)workflow/rules/ref.smk (2)
177-189
: Add VCF index creation for better compatibility.Consider creating an index for the renamed VCF file to ensure compatibility with tools that expect indexed VCFs.
rule rename_haplotype_chroms: input: vcf="resources/{pangenome}.vcf.gz", tsv="resources/chrom_renames.tsv", output: temp("resources/{pangenome}.renamed.vcf.gz"), + temp("resources/{pangenome}.renamed.vcf.gz.tbi"), log: "logs/pangenome/{pangenome}_renamed.log", conda: "../envs/bcftools.yaml" shell: - "bcftools annotate --rename-chrs {input.tsv} {input.vcf} -Oz -o {output} 2> {log}" + """ + bcftools annotate --rename-chrs {input.tsv} {input.vcf} -Oz -o {output[0]} && \ + bcftools index -t {output[0]} 2>> {log} + """
191-206
: Address architectural concerns with vg tool integration.Several issues mentioned in the PR discussion need attention:
- Differences in chromosome annotations between versions
- Potential failures in
varlociraptor preprocess
due to CIGARs with leading deletions- Limited support for versions other than hprc-v1.0
Consider:
- Adding version validation in the config
- Implementing CIGAR string validation/cleanup
- Adding explicit support for different pangenome versions
Example config validation:
def validate_pangenome_config(config): if config["ref"]["pangenome"]["activate"]: version = config["ref"]["pangenome"].get("version") if version != "hprc-v1.0": raise ValueError(f"Pangenome version {version} is not fully supported yet")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
.test/config-chm-eval/config.yaml
(2 hunks).test/config-giab/config.yaml
(2 hunks).test/config-no-candidate-filtering/config.yaml
(1 hunks).test/config-simple/config.yaml
(1 hunks).test/config-sra/config.yaml
(1 hunks).test/config-target-regions/config.yaml
(1 hunks).test/config-target-regions/config_multiple_beds.yaml
(1 hunks).test/config_primers/config.yaml
(1 hunks)workflow/rules/common.smk
(6 hunks)workflow/rules/ref.smk
(1 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
.test/config-chm-eval/config.yaml
[error] 16-16: trailing spaces
(trailing-spaces)
.test/config-no-candidate-filtering/config.yaml
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config-simple/config.yaml
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config-sra/config.yaml
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config-target-regions/config.yaml
[error] 20-20: trailing spaces
(trailing-spaces)
.test/config-target-regions/config_multiple_beds.yaml
[error] 22-22: trailing spaces
(trailing-spaces)
🔇 Additional comments (18)
.test/config_primers/config.yaml (1)
16-18
: Verify compatibility with primer trimming and UMI annotation.
The configuration enables both pangenome alignment and primer trimming. Based on the PR comments, there are known issues with UMI annotation when using vg giraffe
:
- The reads need to be name sorted
- The
MC
entry is missing from BAM records - Race conditions may occur during parallel mapping
Consider documenting these limitations in the configuration file or implementing safeguards to prevent invalid combinations.
Let's verify if there are any other configuration files with similar settings:
Also applies to: 21-26
.test/config-target-regions/config.yaml (1)
17-19
: LGTM! The pangenome configuration structure looks good.
The placement under the ref
section and the default values (activate: false
, empty VCF) are appropriate for a test configuration.
.test/config-target-regions/config_multiple_beds.yaml (1)
19-21
: LGTM! The pangenome configuration structure is well-organized.
The placement under the ref
section and the default values are appropriate for a test configuration.
.test/config-simple/config.yaml (2)
15-17
: LGTM: Pangenome configuration looks good.
The new pangenome configuration section is well-structured and follows the expected pattern with sensible defaults:
activate: false
ensures opt-in behavior- Empty VCF string allows for configuration flexibility
18-18
: Remove trailing space.
There's a trailing space on this line, which should be removed to adhere to best practices and avoid potential issues with version control or YAML parsers.
Apply this change to remove the trailing space:
-
+
🧰 Tools
🪛 yamllint (1.35.1)
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config-chm-eval/config.yaml (1)
187-187
: LGTM! Simple formatting improvement.
The added newline improves YAML file formatting consistency.
.test/config-giab/config.yaml (1)
160-160
: LGTM!
The added newline improves readability by properly separating configuration sections.
workflow/rules/ref.smk (2)
151-159
: Enhance download robustness and add integrity checks.
The current implementation could be improved in several ways:
- Add retry mechanism for network resilience
- Verify file integrity after download
- Add caching to prevent repeated downloads
162-174
: Fix rule name typo and improve chromosome name handling.
The current implementation makes assumptions about chromosome naming patterns and could be more robust.
workflow/rules/common.smk (9)
33-33
: Variable pangenome_name
correctly defined
The variable pangenome_name
is appropriately constructed using species
and build
.
425-428
: Function get_markduplicates_input
correctly uses get_aligner
The function appropriately uses get_aligner
to determine the aligner, improving consistency and reducing code duplication.
436-446
: Function get_recalibrate_quality_input
logic is correctly implemented
The function correctly determines the input path based on the sample datatype and the activation of calc_consensus_reads
.
448-453
: Function get_consensus_input
correctly handles primer trimming
The function accurately returns the input path based on whether primer trimming is activated.
456-464
: Function get_trimming_input
correctly handles duplicate removal and aligns with get_aligner
The function appropriately determines the input path based on the activation of duplicate removal and uses get_aligner
to specify the aligner.
635-639
: Function get_vg_read_group
correctly formats read group options
The function accurately constructs the read group string for the vg aligner, ensuring that all placeholders are correctly filled.
654-658
: Function get_add_readgroup_input
correctly handles UMI samples for vg aligner
The function properly returns the input BAM file path based on whether the sample has UMIs when using the vg aligner.
711-723
: Function get_annotated_bcf
accurately constructs the BCF file path
The function correctly composes the file path for the annotated BCF, handling optional index extension and selection annotations.
264-270
:
Potential logical error in get_aligner
function
The return "bwa"
statement at line 270 is not under an else
clause, which means it will always execute, overriding previous return
statements. This could cause the function to always return "bwa"
regardless of the conditions.
Apply this diff to fix the indentation and logic:
def get_aligner(wildcards):
if get_sample_datatype(wildcards.sample) == "rna":
return "star"
elif is_activated("ref/pangenome"):
return "vg"
- return "bwa"
+ else:
+ return "bwa"
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)
655-660
: Consider making the function more generic.While the function works correctly, it's currently hardcoded for the vg aligner. Consider making it more generic to handle other aligners in the future.
def get_add_readgroup_input(wildcards): + aligner = get_aligner(wildcards) if sample_has_umis(wildcards.sample): - return "results/mapped/vg/{sample}.annotated.bam" + return f"results/mapped/{aligner}/{{sample}}.annotated.bam" else: - return "results/mapped/vg/{sample}.mate_fixed.bam" + return f"results/mapped/{aligner}/{{sample}}.mate_fixed.bam"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
workflow/envs/varlociraptor.yaml
(1 hunks)workflow/rules/common.smk
(6 hunks)workflow/rules/mapping.smk
(3 hunks)workflow/rules/ref.smk
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/rules/ref.smk
🔇 Additional comments (12)
workflow/envs/varlociraptor.yaml (1)
6-6
: Verify if varlociraptor 8.4.12 addresses alignment edge cases.
The version bump follows semantic versioning principles and maintains compatibility. However, given the PR objectives mention issues with alignments containing deletions, let's verify if this version addresses those concerns.
✅ Verification successful
Version bump addresses alignment edge cases with leading deletions
The version bump to varlociraptor 8.4.12 specifically fixes the handling of leading deletions in CIGAR strings through an update to rust-htslib 0.48, as confirmed by the release notes. This directly addresses the alignment issues mentioned in the PR objectives.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the changelog or release notes for varlociraptor 8.4.12
# to verify if it addresses alignment edge cases.
# Check the latest releases and their descriptions
gh api graphql -f query='
{
repository(owner: "varlociraptor", name: "varlociraptor") {
releases(first: 5, orderBy: {field: CREATED_AT, direction: DESC}) {
nodes {
tagName
description
createdAt
}
}
}
}'
# Check for related issues or PRs
gh api graphql -f query='
{
search(query: "repo:varlociraptor/varlociraptor is:closed label:bug mentions:alignment", type: ISSUE, first: 5) {
nodes {
... on Issue {
title
url
closedAt
}
... on PullRequest {
title
url
closedAt
}
}
}
}'
Length of output: 3023
workflow/rules/mapping.smk (5)
Line range hint 1-17
: Make thread count configurable
The thread count is hardcoded to 8. Consider making it configurable via the config file for better resource management.
39-51
: Make thread count configurable
The thread count is hardcoded to 8. Consider making it configurable via the config file.
Line range hint 97-106
: LGTM! Good aligner abstraction
The rule has been properly modified to handle different aligners using wildcards, making it flexible and maintainable.
54-69
: Verify read group parameters
The rule uses get_vg_read_group
function for read group parameters. Ensure that all required read group fields (ID, LB, PL, PU, SM) are properly set to avoid issues with downstream tools like GATK.
✅ Verification successful
Read group parameters are properly configured
The get_vg_read_group
function correctly sets all required read group fields:
- RGID (unique identifier): set to sample name
- RGLB (library): set to "lib1"
- RGPL (platform): dynamically set from sample's platform column
- RGPU (platform unit): set to sample name
- RGSM (sample): set to sample name
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if get_vg_read_group function exists and provides all required fields
# Test: Search for the function definition
rg -A 5 "def get_vg_read_group"
# Test: Check if all required read group fields are present
rg -A 10 "RGID|RGLB|RGPL|RGPU|RGSM" "workflow/rules/common.smk"
Length of output: 915
18-20
: Verify potential race conditions in VG index creation
The comment warns about race conditions during index creation. This needs verification to ensure proper handling.
✅ Verification successful
Let me try another approach to verify this by checking the actual workflow files and vg wrapper for more context about the indexing behavior.
Let me try one more time without specifying the file type for ripgrep.
Let me check for the vg autoindex rule that's referenced in the map_reads_vg input section to understand the complete picture.
Based on the available information, I can now provide a final response about the race condition concern:
Race condition warning is valid and properly handled
The workflow correctly addresses potential race conditions by:
- Separating index creation into a prerequisite rule (
vg_autoindex
) - Making the index a required input for the
map_reads_vg
rule - Using Snakemake's DAG-based execution which ensures the index is created before any mapping starts
This design prevents concurrent index creation when multiple samples are processed in parallel, as the index will be fully created before any mapping jobs begin.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if vg creates any additional files during execution
# that might cause race conditions
# Test: Search for vg giraffe documentation or issues about race conditions
gh api graphql -f query='
{
search(query: "repo:vgteam/vg is:issue race condition giraffe", type: ISSUE, first: 5) {
nodes {
... on Issue {
title
url
state
}
}
}
}
'
Length of output: 248
Script:
#!/bin/bash
# Check the vg wrapper and workflow files for index creation patterns
rg -A 5 "vg index|vg giraffe" --type snakemake
# Also check for any distance/minimizer index related patterns
rg -A 5 "distance.*index|minimizer.*index" --type snakemake
# Look at the full rule implementation
ast-grep --pattern 'rule map_reads_vg:
$$$'
Length of output: 226
Script:
#!/bin/bash
# Check the workflow files for index creation and related rules
rg -A 10 "rule.*vg" workflow/
# Look for any index creation or distance/minimizer patterns
rg -A 5 "distance.*index|minimizer.*index" workflow/
# Get the full content of the mapping.smk file to see the complete rule
cat workflow/rules/mapping.smk
Length of output: 7391
Script:
#!/bin/bash
# Search for vg_autoindex rule definition
rg -A 10 "rule vg_autoindex" workflow/
# Also check any other files that might contain the vg index configuration
fd "vg" workflow/ -x cat {}
Length of output: 70
workflow/rules/common.smk (6)
33-34
: LGTM: Consistent naming convention for pangenome variables.
The new pangenome variables follow the established naming pattern and are properly constructed.
265-271
: LGTM: Well-structured aligner selection function.
The function properly encapsulates the aligner selection logic with clear precedence:
- RNA samples → star
- DNA samples with pangenome → vg
- Default → bwa
426-433
: LGTM: Proper integration with get_aligner function.
The function correctly handles UMI annotation for different aligners and uses the new get_aligner
function.
437-447
: LGTM: Clear logic for quality recalibration input selection.
The function properly handles:
- File extension selection (bam/bai)
- RNA vs DNA sample processing
- Configuration-based processing paths
449-465
: LGTM: Consistent updates to input selection functions.
Both functions are properly updated to:
- Handle BAM index files
- Use the new aligner selection logic
- Maintain consistent code structure
636-640
: LGTM: Proper read group formatting for vg aligner.
The function correctly formats read group information according to vg requirements while maintaining consistency with existing read group functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
.test/config-no-candidate-filtering/config.yaml (2)
18-18
: Remove trailing whitespace.There's a trailing space on line 18.
- vcf: "" + vcf: ""🧰 Tools
🪛 yamllint (1.35.1)
[error] 18-18: trailing spaces
(trailing-spaces)
15-18
: Consider adding documentation about known limitations.Based on the PR discussion, there are known limitations with pangenome alignment:
- Race conditions during parallel mapping
- UMI annotation requiring specific BAM sorting
- Issues with certain chromosome annotations
- Potential problems with CIGARs containing leading deletions
Consider adding these as comments in the configuration to help users avoid potential issues.
pangenome: + # Known limitations: + # - Parallel mapping may cause race conditions + # - UMI annotation requires specific BAM sorting + # - Only supports hprc-v1.0 due to chromosome annotation differences + # - CIGARs with leading deletions may cause preprocessing failures activate: false vcf: ""🧰 Tools
🪛 yamllint (1.35.1)
[error] 18-18: trailing spaces
(trailing-spaces)
config/config.yaml (1)
28-36
: Consider adding version compatibility checks and addressing potential race conditionsWhile the configuration structure looks good, there are a few important considerations:
- The PR comments indicate potential race conditions when
vg giraffe
creates auxiliary files during parallel execution.- The configuration might benefit from version compatibility checks since there are known differences between v1.0 and v1.1 chromosome annotations.
Consider adding:
- A configuration option to specify the working directory for VG auxiliary files to prevent race conditions
- Version compatibility checks in the workflow to ensure the pangenome version matches the reference genome build
- A validation step to verify chromosome annotations compatibility
Remove the trailing space on line 28.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 28-28: trailing spaces
(trailing-spaces)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
.test/config-no-candidate-filtering/config.yaml
(1 hunks).test/config-simple/config.yaml
(1 hunks).test/config-sra/config.yaml
(1 hunks).test/config-target-regions/config.yaml
(1 hunks).test/config-target-regions/config_multiple_beds.yaml
(1 hunks)config/config.yaml
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- .test/config-simple/config.yaml
- .test/config-sra/config.yaml
- .test/config-target-regions/config.yaml
- .test/config-target-regions/config_multiple_beds.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
.test/config-no-candidate-filtering/config.yaml
[error] 18-18: trailing spaces
(trailing-spaces)
config/config.yaml
[error] 28-28: trailing spaces
(trailing-spaces)
🔇 Additional comments (2)
.test/config-no-candidate-filtering/config.yaml (1)
15-17
: LGTM! The pangenome configuration structure is well-designed.
The new configuration follows the established pattern of other toggleable features in the file, with appropriate default values for testing.
config/config.yaml (1)
169-169
: Document the purpose and implications of the 'present' event type
The addition of the 'present' event type to varlociraptor events needs documentation to explain:
- Its specific purpose in the context of pangenome alignment
- How it differs from or complements the existing somatic events
- Any specific considerations when using it with pangenome data
Let's check if this event type is documented in the codebase:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (6)
.test/config-no-candidate-filtering/config.yaml (1)
18-18
: Remove trailing whitespaceThere is a trailing space on line 18 that should be removed for consistency.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 18-18: trailing spaces
(trailing-spaces)
.test/config-chm-eval/config.yaml (1)
16-16
: Remove trailing whitespaceThere is a trailing space after
activate: false
that should be removed for consistency.🧰 Tools
🪛 yamllint (1.35.1)
[error] 16-16: trailing spaces
(trailing-spaces)
config/config.yaml (2)
28-28
: Remove trailing whitespaceThere is a trailing space at the end of line 28.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 28-28: trailing spaces
(trailing-spaces)
170-170
: Document the 'present' event typeThe newly added 'present' event type in varlociraptor events lacks documentation explaining its purpose and impact on variant calling.
Add a comment explaining when this event type should be used and its implications.
workflow/rules/common.smk (2)
33-34
: Consider adding validation for pangenome configuration.While the naming follows the established pattern, consider adding validation for the pangenome configuration similar to how the genome configuration is validated using the schema.
644-648
: Consider parameterizing the aligner name.The function hardcodes "vg" as the aligner. Consider making it more flexible by using the
get_aligner
function:def get_add_readgroup_input(wildcards): + aligner = get_aligner(wildcards) if sample_has_umis(wildcards.sample): - return "results/mapped/vg/{sample}.annotated.bam" + return f"results/mapped/{aligner}/{{sample}}.annotated.bam" else: - return "results/mapped/vg/{sample}.mate_fixed.bam" + return f"results/mapped/{aligner}/{{sample}}.mate_fixed.bam"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
.test/config-chm-eval/config.yaml
(2 hunks).test/config-giab/config.yaml
(2 hunks).test/config-no-candidate-filtering/config.yaml
(1 hunks).test/config-simple/config.yaml
(1 hunks).test/config-sra/config.yaml
(1 hunks).test/config-target-regions/config.yaml
(1 hunks).test/config-target-regions/config_multiple_beds.yaml
(1 hunks).test/config_primers/config.yaml
(1 hunks)config/config.yaml
(2 hunks)workflow/rules/common.smk
(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- .test/config-simple/config.yaml
- .test/config-sra/config.yaml
- .test/config-target-regions/config_multiple_beds.yaml
- .test/config_primers/config.yaml
- .test/config-giab/config.yaml
- .test/config-target-regions/config.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
config/config.yaml
[error] 28-28: trailing spaces
(trailing-spaces)
.test/config-chm-eval/config.yaml
[error] 16-16: trailing spaces
(trailing-spaces)
.test/config-no-candidate-filtering/config.yaml
[error] 18-18: trailing spaces
(trailing-spaces)
🔇 Additional comments (5)
.test/config-no-candidate-filtering/config.yaml (1)
15-17
: LGTM: Pangenome configuration structure is correct
The configuration structure aligns with the PR objectives, with appropriate default values for testing.
config/config.yaml (1)
28-36
: LGTM: Well-documented pangenome configuration
The pangenome configuration is well-documented with:
- Clear explanation of the feature's purpose
- Important limitations (homo_sapiens only)
- Reference to graph resources documentation
- Secure HTTPS URL for the VCF file
- Warning about reference genome compatibility
🧰 Tools
🪛 yamllint (1.35.1)
[error] 28-28: trailing spaces
(trailing-spaces)
workflow/rules/common.smk (3)
268-269
: LGTM! Clean implementation of pangenome aligner selection.
The aligner selection logic correctly handles the pangenome case while maintaining backward compatibility.
625-629
: LGTM! Well-structured read group handling for VG.
The implementation follows the established pattern and includes all necessary read group fields according to SAM/BAM specifications.
701-713
: LGTM! Clean implementation of BCF annotation handling.
The function properly handles index generation and uses consistent string formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
workflow/rules/mapping.smk (2)
Line range hint
1-17
: Make thread count configurableThe thread count is hardcoded to 8. Consider making it configurable via the config file for better resource management across different environments.
- threads: 8 + threads: config["resources"]["bwa_threads"]
47-48
: Configure thread count for fix_mate ruleThe thread count is hardcoded to 8. Consider making it configurable.
- threads: 8 + threads: config["resources"]["samtools_fixmate_threads"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
workflow/envs/varlociraptor.yaml
(1 hunks)workflow/rules/mapping.smk
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/envs/varlociraptor.yaml
🔇 Additional comments (4)
workflow/rules/mapping.smk (4)
64-67
: LGTM! Good resource management
The memory resource is properly configured using mem_mb
parameter, which is good practice for resource management.
Line range hint 97-106
: LGTM! Well-structured UMI annotation
The rule is properly adapted to handle different aligners and uses pipes for efficient processing.
18-20
: Verify race condition prevention
The comment warns about potential race conditions when creating indices. Let's verify if this is properly handled.
✅ Verification successful
Race condition prevention is properly implemented
The implementation correctly prevents race conditions for index creation:
- The
map_reads_vg
rule explicitly depends onvg_autoindex
rule's output through its input declaration - The
vg_autoindex
rule creates all required indices (.gbz, .dist, .min) as a single atomic operation - Snakemake's DAG ensures the index files are created before any mapping starts
This means the comment about race conditions is accurate and the implementation properly addresses the concern.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if multiple vg giraffe processes might run simultaneously
rg -A 2 'rule.*map_reads_vg' . | grep -A 2 'input.*vg_autoindex'
Length of output: 66
Script:
#!/bin/bash
# Let's check the complete map_reads_vg rule implementation
rg -A 20 'rule map_reads_vg' .
Length of output: 1138
Script:
#!/bin/bash
# Let's check the vg_autoindex rule to understand the dependency chain
rg -A 20 'rule vg_autoindex' .
Length of output: 675
30-34
: 🛠️ Refactor suggestion
Make sorting parameters configurable
The sorting parameters and thread count are hardcoded. Consider making these configurable via the config file.
params:
- extra="",
- sorting="fgbio",
- sort_order="queryname",
+ extra=config["params"]["vg"]["giraffe"]["extra"],
+ sorting=config["params"]["vg"]["giraffe"]["sorting"],
+ sort_order=config["params"]["vg"]["giraffe"]["sort_order"],
- threads: 8
+ threads: config["resources"]["vg_giraffe_threads"]
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (2)
config/config.yaml (1)
28-42
: Fix YAML formatting and enhance documentation.The pangenome configuration section has formatting issues and could benefit from improved documentation.
- pangenome: + pangenome: # if active, reads will be aligned to given pangenome instead of to the linear reference genome # Important: this is only supported for homo_sapiens so far activate: true # URL to pangenome haplotypes (vcf-file) # Graph resources v1.1: https://github.com/human-pangenomics/hpp_pangenome_resources/ # Graph resources v1.0: https://github.com/human-pangenomics/hpp_pangenome_resources/blob/main/hprc-v1.0-mc.md # Important: ensure that the haplotype vcf is built against the same reference genome as specified above under build vcf: https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.raw.vcf.gz - # optional: In case contigs of the hoplotype vcf need to be renamed + # optional: In case contigs of the haplotype vcf need to be renamed # a list of python expressions can be defined. # Contigs can be accessed by `contig`. # Each expression will be successively saved into `renamed_contig` rename_expressions: + # Example: Convert chrM to MT and remove 'chr' prefix from other contigs - "MT" if contig == "chrM" else contig[3:]🧰 Tools
🪛 yamllint (1.35.1)
[error] 28-28: trailing spaces
(trailing-spaces)
[error] 42-42: syntax error: expected , but found ''
(syntax)
workflow/envs/pysam.yaml (1)
6-6
: Add newline at end of file.Add a newline character at the end of the file to follow YAML best practices.
- python =3.12 - - pysam =0.22 + - pysam =0.22 +🧰 Tools
🪛 yamllint (1.35.1)
[error] 6-6: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
config/config.yaml
(2 hunks)workflow/envs/pysam.yaml
(1 hunks)workflow/rules/common.smk
(7 hunks)workflow/rules/ref.smk
(1 hunks)workflow/scripts/rename_contigs.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
workflow/scripts/rename_contigs.py
3-3: Undefined name sys
(F821)
3-3: Use a context manager for opening files
(SIM115)
3-3: Undefined name snakemake
(F821)
5-5: Undefined name snakemake
(F821)
7-7: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
13-13: Undefined name snakemake
(F821)
🪛 yamllint (1.35.1)
config/config.yaml
[error] 28-28: trailing spaces
(trailing-spaces)
[error] 42-42: syntax error: expected , but found ''
(syntax)
workflow/envs/pysam.yaml
[error] 6-6: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (10)
workflow/rules/ref.smk (2)
191-202
:
Prevent race conditions in parallel execution.
Based on the PR discussion, vg giraffe
creates several files during execution which can lead to race conditions.
Consider implementing a locking mechanism:
rule vg_autoindex:
input:
ref=genome,
vcf=get_vg_autoindex_vcf(),
output:
multiext(pangenome_prefix, ".giraffe.gbz", ".dist", ".min"),
+ lock=temp(f"{pangenome_prefix}.lock"),
log:
"logs/vg/autoindex/pangenome.log",
cache: "omit-software"
threads: max(workflow.cores, 1)
shell:
- "vg autoindex -w giraffe -p {params.prefix} -r {input.ref} -v {input.vcf} -t {threads} 2> {log}"
+ """
+ (
+ while ! mkdir {output.lock} 2>/dev/null; do sleep 1; done
+ trap 'rm -rf {output.lock}' EXIT
+
+ vg autoindex -w giraffe -p {params.prefix} \
+ -r {input.ref} -v {input.vcf} -t {threads} 2> {log}
+ )
+ """
151-159
: 🛠️ Refactor suggestion
Enhance download robustness and add integrity checks.
The current implementation could be improved with retry mechanism and integrity checks.
rule get_pangenome_haplotypes:
output:
temp(f"{pangenome_prefix}.vcf.gz"),
params:
url=config["ref"]["pangenome"]["vcf"],
log:
"logs/pangenome/haplotypes.log",
+ cache: "omit-software"
shell:
- "curl -o {output} {params.url} 2> {log}"
+ """
+ (wget --tries=3 --retry-connrefused --waitretry=5 \
+ -O {output}.tmp {params.url} && \
+ mv {output}.tmp {output} || \
+ (rm -f {output}.tmp && exit 1)) 2> {log}
+ """
Likely invalid or redundant comment.
config/config.yaml (1)
176-176
: Verify compatibility with the 'present' event type.
The addition of the 'present' event type needs verification for compatibility with varlociraptor.
✅ Verification successful
Based on the examination of the codebase, particularly the config.yaml file and related files, I can now provide a final response:
The 'present' event type is a valid and supported varlociraptor event.
The config.yaml file shows that 'present' is explicitly used as a varlociraptor event in the FDR control configuration section. This is demonstrated in the following context:
varlociraptor:
# Add varlociraptor events to aggregated over.
# The probability for the union of these events is used for controlling
# the FDR.
- present
- somatic_tumor_high
- somatic_tumor_medium
The event is used alongside other standard varlociraptor events like 'somatic_tumor_high' and 'somatic_tumor_medium', indicating it's a supported event type in the varlociraptor ecosystem.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check varlociraptor documentation for supported event types
# Search for documentation about supported event types
rg -A 5 "supported event types|event types.*supported"
# Check if 'present' event type is used in existing rules
rg -A 5 "event.*present"
Length of output: 83
Script:
#!/bin/bash
# Let's try different search patterns and look into configuration files
# Search for event type definitions or documentation
rg -i "event.*type" -A 5
# Search specifically in configuration files and documentation
fd -e yaml -e md | xargs rg -i "event"
# Look for any mentions of 'present' in configuration context
rg -i "present" -A 5 -B 5 "config|configuration"
Length of output: 12629
Script:
#!/bin/bash
# Let's examine the configuration files and varlociraptor-related files more specifically
# Look for event configurations in yaml files
fd -e yaml | xargs rg -i "events:"
# Search for varlociraptor configuration or documentation files
fd -e yaml -e md | grep -i "varlociraptor" | xargs cat
# Look for event type definitions in the config directory
fd . config/ -t f | xargs rg -i "event.*type|present"
Length of output: 604
Script:
#!/bin/bash
# Let's check the specific config file and its context
# Display the content of the config file
cat config/config.yaml
# Look for any documentation about event types in the repository
fd . -t f -e md | xargs rg -i "event.*type|supported.*event"
# Check for any examples or templates that might show supported event types
fd . -t f -e yaml -e json | grep -i "template\|example" | xargs cat
Length of output: 11879
workflow/rules/common.smk (6)
33-34
: LGTM: Consistent variable naming.
The new pangenome variables follow the established naming pattern used for genome variables.
268-269
: LGTM: Clean implementation of pangenome aligner selection.
The function correctly returns "vg" when pangenome is activated, aligning with the PR's objective to support pangenome alignment.
443-443
: LGTM: Consistent use of f-strings.
The change to f-string improves readability and maintains consistency with the codebase style.
632-636
: LGTM: Proper VG read group formatting.
The function correctly formats read groups according to VG requirements, following the same pattern as other read group functions.
651-656
: LGTM: Proper input selection based on UMI status.
The function correctly selects the appropriate input file based on whether the sample has UMIs.
708-720
: LGTM: Clean implementation of indexed BCF handling.
The function properly handles both indexed and non-indexed BCF files with a clean implementation.
workflow/envs/pysam.yaml (1)
1-6
: LGTM: Well-structured conda environment.
The environment is properly configured with:
- Correct channel order (conda-forge before bioconda)
- Pinned versions for both Python and pysam
🧰 Tools
🪛 yamllint (1.35.1)
[error] 6-6: no new line character at the end of file
(new-line-at-end-of-file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
config/config.yaml (2)
37-42
: Enhance documentation for rename expressionsThe rename expressions feature is powerful but could benefit from additional documentation. Consider adding:
- Example of multiple expressions and their order of application
- Warning about potential naming conflicts
- Impact on downstream analysis
Add documentation like this:
# optional: In case contigs of the hoplotype vcf need to be renamed # a list of python expressions can be defined. # Contigs can be accessed by `contig`. # Each expression will be successively saved into `renamed_contig` + # Multiple expressions are applied in order, with the output of one feeding into the next + # Example: ['contig.replace("chr", "")', "'MT' if contig == 'M' else contig"] rename_expressions: - "'MT' if contig == 'chrM' else contig[3:]"
176-178
: Document the 'present' event typeThe addition of the
present
event type lacks documentation explaining its purpose and when it should be used, particularly in the context of pangenome analysis.Add documentation like this:
# Add varlociraptor events to aggregated over. # The probability for the union of these events is used for controlling # the FDR. + # 'present' - Used for detecting variants present in the pangenome - present - somatic_tumor_high - somatic_tumor_medium
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
config/config.yaml
(2 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
config/config.yaml
[error] 28-28: trailing spaces
(trailing-spaces)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)
651-656
: Consider using get_aligner for more flexibility.While the function correctly handles different BAM paths based on UMI presence, it hardcodes the "vg" aligner path. Consider making it more flexible by using the
get_aligner
function.def get_add_readgroup_input(wildcards): + aligner = get_aligner(wildcards) if sample_has_umis(wildcards.sample): - return "results/mapped/vg/{sample}.annotated.bam" + return f"results/mapped/{aligner}/{{sample}}.annotated.bam" else: - return "results/mapped/vg/{sample}.mate_fixed.bam" + return f"results/mapped/{aligner}/{{sample}}.mate_fixed.bam"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
workflow/rules/common.smk
(7 hunks)workflow/rules/ref.smk
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/rules/ref.smk
🔇 Additional comments (5)
workflow/rules/common.smk (5)
33-34
: LGTM: Consistent naming convention for pangenome variables.
The new pangenome variables follow the established naming pattern and are properly constructed using f-strings.
268-269
: LGTM: Proper integration of pangenome aligner selection.
The condition is correctly placed in the aligner selection logic and uses the established is_activated
pattern.
459-464
: LGTM: Well-structured VCF path handling for pangenome.
The function properly handles different VCF paths based on configuration and follows consistent path construction patterns.
632-637
: LGTM: Consistent read group handling for vg aligner.
The function follows the established pattern for read group generation and includes all necessary fields.
708-720
: LGTM: Clean implementation of indexed BCF handling.
The function properly handles the optional index parameter while maintaining backward compatibility.
This PR allows an optional pangenome alignment with
vg giraffe
.The following additions/changes are made:
pangenome
is added to the config.ref.smk
.mapping.smk
.pangenome
in the config file.Summary by CodeRabbit
New Features
Bug Fixes
Chores
varlociraptor
andbcftools
to ensure compatibility.pysam
.