-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hg38 support #6
base: master
Are you sure you want to change the base?
hg38 support #6
Conversation
…tralized in GlobalAppConfig. Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
…kflow readmes. Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
WalkthroughThis update significantly refines the SOPHIA tool's codebase, improving code clarity, maintainability, and functionality. It introduces comprehensive changes across various components, including enhanced support for chromosome naming conventions, refined data types, and expanded testing. The build system is now more robust, with updated compiler configurations and the introduction of new binaries, demonstrating a concerted effort to address previous limitations and user feedback. Changes
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 23
Configuration used: CodeRabbit UI
Files selected for processing (33)
- .gitignore (1 hunks)
- README.md (2 hunks)
- Release_sophia/build-sophia.sh (2 hunks)
- Release_sophiaAnnotate/build-sophiaAnnotate.sh (3 hunks)
- Release_sophiaMref/build-sophiaMref.sh (1 hunks)
- include/Alignment.h (1 hunks)
- include/Breakpoint.h (2 hunks)
- include/BreakpointReduced.h (1 hunks)
- include/ChrConverter.h (2 hunks)
- include/GlobalAppConfig.h (1 hunks)
- include/Hg37ChrConverter.h (1 hunks)
- include/Hg38ChrConverter.h (1 hunks)
- include/MasterRefProcessor.h (2 hunks)
- include/MrefEntry.h (2 hunks)
- src/Alignment.cpp (5 hunks)
- src/AnnotationProcessor.cpp (8 hunks)
- src/Breakpoint.cpp (5 hunks)
- src/BreakpointReduced.cpp (3 hunks)
- src/ChrConverter.cpp (2 hunks)
- src/GlobalAppConfig.cpp (1 hunks)
- src/HelperFunctions.cpp (1 hunks)
- src/Hg37ChrConverter.cpp (1 hunks)
- src/Hg38ChrConverter.cpp (1 hunks)
- src/IndexTable.cpp (1 hunks)
- src/MasterRefProcessor.cpp (5 hunks)
- src/MrefEntry.cpp (3 hunks)
- src/MrefEntryAnno.cpp (1 hunks)
- src/SuppAlignment.cpp (10 hunks)
- src/SuppAlignmentAnno.cpp (3 hunks)
- src/SvEvent.cpp (5 hunks)
- src/sophia.cpp (2 hunks)
- src/sophiaAnnotate.cpp (5 hunks)
- src/sophiaMref.cpp (1 hunks)
Files skipped from review due to trivial changes (3)
- .gitignore
- Release_sophiaMref/build-sophiaMref.sh
- src/MrefEntryAnno.cpp
Additional comments: 116
README.md (2)
44-44: The statement about the impossibility of installing glibc and boost static libraries with Conda might be outdated or incorrect, as Conda packages for these libraries do exist. Consider verifying this information and updating the README accordingly to avoid confusion.
58-65: The changes listed for version 35.1.0 are well-documented, including the new support for hg38, the addition of the
--assemblyname
option, and the improvements to the build script and code readability. The warning about hg38 support not being extensively tested is important and should be highlighted to users.Release_sophia/build-sophia.sh (3)
17-17: The update to the C++ standard from
c++1z
toc++17
in theCPP_OPTS
variable is a good move for ensuring modern language features and better compatibility.26-29: The addition of
GlobalAppConfig.o
,Hg37ChrConverter.o
, andHg38ChrConverter.o
to the compilation and linking process is consistent with the new class implementations and the centralization of theChrConverter
within theGlobalAppConfig
singleton.36-49: The final linking command correctly includes the new object files, ensuring that the new
ChrConverter
subclasses and theGlobalAppConfig
are part of thesophia
executable.Release_sophiaAnnotate/build-sophiaAnnotate.sh (3)
16-16: The update to the
CPP_OPTS
variable to use C++17 standard is consistent with the PR objectives to modernize the codebase.25-28: The addition of new source files
GlobalAppConfig.cpp
,Hg37ChrConverter.cpp
, andHg38ChrConverter.cpp
to the build script aligns with the PR's goal to support the hg38 human genome reference.40-58: The linking command has been correctly updated to include the new object files, ensuring that the new classes are part of the final
sophiaAnnotate
executable.include/Alignment.h (6)
46-47: The addition of the
continueConstruction
method to theAlignment
class is consistent with the PR objectives and the AI-generated summary.49-54: Reformatting of static integer declarations improves readability.
133-204: New private methods and member variables have been added to the
Alignment
class, which are likely to support the new functionality for genome reference handling. Ensure that these methods are consistent with the class's existing logic and follow best practices.112-112: The change in the method signature of
addSupplementaryAlignments
to take aconst vector<SuppAlignment> &
is consistent with modern C++ practices to avoid unnecessary copies.150-154: Verify that the removal of logic within the
fullMedianQuality
andgetMedian
methods does not negatively affect functionality and that any removed logic is either no longer needed or has been refactored elsewhere.43-209: > Note: This review was outside the patches, and no patch overlapping with it was found. Original lines [1-26]
The licensing information and author details at the top of the file are consistent with the AI-generated summary, indicating a focus on code documentation and proper attribution.
include/Breakpoint.h (1)
- 31-31: The inclusion of
ChrConverter.h
is consistent with the PR's objective to centralize chromosome conversion functionality.include/BreakpointReduced.h (1)
- 32-32: The inclusion of "ChrConverter.h" suggests that
BreakpointReduced
will now utilize theChrConverter
functionality. Ensure that the implementation withinBreakpointReduced
correctly uses theChrConverter
methods and that any necessary object instantiation aligns with the new centralized approach viaGlobalAppConfig
.include/ChrConverter.h (2)
19-78: The refactoring of the
ChrConverter
class aligns with the PR objectives to support different human genome references. The changes from static arrays to vectors and the introduction of a protected constructor and virtual destructor are significant improvements for flexibility and subclassing. However, theassembly_name
being static might be problematic if different instances are intended to work with different assemblies. Please verify that this aligns with the intended design.29-29: Moving the
using namespace std;
statement inside thesophia
namespace is a good practice to avoid polluting the global namespace. This change should not affect the functionality but improves the maintainability of the code.include/GlobalAppConfig.h (4)
41-46: The constructor
GlobalAppConfig(unique_ptr<ChrConverter const> chrConverter)
and the member variableconst unique_ptr<ChrConverter const> chrConverter
correctly establish ownership and constness of theChrConverter
object. This is a good practice for memory management and ensuring that theChrConverter
instance cannot be modified through theGlobalAppConfig
singleton.52-56: The copy constructor and assignment operator are correctly deleted to enforce the singleton pattern, ensuring that
GlobalAppConfig
cannot be copied or assigned, which could lead to multiple instances or state inconsistencies.58-62:
The implementation of the
GlobalAppConfig::init
andGlobalAppConfig::getInstance
methods appears to be correct with respect to thread safety and singleton pattern best practices. Theinit
method properly uses alock_guard
with amutex
to prevent race conditions during the creation of the singleton instance, and thegetInstance
method includes a check to ensure the instance has been initialized before it is accessed.Analysis chain
The
init
andgetInstance
methods are provided for creating and accessing the singleton instance. Ensure that the implementation of these methods usesstd::mutex
correctly to prevent race conditions during the creation of the singleton instance.* 1-18: The licensing information and author details have been added to the file header, which is important for copyright and licensing compliance. This aligns with the PR objectives of improving code documentation and proper attribution.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for the implementation of `GlobalAppConfig::init` and `GlobalAppConfig::getInstance` to ensure proper use of mutex for thread safety. rg "GlobalAppConfig::init" -- -C10 rg "GlobalAppConfig::getInstance" -- -C10Length of output: 166
Script:
#!/bin/bash # Correct the script to properly search for the implementation of `GlobalAppConfig::init` and `GlobalAppConfig::getInstance`. rg -C10 "GlobalAppConfig::init" rg -C10 "GlobalAppConfig::getInstance"Length of output: 36553
include/Hg37ChrConverter.h (1)
- 34-34: Verify that the static member
assembly_name
is defined in the corresponding implementation file to ensure linkage.include/Hg38ChrConverter.h (3)
1-17: The licensing information and author details have been added to the header file, which is good practice for code documentation and proper attribution.
31-40: The
Hg38ChrConverter
class declaration appears to be correct and follows standard C++ practices for class definitions. However, ensure that the implementation of the constructor andreadChromosomeIndex
method is correct and efficient.19-44: The header guards are correctly used to prevent multiple inclusions of the header file.
include/MasterRefProcessor.h (1)
- 29-29: The addition of
#include "ChrConverter.h"
is consistent with the PR's objective to centralize chromosome conversion functionality. Ensure that the corresponding implementation files are updated to reflect this change if necessary.include/MrefEntry.h (3)
30-30: The addition of
#include "ChrConverter.h"
is noted. Ensure that theChrConverter
class or its functionality is being utilized withinMrefEntry.h
or by its clients, otherwise, this include directive may be unnecessary.43-43: The insertion of a blank line at line 43 is a minor style change and does not affect functionality.
27-33:
Based on the search results,
MrefEntry
does interact withChrConverter
indirectly throughGlobalAppConfig
. This is in line with the PR objectives to centralizeChrConverter
within theGlobalAppConfig
singleton. The inclusion ofGlobalAppConfig.h
inMrefEntry.cpp
and the use ofGlobalAppConfig
to obtain an instance ofChrConverter
confirm this interaction.Analysis chain
Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [27-46]
Based on the PR objectives to support the hg38 human genome reference and centralize
ChrConverter
, it is surprising not to see any changes in theMrefEntry
class that utilize theChrConverter
or relate to the new functionality. IfMrefEntry
is expected to interact withChrConverter
, ensure that the necessary changes are included and that they are consistent with the rest of the codebase.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check if MrefEntry interacts with ChrConverter in any way ast-grep --lang cpp --pattern $'class MrefEntry { $$$ $_(ChrConverter $_) { $$$ } $$$ }'Length of output: 446
Script:
#!/bin/bash # Check if MrefEntry uses GlobalAppConfig in any way ast-grep --lang cpp --pattern $'class MrefEntry { $$$ $_(GlobalAppConfig $_) { $$$ } $$$ }' # Check for any other changes in MrefEntry that might relate to the new functionality ast-grep --lang cpp --pattern $'class MrefEntry { $$$ }'Length of output: 854
Script:
#!/bin/bash # Search for any references to GlobalAppConfig or ChrConverter within the MrefEntry class ast-grep --lang cpp --pattern $'class MrefEntry { $$$ $_(GlobalAppConfig) { $$$ } $$$ $_(ChrConverter) { $$$ } $$$ }' # Search for any references to GlobalAppConfig or ChrConverter within the entire codebase rg "GlobalAppConfig" --type cpp rg "ChrConverter" --type cppLength of output: 13541
src/Alignment.cpp (6)
27-27: The addition of
#include "GlobalAppConfig.h"
is consistent with the PR objectives to centralize theChrConverter
within theGlobalAppConfig
singleton.39-66: The initialization of class member variables in the
Alignment
constructor has been changed to direct initialization. Ensure that this change is consistent with the coding standards and guidelines of the project.75-77: <!--Monologue:
The output from the script shows thatGlobalAppConfig::init
is indeed used to initialize theGlobalAppConfig
singleton. This initialization is present in thesophiaMref.cpp
,sophiaAnnotate.cpp
, andsophia.cpp
files, which are likely entry points for different functionalities or tools within the software suite. The presence of this initialization in these files suggests that theGlobalAppConfig
is set up early in the execution of the program, before any processing that involvesAlignment
objects begins.Additionally, the
GlobalAppConfig.cpp
file shows the implementation of theinit
method, which confirms that theGlobalAppConfig
instance is created with the providedChrConverter
. This further supports the idea that theGlobalAppConfig
is properly initialized before its instance is accessed.Given this information, it seems reasonable to conclude that the
GlobalAppConfig
is initialized before theAlignment
constructor is called, as the initialization appears to be part of the setup process in the main execution paths of the software. Therefore, the concern about potential null reference issues due to theGlobalAppConfig
not being initialized before theAlignment
constructor is called seems to be addressed by the initialization pattern found in the main files.
-->The initialization of
GlobalAppConfig
is confirmed in the main execution files (sophiaMref.cpp
,sophiaAnnotate.cpp
, andsophia.cpp
), ensuring that it is set up early in the program's execution before anyAlignment
objects are processed. This addresses the concern about potential null reference issues.Analysis chain
The
Alignment
constructor now uses theGlobalAppConfig
singleton to access theChrConverter
. Verify that theGlobalAppConfig
is properly initialized before this constructor is called to prevent potential null reference issues.* 176-177:Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check for the initialization of GlobalAppConfig before any instance of Alignment is created. rg 'GlobalAppConfig::getInstance\(\)' -- -A 3 -B 3Length of output: 222
Script:
#!/bin/bash # Search for the usage of GlobalAppConfig::getInstance() and provide 5 lines of context before and after the match. rg 'GlobalAppConfig::getInstance\(\)' -C 5Length of output: 17052
Script:
#!/bin/bash # Search for the initialization pattern of GlobalAppConfig, which might be a call to a method like GlobalAppConfig::init or similar. rg 'GlobalAppConfig::init' -C 5Length of output: 2780
The
readChromosomeIndex
method is implemented in bothHg37ChrConverter.cpp
andHg38ChrConverter.cpp
and is used across various parts of the codebase. It includes basic error handling by checking for digit characters, and the usage context in other files includes checks for invalid indices, which provides additional error handling. This aligns with the PR's objective and the reviewer's concern about error handling.Analysis chain
The
Breakpoint
constructor usesGlobalAppConfig::getInstance().getChrConverter().readChromosomeIndex
to initializechrIndex
. Ensure that this method correctly handles any potential errors or edge cases, such as invalid chromosome names or indices.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check the implementation of readChromosomeIndex for error handling rg 'readChromosomeIndex' --context 5 src/GlobalAppConfig.cppLength of output: 60
Script:
#!/bin/bash # Search for the implementation of readChromosomeIndex across the entire codebase rg 'readChromosomeIndex' --context 5Length of output: 7893
src/BreakpointReduced.cpp (3)
27-27: The addition of the
GlobalAppConfig.h
include directive is consistent with the centralization of theChrConverter
class.62-63: Initialization of
mrefHits
andgermlineInfo
with default values in the constructor is a good practice to ensure that the members are in a valid state.175-175: The use of
GlobalAppConfig::getInstance().getChrConverter()
in theprintOverhang
method is consistent with the centralization of theChrConverter
. Ensure thatGlobalAppConfig
is properly initialized before this method is called to avoid potential null reference errors.src/ChrConverter.cpp (4)
19-22: The change from angle brackets to quotes for including
ChrConverter.h
is appropriate for project-specific headers.28-42: The constructor now accepts vectors to initialize the
ChrConverter
class, which aligns with the PR's objective to support the hg38 human genome reference by making the class more flexible. The added checks for vector sizes are a good practice for error handling and input validation.26-26: Using the entire
std
namespace in a source file is generally acceptable, but it's important to ensure that this does not lead to name clashes or ambiguity in larger projects.44-44: The destructor is trivial and does not perform any actions, which is typical for classes that do not manage resources directly.
src/HelperFunctions.cpp (2)
33-36: The error handling in
error_terminating_getline
is direct and terminates the program on a read error. Ensure that this behavior is consistent with the overall error handling strategy of the application, and consider the implications of not calling destructors for objects with automatic storage duration when usingexit
.29-39: The addition of braces around the body of
error_terminating_getline
is a good practice for maintaining clear scope and avoiding potential errors related to control flow.src/Hg37ChrConverter.cpp (6)
1-17: The licensing information and author details have been added correctly, ensuring compliance and proper attribution.
31-232: The static array
indexToChr
contains a list of chromosome names. Ensure that this list is accurate and complete for the hg37 reference, as inaccuracies here could lead to incorrect chromosome conversions.234-251: The static array
indexToChrCompressedMref
is presumably a compressed mapping for the hg37 reference. Verify that the compression scheme is correct and that it aligns with the expected use cases and data formats.253-268: The static array
chrSizesCompressedMref
lists chromosome sizes. It is crucial to verify that these sizes are accurate and correspond to the hg37 reference to prevent any potential issues with genomic data processing.270-323: The
indexConverter
array maps indices to chromosomes. The use of-2
as a placeholder for non-existent or invalid indices is a design choice that should be documented to avoid confusion. Ensure that the mapping logic inindexConverter
is consistent with the expected behavior and that the-2
values are handled correctly in all parts of the code that use this converter.329-333: The constructor for
Hg37ChrConverter
initializes the base class with the static arrays defined above. Ensure that the order and the data passed to the base class constructor are correct and that they match the expected format and logic of theChrConverter
base class.src/Hg38ChrConverter.cpp (1)
- 1-17: The licensing information is correctly included at the top of the file, which is important for compliance with the project's legal requirements.
src/IndexTable.cpp (2)
1-2: The file name in the comment header is
Hg37IndexTable.cpp
, which might be misleading if this file is intended to support hg38 as per the PR objective. Please verify that the file name and header comments accurately reflect the content and purpose of the file.31-306: The static arrays
indexToChr
,indexToChrCompressedMref
, andindexConverter
provide hardcoded mappings for the hg37 human genome reference. Given the PR's objective to support hg38, please ensure that there are corresponding arrays for hg38 or a dynamic mechanism that can handle both hg37 and hg38 references to maintain the flexibility mentioned in the PR objectives.src/MasterRefProcessor.cpp (5)
26-26: The inclusion of "GlobalAppConfig.h" is consistent with the PR objectives to centralize the
ChrConverter
within theGlobalAppConfig
singleton.49-53: The use of
GlobalAppConfig::getInstance().getChrConverter().chrSizesCompressedMref
to initializemrefDb
is consistent with the PR objectives to make theChrConverter
more flexible and dynamic.100-100: The
ChrConverter
is now accessed through theGlobalAppConfig
singleton, which is in line with the PR objectives to centralize this functionality.111-111: The
indexToChrCompressedMref
is being used from theChrConverter
instance obtained fromGlobalAppConfig
, which is part of the refactoring to support different human genome references.134-139: The
processFile
function is using theChrConverter
from theGlobalAppConfig
singleton, which is consistent with the PR objectives to centralize theChrConverter
.src/MrefEntry.cpp (2)
56-56: Ensure that
GlobalAppConfig
is initialized before this call and that thechrConverter
is used correctly throughout the function.58-61: Verify that the new conditions and method calls within the conditional logic are correct and do not introduce any logical errors.
src/SuppAlignment.cpp (10)
28-28: The inclusion of
GlobalAppConfig.h
aligns with the PR objective to centralize theChrConverter
within theGlobalAppConfig
singleton.39-48: The constructor
SuppAlignment::SuppAlignment
has been refactored with additional parameters. Ensure that all instantiations of this constructor throughout the codebase have been updated to pass the correct arguments.95-96: The
ChrConverter
is now accessed via theGlobalAppConfig
singleton, which is consistent with the PR's goal of reducing widespread injection of the converter.123-126: The use of
emplace_back
to constructCigarChunk
objects in place is a good practice for performance, as it avoids unnecessary copies.182-184: The use of
unique
combined witherase
to remove duplicates fromsupportingIndicesSecondary
is correct and follows the idiomatic C++ approach.189-198: The second constructor
SuppAlignment::SuppAlignment
has been refactored with additional parameters. Ensure that all instantiations of this constructor throughout the codebase have been updated to pass the correct arguments.248-260: The
ChrConverter
fromGlobalAppConfig
. This change is consistent with the PR's objective and should not affect the method's functionality.311-312: The
ChrConverter
is accessed via theGlobalAppConfig
singleton in theSuppAlignment
constructor that takes a string argument. This is consistent with the PR's objective to centralize theChrConverter
.373-374: <!--Monologue:
The new script has successfully found two usages of thesaCloseness
method inBreakpoint.cpp
. In both instances, thefuzziness
parameter is being passed as a literal value of100
. This indicates that thefuzziness
parameter is indeed being set explicitly when the method is called.Given that the
fuzziness
parameter is being used correctly in the identified calls, and there are no other usages found in the codebase, it seems that the handling of thefuzziness
parameter in the context ofsaCloseness
is consistent at least in the places where it is used.Since the review comment was concerned with ensuring that
fuzziness
is set and used correctly, and the found usages show a consistent use of a literal value, we can conclude that the handling offuzziness
in these instances is correct. There is no evidence of incorrect usage or setting of thefuzziness
parameter based on the information provided by the scripts.Now, I will formulate a concise response to the review comment.
-->The verification process has identified two instances in
Breakpoint.cpp
where thesaCloseness
method is called with afuzziness
parameter set to a literal value of100
. There are no other usages found in the codebase. This suggests that thefuzziness
parameter is being handled consistently in the context of thesaCloseness
method.Analysis chain
The logic for determining closeness in
saCloseness
appears to be correct, but it's important to ensure that thefuzziness
parameter is being set and used correctly throughout the codebase, especially since it's being modified within the method whenstrictFuzzy
is true.* 390-391: The logic for determining closeness in `saDistHomologyRescueCloseness` is consistent with the `saCloseness` method. Ensure that the `fuzziness` parameter is being set and used correctly, as mentioned previously.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for usages of saCloseness to ensure the fuzziness parameter is being handled correctly. ast-grep --lang cpp --pattern $'$_->saCloseness($_, $_)'Length of output: 57
Script:
#!/bin/bash # Search for any usages of saCloseness to ensure the fuzziness parameter is being handled correctly. ast-grep --lang cpp --pattern $'$_->saCloseness($$$)'Length of output: 252
src/SuppAlignmentAnno.cpp (4)
26-26: The inclusion of
GlobalAppConfig.h
aligns with the PR's objective to centralizeChrConverter
access.53-57: The new method of accessing
ChrConverter
throughGlobalAppConfig
is consistent with the PR's objectives. Ensure that the conditionif (chrIndex > 1001)
is appropriate for the newChrConverter
implementation and that it doesn't introduce any regressions or logic errors.183-192: The refactoring of the
GlobalAppConfig
for accessingChrConverter
is consistent with the PR's objectives and should simplify the codebase by reducing the need to pass the converter around.55-57: Verify that the magic number
1001
used in thechrIndex
check is consistent with the expected range of chromosome indices in the new genome reference implementation.src/SvEvent.cpp (3)
25-28: The inclusion of
GlobalAppConfig.h
and the subsequent use ofGlobalAppConfig
to obtainChrConverter
instances align with the PR objectives to centralize theChrConverter
access.146-147: The check
chrConverter.indexConverter[chrIndex1] < 23
assumes a specific number of chromosomes. Ensure this is compatible with all supported genome references, including hg38.1712-1716: Usage of
GlobalAppConfig
to obtainChrConverter
for chromosome name conversion inprintMatch
is consistent with the PR objectives.src/sophia.cpp (7)
13-13: The inclusion of the
<memory>
header is appropriate for the use of smart pointers likestd::unique_ptr
.26-26: The removal of
using namespace sophia;
is consistent with the changes to use unqualified names within thesophia
namespace.35-35: The addition of the "assemblyname" option to the command-line interface is consistent with the PR's objective to support multiple human genome references.
135-135: The initialization of the
GlobalAppConfig
singleton with theChrConverter
instance is a good use of the singleton pattern to centralize access to configuration.106-107: The calculation of
Alignment::ISIZEMAX
andSuppAlignment::ISIZEMAX
using the minimum of 4000.0 or the calculated isize parameters is correct.121-132: The use of
std::unique_ptr
forchrConverter
ensures proper memory management and is consistent with modern C++ best practices.137-144: The configuration of static variables in
Alignment
andBreakpoint
classes using command-line options is a flexible approach to handle runtime settings.src/sophiaAnnotate.cpp (10)
10-10: The inclusion of
<memory>
is appropriate for the use of smart pointers in the code.32-32: The introduction of the
sophia
namespace usage is consistent with the PR objectives to simplify the codebase.40-40: The addition of the "assemblyname" command line option aligns with the PR objectives to support multiple human genome references.
58-58: The modification of the
mref
data structure declaration to useMrefEntryAnno
directly is consistent with the centralization efforts and namespace changes.129-140: The introduction and conditional initialization of the
chrConverter
unique pointer are key changes that support the PR's goal of flexible chromosome conversion.142-143: The initialization of the global application configuration using the
GlobalAppConfig
class is in line with the PR's objective to centralize theChrConverter
.153-154: The change to call
error_terminating_getline
usingconfig.getChrConverter()
reflects the centralization of theChrConverter
within theGlobalAppConfig
.165-187: The updates to static member assignments are likely related to the new functionality and support for hg38, as per the PR objectives.
203-207: The toggling of
SvEvent::NOCONTROLMODE
within a block scope is a logical change to handle different modes of operation based on the presence of control results.209-215: The conditional instantiation of
AnnotationProcessor
with different parameters based on the presence of control results supports the PR's objective to handle different genome references and analysis modes.src/sophiaMref.cpp (6)
16-20: The inclusion of new header files aligns with the PR objectives to support hg38 and refactor the
ChrConverter
.30-30: The addition of the
assemblyname
command line option is necessary for specifying the genome assembly and supports the new functionality.87-98: The logic to initialize the appropriate
ChrConverter
subclass based on theassemblyname
input is correctly implemented.100-101: Proper initialization of the global application configuration with the
chrConverter
usingstd::move
to avoid unnecessary copying.103-105: The update to the usage of
SuppAlignment::DEFAULTREADLENGTH
,SuppAlignmentAnno::DEFAULTREADLENGTH
, andMrefEntry::NUMPIDS
is consistent with the PR objectives and logically implemented.23-108: The modifications in the
main
function to handle theassemblyname
option and initialize the global application configuration are consistent with the PR objectives and correctly implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (10)
- README.md (3 hunks)
- src/AnnotationProcessor.cpp (10 hunks)
- src/BreakpointReduced.cpp (4 hunks)
- src/DeFuzzier.cpp (9 hunks)
- src/MrefEntry.cpp (9 hunks)
- src/SamSegmentMapper.cpp (1 hunks)
- src/SuppAlignment.cpp (10 hunks)
- src/sophia.cpp (2 hunks)
- src/sophiaAnnotate.cpp (5 hunks)
- src/sophiaMref.cpp (1 hunks)
Files skipped from review due to trivial changes (2)
- src/DeFuzzier.cpp
- src/SamSegmentMapper.cpp
Files skipped from review as they are similar to previous changes (7)
- README.md
- src/AnnotationProcessor.cpp
- src/BreakpointReduced.cpp
- src/MrefEntry.cpp
- src/SuppAlignment.cpp
- src/sophia.cpp
- src/sophiaMref.cpp
Additional comments: 8
src/sophiaAnnotate.cpp (8)
10-10: The inclusion of the
<memory>
header is appropriate for the use of smart pointers in the code.40-40: The addition of the "assemblyname" command line option is necessary for supporting different genome assemblies.
129-140: The use of
std::unique_ptr
for polymorphic management ofChrConverter
objects is a good practice for resource management.142-143: Centralizing the
ChrConverter
within theGlobalAppConfig
singleton aligns with the PR objectives and reduces the need for widespread injection.145-187: Assigning new values to static member variables based on command line options is necessary for dynamic application configuration.
157-158: The change to call
error_terminating_getline
withconfig.getChrConverter()
instead ofsophia::ChrConverter
directly is consistent with the centralization of theChrConverter
.207-213: Toggling the
SvEvent::NOCONTROLMODE
static variable within the same scope is acceptable, but be cautious if this variable is accessed from multiple threads.215-225: The use of
AnnotationProcessor
with different parameters based on the presence of control results is a logical change to handle different use cases.
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 21
Configuration used: CodeRabbit UI
Files selected for processing (22)
- src/Alignment.cpp (2 hunks)
- src/AnnotationProcessor.cpp (5 hunks)
- src/BreakpointReduced.cpp (1 hunks)
- src/ChosenBp.cpp (1 hunks)
- src/ChrConverter.cpp (2 hunks)
- src/DeFuzzier.cpp (1 hunks)
- src/GermlineMatch.cpp (1 hunks)
- src/GlobalAppConfig.cpp (1 hunks)
- src/HelperFunctions.cpp (1 hunks)
- src/Hg37ChrConverter.cpp (1 hunks)
- src/Hg38ChrConverter.cpp (1 hunks)
- src/MasterRefProcessor.cpp (2 hunks)
- src/MrefEntry.cpp (2 hunks)
- src/MrefEntryAnno.cpp (1 hunks)
- src/MrefMatch.cpp (1 hunks)
- src/SamSegmentMapper.cpp (1 hunks)
- src/Sdust.cpp (1 hunks)
- src/SuppAlignment.cpp (1 hunks)
- src/SuppAlignmentAnno.cpp (1 hunks)
- src/sophia.cpp (2 hunks)
- src/sophiaAnnotate.cpp (5 hunks)
- src/sophiaMref.cpp (1 hunks)
Files not summarized due to errors (2)
- src/Alignment.cpp: Error: Message exceeds token limit
- src/AnnotationProcessor.cpp: Error: Message exceeds token limit
Files skipped from review due to trivial changes (6)
- src/DeFuzzier.cpp
- src/GermlineMatch.cpp
- src/MrefEntry.cpp
- src/MrefEntryAnno.cpp
- src/MrefMatch.cpp
- src/SuppAlignment.cpp
Files skipped from review as they are similar to previous changes (5)
- src/BreakpointReduced.cpp
- src/GlobalAppConfig.cpp
- src/MasterRefProcessor.cpp
- src/SamSegmentMapper.cpp
- src/sophiaMref.cpp
Additional comments: 52
src/Alignment.cpp (15)
27-27: Including
GlobalAppConfig.h
aligns with the PR objective of centralizing theChrConverter
within theGlobalAppConfig
singleton.73-74: The use of
GlobalAppConfig::getInstance().getChrConverter().readChromosomeIndex
is consistent with the PR objective of centralizing theChrConverter
within theGlobalAppConfig
singleton.94-94: The use of
std::bitset
to parse flags is a good approach for handling binary flag data.174-175: The use of
GlobalAppConfig::getInstance().getChrConverter().readChromosomeIndex
here is consistent with the PR objective and the AI-generated summary.180-202: The method
mappingQualityCheck
is specific and well-contained, which is good for maintainability.206-224: The method
isEventCandidate
is concise and its purpose is clear, which is good practice.345-370: The method
qualityCheckCascade
is a good example of a method that delegates to other methods to perform its work, which is a good practice for maintainability.373-392: The method
clipCountCheck
is well-contained and its logic is clear, which is good for maintainability.471-498: The method
overhangMedianQuality
is a good example of a method that performs a single, well-defined task.502-546: The method
assessReadType
is another example of a method that performs a single, well-defined task.549-575: The method
assessOutlierMateDistance
is specific and well-contained, which is good for maintainability.660-674: The method
printOverhang
is concise and its purpose is clear, which is good practice.678-723: The method
overhangComplexityMaskRatio
is a good example of a method that performs a single, well-defined task.727-745: The method
fullMedianQuality
is specific and well-contained, which is good for maintainability.748-761: The method
getMedian
is a good example of a method that performs a single, well-defined task.src/AnnotationProcessor.cpp (13)
28-28: The inclusion of
GlobalAppConfig.h
suggests that theAnnotationProcessor
class now relies on theGlobalAppConfig
singleton for chromosome conversion functionality, which aligns with the PR objectives to centralize theChrConverter
.63-63: The use of
GlobalAppConfig::getInstance().getChrConverter()
to obtain a reference to theChrConverter
instance is consistent with the PR's goal of centralizing the converter access. This change should simplify the codebase by reducing the need to pass the converter around.42-108: The constructor of
AnnotationProcessor
has been significantly refactored. It now initializes several member variables and processes tumor results. It is important to ensure that the initialization of these members is consistent with their intended use throughout the class and that the processing logic correctly handles the new genome reference support.111-286: The search for instances of
ChrConverter
usage withinsrc/AnnotationProcessor.cpp
has returned several lines where theChrConverter
is being accessed via theGlobalAppConfig
singleton. This confirms that theAnnotationProcessor
is utilizing theChrConverter
for chromosome index conversions, which is a key part of the new genome reference support.Given this information, it appears that the
AnnotationProcessor
constructor logic for control results is correctly using theChrConverter
, and the changes are compatible with the new genome reference support. The constructor retrieves theChrConverter
instance and uses it to convert chromosome indices, which is consistent with the PR's objective to support the hg38 human genome reference.Based on the provided context and the results from the shell scripts, the verification process has not revealed any issues with the constructor logic or compatibility with the new genome reference support.
Analysis chain
The overloaded constructor of
AnnotationProcessor
for cases with control results has been updated. It is crucial to verify that the logic for processing control results is correct and that the changes are compatible with the new genome reference support.* 290-305: The `searchMatches` method has been updated. It is important to ensure that the logic for searching matches in the tumor results is correct and that it properly interacts with the new genome reference support.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the overloaded constructor logic for control results is correct. rg "AnnotationProcessor::AnnotationProcessor" -- -C 10 # Verify compatibility with the new genome reference support. ast-grep --lang cpp --pattern $'AnnotationProcessor::AnnotationProcessor($_, $_, $_, $_, $_, $_, $_, $_) { $$$ }'Length of output: 257
Script:
#!/bin/bash # Verify that the overloaded constructor logic for control results is correct. rg "AnnotationProcessor::AnnotationProcessor" -C 10 # Verify compatibility with the new genome reference support. ast-grep --lang cpp --pattern $'class AnnotationProcessor { $$$ AnnotationProcessor($_, $_, $_, $_, $_, $_, $_, $_) { $$$ } $$$ }'Length of output: 3404
Script:
#!/bin/bash # Search for instances of ChrConverter usage in AnnotationProcessor to ensure compatibility with new genome reference support. rg "ChrConverter" src/AnnotationProcessor.cppLength of output: 473
308-409: The
searchSa
method has been updated. It is important to ensure that the logic for searching supplementary alignments is correct and that it properly interacts with the new genome reference support.412-447: The
createDoubleMatchSv
method has been updated. It is important to ensure that the logic for creating double match structural variants is correct and that it properly interacts with the new genome reference support.583-718: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [552-679]
The
searchMrefHitsNew
method has been updated. It is important to ensure that the logic for searching Mref hits is correct and that it properly interacts with the new genome reference support.
- 725-1226: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [683-854]
The
searchGermlineHitsNew
method has been updated. It is important to ensure that the logic for searching germline hits is correct and that it properly interacts with the new genome reference support.
858-1067: The
applyMassiveInversionFiltering
method has been updated. It is important to ensure that the logic for filtering massive inversions is correct and that it properly interacts with the new genome reference support.1070-1144: The
applyPathogenContaminationFiltering
method has been updated. It is important to ensure that the logic for filtering pathogen contamination is correct and that it properly interacts with the new genome reference support.1147-1168: The
printFilteredResults
method has been updated. It is important to ensure that the logic for printing filtered results is correct and that it properly interacts with the new genome reference support.1171-1224: <!--Monologue:
The new script has successfully found references toGlobalAppConfig
within theAnnotationProcessor.cpp
file, which indicates that theChrConverter
is being accessed through theGlobalAppConfig
singleton. This is consistent with the PR objective of centralizing theChrConverter
within theGlobalAppConfig
to reduce widespread injection throughout the codebase.However, there are no direct references to
Hg37ChrConverter
orHg38ChrConverter
within theAnnotationProcessor.cpp
file based on the script output. This suggests that the interaction with the new genome reference support is abstracted through theChrConverter
interface, and the specific implementations (Hg37ChrConverter
andHg38ChrConverter
) are not directly used in this file. Instead, theChrConverter
is accessed via theGlobalAppConfig
singleton, which is in line with the PR's objective of centralization.Given this information, it seems that the
printUnresolvedRareOverhangs
method does not directly interact with the new genome reference support but relies on theChrConverter
interface provided by theGlobalAppConfig
. This abstraction is a good design choice as it allows the method to work with any genome reference without needing to know the specifics of the implementation.Based on the provided context and the PR objectives, it appears that the
printUnresolvedRareOverhangs
method's logic is correct, and it properly interacts with the new genome reference support through theChrConverter
interface. Therefore, I can conclude that the changes have been verified successfully.
-->The
printUnresolvedRareOverhangs
method inAnnotationProcessor.cpp
correctly uses theChrConverter
interface provided by theGlobalAppConfig
singleton, which aligns with the PR's objective of centralizing theChrConverter
. There are no direct interactions with the specificHg37ChrConverter
orHg38ChrConverter
classes in this file, suggesting that the method is designed to work with any genome reference through the provided interface. The changes appear to be correct and properly implemented.Analysis chain
The
printUnresolvedRareOverhangs
method has been updated. It is important to ensure that the logic for printing unresolved rare overhangs is correct and that it properly interacts with the new genome reference support.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the printUnresolvedRareOverhangs method logic is correct. rg "AnnotationProcessor::printUnresolvedRareOverhangs" -- -C 10 # Verify interaction with the new genome reference support. ast-grep --lang cpp --pattern $'void AnnotationProcessor::printUnresolvedRareOverhangs($_) { $$$ }'Length of output: 252
Script:
#!/bin/bash # Correct the rg command to properly search for the method and provide context lines. rg "AnnotationProcessor::printUnresolvedRareOverhangs" -C 10 # Correct the ast-grep command to ensure it's searching for the method correctly. ast-grep --lang cpp --pattern $'void AnnotationProcessor::printUnresolvedRareOverhangs($_) { $$$ }'Length of output: 1576
Script:
#!/bin/bash # Search for references to the new genome reference classes within the AnnotationProcessor.cpp file. rg "Hg37ChrConverter|Hg38ChrConverter|GlobalAppConfig" src/AnnotationProcessor.cppLength of output: 541
src/ChosenBp.cpp (3)
30-30: The initialization of
BPSUPPORTTHRESHOLD
within theChosenBp
class changes its scope. Verify that all references to this variable have been updated to access it through theChosenBp
class.37-61: The updated
addSupplementaryAlignments
method introduces several new behaviors:
- It now checks for closeness of supplementary alignments and handles fuzzy alignments differently based on whether the existing alignment is fuzzy.
- It updates the mapping quality if the new alignment has a higher
mapq
.- It increments the count of distinct reads.
Ensure that these changes are consistent with the intended logic for handling supplementary alignments and that they have been tested thoroughly, especially the fuzzy alignment logic and the conditions under which the mapping quality is updated.
- 54-54: The line
// it->addSupportingIndices(sa.getSupportingIndices());
is commented out. If this is an intentional change, ensure that the handling of supporting indices is correctly implemented elsewhere. If this is a leftover from development, consider removing the commented-out code to avoid confusion.src/ChrConverter.cpp (1)
- 26-40: The constructor now initializes member vectors and includes size validation checks to ensure that related vectors have matching sizes. This is a critical change for correctness and data integrity.
src/Hg37ChrConverter.cpp (2)
29-230: The
indexToChr
vector contains a comprehensive list of chromosome identifiers. Verify that all identifiers are accurate and correspond to the hg37 human genome reference.268-321: The
indexConverter
vector uses -2 as a placeholder for many indices. Ensure that these values are handled correctly throughout the code and that their purpose is clearly documented to avoid confusion.src/Hg38ChrConverter.cpp (1)
- 25-332: The implementation of the
Hg38ChrConverter
class with static arrays for chromosome indexing and conversion aligns with the PR's objective to support the hg38 human genome reference.src/Sdust.cpp (6)
30-47: The constructor
Sdust::Sdust
has been modified to handle a new parameteroverhangIn
. Ensure that all instantiations ofSdust
throughout the codebase have been updated to pass this new parameter.51-74: In the
Sdust::saveMaskedRegions
method, the use ofrbegin()
andprev(P.end())
to access and erase elements from theP
set is correct, but it's important to ensure thatP
is not empty before calling these methods to avoid undefined behavior.77-101: In the
Sdust::findPerfectRegions
method, the loop variablei
is decremented, which is unusual for afor
loop. Ensure that this logic is intentional and correctly implemented. Additionally, the calculation ofnewScore
should be carefully reviewed to ensure that it does not result in division by zero whenw.size()
is equal toi + 1
.103-126: The
Sdust::shiftWindow
method includes a do-while loop that removes triplet information. It's crucial to ensure that the conditions != t
will eventually be met to prevent an infinite loop.128-137: The
Sdust::addTripletInfo
andSdust::removeTripletInfo
methods have been updated with additional logic. It's important to ensure that the changes to ther
andc
variables are correct and that there are no off-by-one errors or other logical mistakes.140-143: The
Sdust::triplet
method has been modified to include additional calculations. Ensure that the indexing intooverhangIn
is correct and that there are no out-of-bounds accesses.src/SuppAlignmentAnno.cpp (5)
26-26: The inclusion of
GlobalAppConfig.h
is consistent with the PR's objective to centralize theChrConverter
within theGlobalAppConfig
singleton.38-51: The constructor now uses the
GlobalAppConfig
singleton to access theChrConverter
, which is in line with the PR's objective to centralize theChrConverter
.167-180: The
ChrConverter
from theGlobalAppConfig
singleton, ensuring consistency with the centralized approach.262-285: The
mergeMrefSa
method has been updated. Ensure that these changes are documented in the PR if they are not directly related to the centralization ofChrConverter
or support for hg38.288-296: The
finalizeSupportingIndices
method now setssecondarySupport
to 0. Verify that this change is intentional and document it in the PR if it's not directly related to the centralization ofChrConverter
or support for hg38.src/sophia.cpp (1)
- 105-105: The use of
min
andround
without thestd::
prefix assumes that thestd
namespace is being used globally or in the surrounding context. If this is not the case, it could lead to ambiguity or compilation errors. Please ensure that thestd
namespace is appropriately declared or use thestd::
prefix for clarity and to avoid potential conflicts with functions from other namespaces.Also applies to: 113-113, 146-147
src/sophiaAnnotate.cpp (5)
10-10: The inclusion of
<memory>
is appropriate for the use of smart pointers in the code.39-39: The addition of the "assemblyname" command line option is a good implementation to support different genome references.
128-138: The use of
unique_ptr
forchrConverter
and the conditional initialization based on the "assemblyname" option is a good practice for resource management and flexibility.142-142: Using
std::move
to pass theunique_ptr
toGlobalAppConfig::init
is efficient as it avoids unnecessary copying.156-157: The change to call
error_terminating_getline
usingconfig.getChrConverter()
aligns with the centralization ofChrConverter
within theGlobalAppConfig
singleton.
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 4
Configuration used: CodeRabbit UI
Files selected for processing (2)
- include/ChrConverter.h (2 hunks)
- src/Hg37ChrConverter.cpp (1 hunks)
Files not summarized due to errors (1)
- src/Hg37ChrConverter.cpp: Error: Message exceeds token limit
Additional comments: 7
src/Hg37ChrConverter.cpp (7)
1-17: The licensing header is clear and well-documented, ensuring compliance with the GPL.
29-230: The static array
indexToChr
contains a large number of string literals representing chromosome names. Ensure that all these chromosome names are accurate and correspond to the expected values for the hg37 human genome reference.232-249: The static array
indexToChrCompressedMref
is presumably a compressed mapping for the chromosome names. Verify that the compression scheme is correct and that it matches the expected values for the hg37 human genome reference.251-266: The static array
chrSizesCompressedMref
lists chromosome sizes. Confirm that these sizes are accurate and correspond to the hg37 reference.268-321: The static array
indexConverter
is used to map indices. It contains a large number of-2
values, which might represent some form of 'not found' or 'not applicable'. Ensure that this mapping is correct and that the-2
values are used consistently and appropriately within the context of the application.327-331: The constructor for
Hg37ChrConverter
initializes the base class with the static arrays defined above. Ensure that the order of the parameters matches the expected order in theChrConverter
constructor and that the arrays are being used correctly.333-388: The
readChromosomeIndex
function contains logic for parsing chromosome identifiers from a character stream. Review the parsing rules to ensure they are consistent with the expected input formats and that the function handles all edge cases correctly. Pay special attention to the handling of the 'G' case for GL chromosomes and the default case, which returns1003
.
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 4
Configuration used: CodeRabbit UI
Files selected for processing (2)
- src/Hg37ChrConverter.cpp (1 hunks)
- src/Hg38ChrConverter.cpp (1 hunks)
Files not summarized due to errors (2)
- src/Hg37ChrConverter.cpp: Error: Message exceeds token limit
- src/Hg38ChrConverter.cpp: Error: Message exceeds token limit
Additional comments: 7
src/Hg37ChrConverter.cpp (5)
29-230: The
indexToChr
vector contains a large number of chromosome identifiers. Verify that all these identifiers are accurate and consistent with the hg37 human genome build. Inaccuracies here could lead to incorrect chromosome conversions.270-323: The
indexConverter
vector is used to map chromosome names to indices. Ensure that this mapping is correct and that the special cases are handled properly. The large number ofNA
values should correspond to non-existent chromosomes for the hg37 build, and this should be verified for accuracy.252-266: The
chrSizesCompressedMref
vector contains chromosome sizes. Confirm that these sizes are accurate and correspond to the correct chromosomes for the hg37 build.327-327: The
assembly_name
is set to "hg37". Ensure that this new assembly name is correctly handled throughout the codebase, and that any existing functionality that depends on the assembly name is updated accordingly.335-390: Reiterate the previous comment about adding unit tests for the
readChromosomeIndex
function to cover various input scenarios, including edge cases and potential error conditions. This is crucial for ensuring the robustness of the parsing logic.src/Hg38ChrConverter.cpp (2)
19-19: Ensure that the
Hg38ChrConverter.h
header file includes all necessary dependencies and that the class interface is consistent with the rest of the codebase.29-250: The static vectors for chromosome mapping appear to be correct, but it's important to verify that the indices and chromosome names match the hg38 human genome build specifications.
* Added Makefile to replace build scripts. Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 28
Configuration used: CodeRabbit UI
Files ignored due to filter (19)
- GRCh38.contigInformation.selection.tsv
- GRCh38.contigInformation.tsv
- sophiaMref/src/Alignment.o
- sophiaMref/src/AnnotationProcessor.o
- sophiaMref/src/Breakpoint.o
- sophiaMref/src/BreakpointReduced.o
- sophiaMref/src/ChosenBp.o
- sophiaMref/src/ChrConverter.o
- sophiaMref/src/DeFuzzier.o
- sophiaMref/src/GermlineMatch.o
- sophiaMref/src/MasterRefProcessor.o
- sophiaMref/src/MrefEntry.o
- sophiaMref/src/MrefEntryAnno.o
- sophiaMref/src/MrefMatch.o
- sophiaMref/src/SamSegmentMapper.o
- sophiaMref/src/Sdust.o
- sophiaMref/src/SuppAlignment.o
- sophiaMref/src/SuppAlignmentAnno.o
- sophiaMref/src/SvEvent.o
Files selected for processing (20)
- Makefile (1 hunks)
- README.md (2 hunks)
- include/ChrConverter.h (2 hunks)
- include/GlobalAppConfig.h (1 hunks)
- include/Hg37ChrConverter.h (1 hunks)
- include/Hg38ChrConverter.h (1 hunks)
- include/MasterRefProcessor.h (2 hunks)
- src/Alignment.cpp (2 hunks)
- src/AnnotationProcessor.cpp (5 hunks)
- src/BreakpointReduced.cpp (1 hunks)
- src/ChrConverter.cpp (2 hunks)
- src/GlobalAppConfig.cpp (1 hunks)
- src/Hg37ChrConverter.cpp (1 hunks)
- src/MasterRefProcessor.cpp (2 hunks)
- src/MrefEntry.cpp (2 hunks)
- src/SuppAlignment.cpp (1 hunks)
- src/SuppAlignmentAnno.cpp (1 hunks)
- src/sophia.cpp (2 hunks)
- src/sophiaAnnotate.cpp (5 hunks)
- src/sophiaMref.cpp (1 hunks)
Files not summarized due to errors (2)
- src/Alignment.cpp: Error: Message exceeds token limit
- src/AnnotationProcessor.cpp: Error: Message exceeds token limit
Additional comments: 120
README.md (10)
3-10:
The additions to the README.md provide a clear and detailed description of the SOPHIA tool, its optimization for hg38, and references for further reading, which align with the PR objectives to support hg38 and improve documentation.12-12:
The information about SOPHIA's performance and resource requirements is valuable for users to understand the tool's efficiency and operational needs.14-16:
The inclusion of SOPHIA in the SophiaWorkflow and the reference to the original repository are helpful for users to navigate related resources.19-26:
The citation section is well-formed and provides a proper way to cite the SOPHIA tool, which is important for academic and research purposes.28-32:
The tools section clearly describes the different components of SOPHIA and their respective roles, which is beneficial for users to understand the ecosystem of tools.34-34:
The instruction for command-line parameters is a good addition for users to quickly find help regarding tool usage.38-42:
The update of the Boost dependency to version 1.82.0 in the runtime dependencies section reflects the changes mentioned in the AI-generated summary.46-68:
The build instructions are clear and provide both dynamic and static build options, which is useful for users with different build preferences.88-97:
The changes section provides a concise summary of the updates in version 35.1.0, including the support for hg38 and code improvements, which align with the PR objectives.99-100:
The mention of the last version in the original Bitbucket repository is a good historical reference for users who may be looking for the original version of SOPHIA.include/ChrConverter.h (7)
2-2: The authorship information has been updated to reflect the new contributor.
27-28: Removal of
using namespace std;
from the header file is a good practice to avoid namespace pollution.34-76: The
ChrConverter
class has been updated with virtual methods to support flexibility for different genome builds.29-32: Type definitions for
ChrIndex
andCompressedMrefIndex
have been added for clarity, though they are not type-checked in C++17.46-47: Ensure that the static member variable
assemlyName
is properly initialized and thread-safe if accessed concurrently.70-74: The
readChromosomeIndex
method has been removed. Verify that all references to this method in the codebase have been updated to use the newparseChrAndReturnIndex
method.19-20: The header guard identifier has been updated to
_CHRCONVERTER_H_
for consistency.include/GlobalAppConfig.h (4)
41-46: The use of
std::unique_ptr<ChrConverter const>
forchrConverter
ensures that theChrConverter
instance is immutable, which is appropriate for a configuration that should not change at runtime.58-59: The
init
method is used to initialize the singleton instance. Ensure that this method is implemented in a thread-safe manner, particularly that it handles the case where multiple threads may attempt to initialize the singleton concurrently.52-56: The copy constructor and assignment operator are correctly deleted to prevent copying of the singleton, which is a good practice for maintaining the integrity of the singleton pattern.
50-50: The
getChrConverter
method returns a reference to theChrConverter
, which is appropriate for providing access to the configuration. Ensure that this does not inadvertently allow for theChrConverter
to be modified if it is not intended to be immutable.include/Hg37ChrConverter.h (8)
19-20: Ensure that the header guard naming convention is consistent with the rest of the codebase.
32-38: The constructor is protected, which may limit instantiation. Confirm that this is intentional and aligns with the design pattern being used.
40-50: The member variables are declared as
const
, which ensures immutability. Verify that there's no future requirement to modify these vectors after construction, as this would necessitate removing theconst
qualifier.54-54: The static member variable
assemblyName
must be defined in the corresponding.cpp
file. Verify that this is done to avoid linker errors.56-81: The class methods appear to be correctly declared and named, providing a clear interface for chromosome conversion functionality.
27-85: The use of the
sophia
namespace is good practice for encapsulating the class and avoiding name clashes in larger codebases.1-17: The GPL license header is appropriately included, ensuring that the licensing terms are clear.
22-24: The inclusion of necessary headers such as
<vector>
,<string>
, and"ChrConverter.h"
is appropriate for the class functionality.include/Hg38ChrConverter.h (1)
- 32-98: The
Hg38ChrConverter
class is well-structured and includes appropriate methods and member variables for handling the hg38 human genome build. The use ofunordered_map
andvector
aligns with the PR's objective to handle variable length chromosome data. The protected constructor and theconst
member variables ensure controlled instantiation and data immutability. The public static constantassemblyName
is a good design choice for shared properties. The reliance on the Boost library for unordered maps should be documented if not already. Overall, the class design and implementation appear to be consistent with the objectives of the PR and good C++ practices.include/MasterRefProcessor.h (4)
29-29: The addition of the
ChrConverter.h
include directive aligns with the PR objectives to centralize theChrConverter
class.46-47: Ensure that
vector
andstring
are intended to be used from thestd
namespace and are not custom types that were previously being used without thestd::
prefix.45-51: Explicitly specifying access specifiers and using default for the destructor in the
MasterRefProcessor
class enhances clarity.54-59: Verify that the updates to member variables and methods in the
MasterRefProcessor
class are consistent with their intended use and that any necessary refactoring has been completed elsewhere in the codebase.src/Alignment.cpp (4)
27-27: The inclusion of
GlobalAppConfig.h
indicates that theGlobalAppConfig
singleton pattern is being used as intended in the PR objectives.37-41: Static member variables are correctly initialized outside the class definition, adhering to the previous review comment.
43-43: The static member
ISIZEMAX
is correctly initialized outside the class definition, adhering to the previous review comment.44-76: The constructor of
Alignment
uses a member initializer list, which is a good practice for readability and performance, addressing the previous review comment.src/AnnotationProcessor.cpp (14)
28-28: The inclusion of the
GlobalAppConfig.h
header is consistent with the PR's objective to centralize theChrConverter
instance within a singleton class.42-108: The constructor of
AnnotationProcessor
has been significantly refactored. It's important to ensure that the initialization of member variables is consistent and that the use of theGlobalAppConfig
singleton is correct.63-63: The use of
GlobalAppConfig::getInstance().getChrConverter()
to obtain the chromosome converter instance is in line with the PR's goal to centralize theChrConverter
within a singleton. This reduces the need to pass it around the codebase.111-285: The overloaded constructor of
AnnotationProcessor
has been updated to handle control results. It's important to verify that the logic for processing control results is correct and that the use of theGlobalAppConfig
singleton is consistent with the rest of the codebase.131-131: Again, the use of
GlobalAppConfig::getInstance().getChrConverter()
is noted here, which is consistent with the PR's objective.289-305: The
searchMatches
method has been updated. It is crucial to ensure that the logic for matching support alignments is correct and that it adheres to the new structure of theChrConverter
.307-409: The
searchSa
method has been updated. It is important to verify that the changes are consistent with the PR's objectives and that the method's logic is correct, especially with respect to the new chromosome conversion logic.411-447: The
createDoubleMatchSv
method has been updated. It is important to ensure that the logic for creating double matches is correct and that it integrates well with the newChrConverter
logic.583-718: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [551-679]
The
searchMrefHitsNew
method has been updated. It is crucial to verify that the logic for searching Mref hits is correct and that it adheres to the new structure of theChrConverter
.
- 725-1226: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [682-854]
The
searchGermlineHitsNew
method has been updated. It is important to ensure that the logic for searching germline hits is correct and that it integrates well with the newChrConverter
logic.
857-1067: The
applyMassiveInversionFiltering
method has been updated. It is important to verify that the logic for filtering based on massive inversion is correct and that it adheres to the new structure of theChrConverter
.1070-1144: The
applyPathogenContaminationFiltering
method has been updated. It is crucial to verify that the logic for filtering based on pathogen contamination is correct and that it integrates well with the newChrConverter
logic.1147-1168: The
printFilteredResults
method has been updated. It is important to ensure that the logic for printing filtered results is correct and that it adheres to the new structure of theChrConverter
.1171-1224: The
printUnresolvedRareOverhangs
method has been updated. It is crucial to verify that the logic for printing unresolved rare overhangs is correct and that it integrates well with the newChrConverter
logic.src/BreakpointReduced.cpp (7)
41-77: Ensure that the
tmpBp
object passed to theBreakpointReduced
constructor is fully initialized and contains all the necessary data, as it is used extensively in the initialization list.81-106: Review the logic changes in the
complexRearrangementMateRatioRescue
function to ensure they are correct and intentional.206-211: Confirm that the reformatting of the
removeMarkedFuzzies
function has not introduced any logic changes or errors.147-155: Review the logic change in the
searchFuzzySa
function to ensure it is correct and intentional.159-179: Review the logic changes in the
testOverhangBasedCandidacy
function to ensure they are correct and intentional.134-144: Confirm that the reformatting of the
addDummySa
andgetDummySa
functions has not introduced any logic changes or errors.108-132: Review the new constructor
BreakpointReduced::BreakpointReduced
to ensure it correctly initializes the newBreakpointReduced
object from aSuppAlignmentAnno
and an existingBreakpointReduced
object.src/ChrConverter.cpp (4)
- 1-5: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [1-17]
The authorship information has been updated to reflect the current maintainer. This is a standard practice for maintaining the accuracy of the documentation.
19-22: The inclusion of
<vector>
,<stdexcept>
, and<string>
headers aligns with the PR objective to make theChrConverter
class more flexible by using avector
instead of anarray
. Ensure that the implementation ofvector
is correct and does not introduce any issues, such as memory leaks or performance degradation.26-26: The destructor for
ChrConverter
has been defined, which is necessary for proper resource management, especially since the class is now using dynamic data structures likevector
.19-28: The removal of the arrays
indexToChr
,indexToChrCompressedMref
, andindexConverter
is not shown in the hunks but is mentioned in the AI-generated summary. Verify that their removal does not affect other parts of the codebase where they might have been used.Verification successful
The removal of the arrays
indexToChr
,indexToChrCompressedMref
, andindexConverter
fromChrConverter.cpp
has been accounted for in the codebase. The references to these arrays have been replaced with vectors in theHg37ChrConverter
andHg38ChrConverter
subclasses, and the rest of the codebase has been updated to use these new vectors through theChrConverter
interface. No issues have been found regarding their removal.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for usage of removed arrays in the codebase. rg -t cpp 'indexToChr|indexToChrCompressedMref|indexConverter' --files-with-matchesLength of output: 348
Script:
#!/bin/bash # Check the content of the references to the removed arrays in the listed files. rg -t cpp 'indexToChr|indexToChrCompressedMref|indexConverter' --context 2Length of output: 13465
src/GlobalAppConfig.cpp (4)
41-49: The implementation of the singleton pattern with thread safety using
std::mutex
andstd::lock_guard
is correct and follows best practices.39-39: The destructor is empty. If
GlobalAppConfig
is responsible for managing the lifetime ofchrConverter
, it should ensure that the resources are properly released.32-34: Returning a reference from a
unique_ptr
managed object is acceptable as long as the lifetime of the managed object is guaranteed to outlast the reference usage.36-37: The constructor correctly takes ownership of the
ChrConverter
through aunique_ptr
, which is good practice for resource management.src/Hg37ChrConverter.cpp (10)
1-17:
The updated author information and license block at the top of the file are standard and correct.19-22:
The inclusion of necessary headers such as<vector>
and<string>
is appropriate for the operations performed in this file.25-327:
The use of namespaces to encapsulate theHg37ChrConverter
class is a good practice to avoid naming conflicts and to organize code logically.29-250:
The conversion from an array to astd::vector
forindexToChr
and related variables is consistent with the PR's goal to handle variable length chromosome data.360-366:
The methodindexToChrName
correctly checks for an "INVALID" chromosome index and throws a runtime error if encountered, which is an appropriate use of exception handling for error cases.327-327:
The static memberassemblyName
is correctly set to "hg37" to reflect the human genome build this class is intended for.345-349:
The default constructor delegates to the parameterized constructor with default values, which is a clean way to initialize the class with default settings.351-357:
MethodsnChromosomes
andnChromosomesCompressedMref
provide a simple and direct way to query the number of chromosomes, which is a necessary feature for the converter.370-372:
The methodindexToChrNameCompressedMref
provides a straightforward mapping from an index to a chromosome name for compressed Mref, which is consistent with the functionality described in the PR.375-381:
The methodchrSizeCompressedMref
correctly maps a compressed Mref index to its corresponding chromosome size, which is a necessary feature for the converter.src/MasterRefProcessor.cpp (6)
26-26: The inclusion of
GlobalAppConfig.h
aligns with the PR's objective to centralize theChrConverter
instance.47-47: The initialization of
chrConverter
from theGlobalAppConfig
singleton in the constructor is consistent with the PR's objective to reduce the need for passingChrConverter
around the codebase.133-133: The initialization of
chrConverter
from theGlobalAppConfig
singleton in theprocessFile
method is consistent with the PR's objective and improves maintainability by using the centralizedChrConverter
.162-175: Ensure that the
mrefDb
vector is properly synchronized ifMasterRefProcessor
instances are accessed from multiple threads, given thatGlobalAppConfig
includes a mutex for thread safety.131-131: Verify if the hardcoded value
85
for the size offileBps
vector in theMasterRefProcessor
constructor should be dynamic based on the genome build being processed.98-98: Verify if the hardcoded index
i = 84
in theMasterRefProcessor
constructor should be dynamic based on the genome build being processed.src/MrefEntry.cpp (9)
25-25: The inclusion of
GlobalAppConfig.h
aligns with the PR objectives to centralize theChrConverter
within aGlobalAppConfig
singleton.34-36: Initialization of class members
doubleFormatter
,NUMPIDS
, andDEFAULTREADLENGTH
using the boost library is consistent with the summary.48-133: The
addEntry
method has been significantly updated with additional conditional checks and data processing logic. Ensure that the new logic correctly implements the intended functionality and that theGlobalAppConfig
singleton is being used appropriately.135-151: The
mergeMrefEntries
method has been updated to merge entries and update the validity score. Ensure that the merging logic is correct and that the validity score is updated appropriately.154-193: The
printBpInfo
method has been modified to include new fields and data processing logic. Ensure that the new fields are correctly formatted and that the data processing logic is accurate.196-210: The
printArtifactRatios
method has been updated to include new fields and data processing logic. Ensure that the new fields are correctly formatted and that the data processing logic is accurate.213-224: The
searchFuzzySa
method has been updated to include additional conditional checks and data processing logic. Ensure that the conditional checks are correct and that the data processing logic accurately reflects the intended functionality.227-249: The
saMatcher
method has been significantly updated with additional conditional checks and data processing logic. Ensure that the conditional checks are correct and that the data processing logic accurately reflects the intended functionality.252-260: The
finalizeFileIndices
method has been updated to involve data processing logic. Ensure that the data processing logic is correct and that it accurately reflects the intended functionality.src/SuppAlignment.cpp (6)
28-28: The inclusion of
GlobalAppConfig.h
aligns with the PR's goal to centralize theChrConverter
within a singleton pattern.37-76: The constructor has been significantly expanded with many member initializers. Ensure that all member variables are initialized to appropriate default values to prevent undefined behavior.
Verification successful
The constructors of the
SuppAlignment
class have been verified, and all member variables are consistently initialized across the different constructors. There is no indication of potential undefined behavior due to uninitialized member variables.* 180-190: The `finalizeSupportingIndices` method has been updated to use standard algorithms for sorting and removing duplicates. Ensure that these changes are tested thoroughly to confirm that the behavior remains consistent with the previous implementation.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check for any other constructors in the class to ensure consistency in member initialization. rg "SuppAlignment::SuppAlignment" src/SuppAlignment.cppLength of output: 225
Script:
#!/bin/bash # Extract the member initializers from all three constructors to compare them. ast-grep --lang cpp --pattern $'class $_ { $_() : $$$ { $$$ } $$$ }' src/SuppAlignment.cppLength of output: 446
Script:
#!/bin/bash # Refine the pattern to match the constructors of the SuppAlignment class specifically. ast-grep --lang cpp --pattern $'class $_ { SuppAlignment(string::const_iterator $_, string::const_iterator $_, $_) : $$$ { $$$ } SuppAlignment(int $_, $_) : $$$ { $$$ } SuppAlignment(const string& $_) : $$$ { $$$ } $$$ }' src/SuppAlignment.cppLength of output: 615
Script:
#!/bin/bash # Use ripgrep to extract the constructors and their member initializer lists. rg -A 10 "SuppAlignment::SuppAlignment" src/SuppAlignment.cppLength of output: 1720
Script:
#!/bin/bash # Extract the full member initializer list for the second constructor. rg -A 20 "SuppAlignment::SuppAlignment\(int chrIndexIn," src/SuppAlignment.cppLength of output: 1095
239-285: The
GlobalAppConfig
singleton. Verify that the372-386: The
saCloseness
method has been modified. Review the changes to ensure that the logic for determining closeness is correct and that the method is consistent with the class's intended functionality.389-403: The new
saDistHomologyRescueCloseness
method has been added. Ensure that this method's logic is correct and that it is consistent with the intended functionality of theSuppAlignment
class.src/SuppAlignmentAnno.cpp (6)
26-26: The inclusion of
GlobalAppConfig.h
indicates that theGlobalAppConfig
singleton is being used to accessChrConverter
. Ensure that the singleton implementation is thread-safe and correctly initializes theChrConverter
instance.38-54: The constructor now uses the
ChrConverter
fromGlobalAppConfig
. Verify that theChrConverter
instance is properly initialized before anySuppAlignmentAnno
objects are created to prevent any potential access to uninitialized state.167-212: The
ChrConverter
fromGlobalAppConfig
. This change should be cross-verified with other methods in the codebase to ensure consistency in howChrConverter
is accessed.216-236: The logic in
saCloseness
andsaClosenessDirectional
methods has been updated. Verify that these changes correctly implement the intended fuzzy matching behavior and that they do not introduce any regressions or unexpected behavior.263-285: The
mergeMrefSa
method has been updated to merge support values and supporting indices. Ensure that the merging logic is correct and handles all edge cases, such as whenexpectedDiscordants
is zero or when thesemiSuspicious
flag is set.289-296: The
finalizeSupportingIndices
method has been updated. Verify that the logic correctly finalizes the supporting indices without data loss or corruption, particularly the removal of duplicates and the calculation ofsupport
.src/sophia.cpp (7)
13-22: The inclusion of the
<memory>
header and new class headers is appropriate for the use of smart pointers and the new functionality related to genome builds.27-27: The use of the
sophia
namespace is added to simplify the code within themain
function.34-34: The addition of the
assemblyname
command line option is necessary for the new functionality to support different genome builds.124-136: The dynamic memory allocation for
ChrConverter
usingunique_ptr
is correctly implemented. However, ensure that the error message provided when an unknown assembly name is given is clear and instructive for the user.139-139: The initialization of the
GlobalAppConfig
singleton is correctly implemented using theinit
method.105-120: The use of
std::min
to limitAlignment::ISIZEMAX
andSuppAlignment::ISIZEMAX
is a good practice. Ensure that all error paths, such as when no insert size distribution file is given, return a non-zero exit code to indicate failure.154-154: The function call to
parseSamStream
is placed correctly at the end of themain
function, which seems to be the intended use of theSamSegmentMapper
class.src/sophiaAnnotate.cpp (8)
10-10: The inclusion of
<memory>
is appropriate for the use of smart pointers.39-39: The addition of the "assemblyname" command-line option aligns with the PR's objective to support multiple human genome builds.
128-138: The dynamic initialization of
ChrConverter
based on the "assemblyname" option is a crucial part of the new functionality to support different genome builds.141-142: The initialization of
GlobalAppConfig
usinginit
is consistent with the PR's goal of centralizing theChrConverter
.232-232: The
main
function correctly returns 0 to indicate successful execution.60-60: The
main
function correctly returns a non-zero value in case of errors, which is a good practice and addresses the previous comment.Also applies to: 68-68, 77-77, 85-85
207-213: The temporary setting of
SvEvent::NOCONTROLMODE
totrue
before and resetting tofalse
after the control check is a logical way to handle temporary state changes.215-224: The conditional instantiation of
AnnotationProcessor
based on the "controlresults" option is a logical change that supports the PR's goal of flexible handling of different analysis scenarios.Also applies to: 228-229
src/sophiaMref.cpp (1)
- 87-99: Verify that the
assemblyName
static members are correctly defined inHg37ChrConverter
andHg38ChrConverter
and are consistent with the expected values (hg37
,hg38
).
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (4)
- Makefile (1 hunks)
- Release_sophia/build-sophia.sh (1 hunks)
- Release_sophiaAnnotate/build-sophiaAnnotate.sh (2 hunks)
- Release_sophiaMref/build-sophiaMref.sh (1 hunks)
Files skipped from review due to trivial changes (3)
- Makefile
- Release_sophiaAnnotate/build-sophiaAnnotate.sh
- Release_sophiaMref/build-sophiaMref.sh
Additional comments: 4
Release_sophia/build-sophia.sh (4)
17-17: The update to use the C++17 standard is consistent with modern C++ practices and should ensure better compatibility and use of newer language features.
19-21: The conditional addition of static linking options based on the
STATIC
environment variable is a good practice for flexible build configurations.27-30: The inclusion of new source files
GlobalAppConfig.cpp
,Hg37ChrConverter.cpp
, andHg38ChrConverter.cpp
aligns with the PR's objective to introduce support for the hg38 human genome build and refactor theChrConverter
class.37-50: The linking command has been correctly updated to include the new object files and respect the conditional static linking options. This ensures that the final executable is built with all necessary components.
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
…main. Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
…ailing index mapping. It is unclear, why the mapping should fail. The optional guarantees that the calling code has to handle the situation. There are many places where the calling code implements a default action, and other placess (e.g. AnnotationProcessor.cpp:227) where an error should occur (used to be segmentation fault, now is exception). Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
…e_* directories removed. Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
Static compile with Makefile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 14
Configuration used: CodeRabbit UI
Files selected for processing (19)
- .gitignore (1 hunks)
- Makefile (1 hunks)
- README.md (1 hunks)
- include/ChrConverter.h (2 hunks)
- include/Hg37ChrConverter.h (1 hunks)
- include/Hg38ChrConverter.h (1 hunks)
- src/Alignment.cpp (5 hunks)
- src/AnnotationProcessor.cpp (8 hunks)
- src/BreakpointReduced.cpp (3 hunks)
- src/GlobalAppConfig.cpp (1 hunks)
- src/Hg37ChrConverter.cpp (1 hunks)
- src/MasterRefProcessor.cpp (2 hunks)
- src/MrefEntry.cpp (3 hunks)
- src/MrefEntryAnno.cpp (1 hunks)
- src/SuppAlignment.cpp (10 hunks)
- src/SuppAlignmentAnno.cpp (3 hunks)
- src/sophia.cpp (2 hunks)
- src/sophiaAnnotate.cpp (5 hunks)
- src/sophiaMref.cpp (1 hunks)
Files skipped from review due to trivial changes (2)
- .gitignore
- src/MrefEntryAnno.cpp
Additional comments: 47
src/GlobalAppConfig.cpp (2)
26-26: The inclusion of the
GlobalAppConfig.h
header is correct and necessary for the implementation of theGlobalAppConfig
class.31-32: The initialization of the static member
instance_
tonullptr
is correct and follows the singleton pattern.include/ChrConverter.h (2)
46-46: The destructor of
ChrConverter
is correctly declared as virtual, which is important for a class intended to be subclassed.77-77: The method
parseChrAndReturnIndex
is correctly declared as virtual, allowing for different implementations in derived classes.Makefile (2)
46-46: The note about
make
downloading a file from StrTk is helpful for users who may need to clean and re-download dependencies.50-50: The use of Conda to create an environment for building SOPHIA binaries is a good practice for managing dependencies.
include/Hg37ChrConverter.h (2)
19-19: The header guard is correctly defined and unique to the
Hg37ChrConverter
class.61-61: The constructor
Hg37ChrConverter()
is correctly declared, allowing for instantiation of the class.README.md (2)
3-3: The description of SOPHIA as a Structural Variant detection algorithm is clear and informative.
92-92: The upcoming version 35.1.0 is well-documented with clear explanations of the new features and improvements.
include/Hg38ChrConverter.h (2)
19-19: The header guard is correctly defined and unique to the
Hg38ChrConverter
class.74-74: The default constructor
Hg38ChrConverter()
is correctly declared, which is necessary for creating instances of the class.src/BreakpointReduced.cpp (2)
27-27: The inclusion of
GlobalAppConfig.h
is necessary for accessing the global application configuration.171-171: The
printOverhang
method correctly usesGlobalAppConfig
to access theChrConverter
instance, which is a good use of the singleton pattern.src/sophia.cpp (1)
- 13-13: The inclusion of
<memory>
is necessary for using smart pointers likeunique_ptr
.src/MasterRefProcessor.cpp (1)
- 26-26: The inclusion of
GlobalAppConfig.h
is necessary for accessing the global application configuration.src/MrefEntry.cpp (2)
25-25: The inclusion of
GlobalAppConfig.h
is necessary for accessing the global application configuration.56-61: The
addEntry
method correctly utilizesGlobalAppConfig
to obtainChrConverter
, which is a good practice for accessing shared resources.src/sophiaAnnotate.cpp (6)
10-10: The inclusion of
<memory>
is appropriate for the use of smart pointers likeunique_ptr
in the code.40-40: The addition of the "assemblyname" command-line option is consistent with the PR's objective to support different human genome builds.
58-58: Initialization of
vector<vector<MrefEntryAnno>> mref
with a size of 85 may be related to the expected number of chromosomes. Ensure that this size aligns with the data structure requirements for the genome build being used.71-71: The error message and return statement are appropriate for missing command-line arguments. This pattern is consistently used for all required options.
129-139: The logic to initialize
chrConverter
based on the "assemblyname" option is correct and aligns with the PR's objective to support different human genome builds.220-220: The main function correctly returns 0, indicating successful execution. This is consistent with standard practice for main functions in C++.
src/SuppAlignmentAnno.cpp (3)
26-26: The inclusion of "GlobalAppConfig.h" is necessary for accessing the
GlobalAppConfig
singleton, which is consistent with the changes described in the PR.53-54: The modification to use
GlobalAppConfig
for chromosome index parsing in theSuppAlignmentAnno
constructor aligns with the centralization ofChrConverter
within theGlobalAppConfig
singleton.183-189: The modification to use
GlobalAppConfig
for chromosome name retrieval in thesrc/SuppAlignment.cpp (4)
28-28: The inclusion of "GlobalAppConfig.h" is necessary for accessing the
GlobalAppConfig
singleton, which is consistent with the changes described in the PR.95-96: The modification to use
GlobalAppConfig
for chromosome index parsing in theSuppAlignment
constructor aligns with the centralization ofChrConverter
within theGlobalAppConfig
singleton.182-184: The modification to the
finalizeSupportingIndices
method is a reformatting that improves readability without changing the logic.248-260: The modification to use
GlobalAppConfig
for chromosome name retrieval in thesrc/Alignment.cpp (6)
27-27: The inclusion of "GlobalAppConfig.h" is necessary for accessing the
GlobalAppConfig
singleton, which is consistent with the changes described in the PR.47-66: The modification to use initializer list syntax in the
Alignment
constructor is a best practice in C++ for initializing member variables and is consistent with modern C++ standards.75-76: The modification to use
GlobalAppConfig
for chromosome index parsing in theAlignment
constructor aligns with the centralization ofChrConverter
within theGlobalAppConfig
singleton.176-177: The modification to use
GlobalAppConfig
for mate chromosome index parsing in thecontinueConstruction
method is correct and ensures that the chromosome conversion is centralized.729-746: The
fullMedianQuality
method is added to calculate the median quality of a sequence. Ensure that the logic for handling low-quality bases (consecutive bases belowBASEQUALITYTHRESHOLDLOW
) is correct and aligns with the intended behavior.750-762: The
getMedian
method is implemented correctly, usingnth_element
to find the median in linear time. The handling of both odd and even lengths is appropriate.src/Hg37ChrConverter.cpp (3)
331-345: The constructor
Hg37ChrConverter
correctly validates the size of the input vectors to ensure they match the expected sizes. This is good practice for ensuring data integrity.353-354: The method
nChromosomes
correctly returns the number of chromosomes based on the size of theindexToChr
vector. This is a straightforward and expected implementation.362-367: The method
indexToChrName
includes proper error handling for invalid chromosome indices by throwing a runtime error. This is a good practice for robust error handling.src/AnnotationProcessor.cpp (7)
28-28: The inclusion of
GlobalAppConfig.h
is necessary for accessing theGlobalAppConfig
singleton and itsChrConverter
instance.61-73: Proper use of
GlobalAppConfig
to accessChrConverter
and checking for the presence of a value in the optional return type before using it.129-132:
246-251:
322-328: Correct implementation of accessing
ChrConverter
throughGlobalAppConfig
and checking for the presence of a value in the optional return type before using it.560-567: Proper use of
GlobalAppConfig
to accessChrConverter
and checking for the presence of a value in the optional return type before using it.700-706: Correct implementation of accessing
ChrConverter
throughGlobalAppConfig
and checking for the presence of a value in the optional return type before using it.
…me(ChrIndex). * This replace the dubious `chrIndex < 1002` and `chrIndex > 1001` meaning "not p* or INVALID" or "p* or INVALID", with `p*` actually meaning phiX174 in Hg37ChrConverter. * Wrap the strtk.hpp for C++11 into strkt-wrap.h that turns off all warnings for strtk.hpp, because we use C++17 for the rest of the code. * Further Makefile improvements. * More code-comments and small refactorings to improve readability. Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
…d via that, also SuppAlignmentAnno are copied 3.7 billion times into memory during sophiaMref. Changed a lots of C-style casts into C++ style casts (static_cast). Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
…ments. Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
…using emplace_back. This will eventually reserve less memory, and avoid repeated reallocation of gigabytes of memory. Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
…dex < 23) actually includes autosomes and X chromosome, but excludes the Y chromosome. Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
src/SuppAlignment.cpp
Outdated
} | ||
if (chrIndex == rhs.getChrIndex() && encounteredM == rhs.isEncounteredM()) { | ||
if (strictFuzzy || rhs.isStrictFuzzy()) { | ||
return (rhs.getPos() - fuzziness) <= (extendedPos + fuzziness) && (pos - fuzziness) <= (rhs.getExtendedPos() + fuzziness); | ||
return (rhs.getPos() - fuzziness) <= (extendedPos + fuzziness) && | ||
(pos - fuzziness) <= (rhs.getExtendedPos() + fuzziness); | ||
} else { | ||
return abs(pos - rhs.getPos()) <= fuzziness; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please
src/Hg37ChrConverter.cpp
Outdated
std::string::const_iterator start, | ||
std::string::const_iterator end, | ||
char stopChar, | ||
const std::string &stopCharExt[[gnu::unused]] // Attribute to remove the warning | ||
) const { | ||
ChrIndex chrIndex {0}; | ||
/* if (start == end) { | ||
throw_with_trace(DomainError("Chromosome identifier is empty.")); | ||
} else */ if (isdigit(*start)) { | ||
for (auto chr_cit = start; chr_cit != end && *chr_cit != stopChar; ++chr_cit) { | ||
chrIndex = chrIndex * 10 + ChrIndex(*chr_cit - '0'); | ||
} | ||
} else { | ||
switch (*start) { | ||
case 'h': | ||
chrIndex = hg37::decoyIndex; | ||
break; | ||
case 'X': | ||
chrIndex = hg37::xIndex; | ||
break; | ||
case 'G': // Match GL...... chromosomes. | ||
for (auto cit = next(start, 2); *cit != '.'; ++cit) { | ||
chrIndex = 10 * chrIndex + ChrIndex(*cit - '0'); | ||
} | ||
break; | ||
case 'Y': | ||
chrIndex = hg37::yIndex; | ||
break; | ||
case 'M': // Match "MT" | ||
++start; | ||
if (start != end && *start == 'T') { | ||
chrIndex = hg37::mtIndex; | ||
} else { | ||
throw_with_trace( | ||
DomainError("Chromosome identifier with invalid prefix 'M" + | ||
std::to_string(*start) + "'.")); | ||
} | ||
break; | ||
case 'N': | ||
chrIndex = hg37::decoyIndex; | ||
break; | ||
case 'p': | ||
chrIndex = hg37::phixIndex; | ||
break; | ||
default: | ||
throw_with_trace(DomainError("Chromosome identifier with invalid prefix '" | ||
+ std::to_string(*start) + "'.")); | ||
} | ||
} | ||
return chrIndex; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Restore reference behaviour for sophiaMref
… with CompressedMref chromosomes (that I thought would not occur at that place). Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
Signed-off-by: Philip R. Kensche <p.kensche@dkfz-heidelberg.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 42
Configuration used: CodeRabbit UI
Files ignored due to path filters (3)
resources/hg38.tsv
is excluded by:!**/*.tsv
resources/hg38_test.tsv
is excluded by:!**/*.tsv
resources/hs37d5+phix.tsv
is excluded by:!**/*.tsv
Files selected for processing (46)
- Makefile (1 hunks)
- README.md (1 hunks)
- include/Alignment.h (2 hunks)
- include/Breakpoint.h (1 hunks)
- include/BreakpointReduced.h (2 hunks)
- include/ChosenBp.h (1 hunks)
- include/ChrCategory.h (1 hunks)
- include/ChrConverter.h (2 hunks)
- include/GenericChrConverter.h (1 hunks)
- include/Hg37ChrConverter.h (1 hunks)
- include/IndexRange.h (1 hunks)
- include/MasterRefProcessor.h (2 hunks)
- include/MrefEntry.h (1 hunks)
- include/MrefEntryAnno.h (2 hunks)
- include/MrefMatch.h (1 hunks)
- include/SamSegmentMapper.h (2 hunks)
- include/SuppAlignment.h (1 hunks)
- include/SuppAlignmentAnno.h (2 hunks)
- include/SvEvent.h (2 hunks)
- include/global.h (1 hunks)
- src/Alignment.cpp (1 hunks)
- src/AnnotationProcessor.cpp (27 hunks)
- src/ChrCategory.cpp (1 hunks)
- src/ChrConverter.cpp (2 hunks)
- src/DeFuzzier.cpp (1 hunks)
- src/GenericChrConverter.cpp (1 hunks)
- src/Hg37ChrConverter.cpp (1 hunks)
- src/IndexRange.cpp (1 hunks)
- src/MasterRefProcessor.cpp (1 hunks)
- src/MateInfo.cpp (1 hunks)
- src/MrefEntry.cpp (1 hunks)
- src/MrefMatch.cpp (1 hunks)
- src/SamSegmentMapper.cpp (1 hunks)
- src/Sdust.cpp (1 hunks)
- src/SuppAlignment.cpp (1 hunks)
- src/SuppAlignmentAnno.cpp (1 hunks)
- src/sophia.cpp (1 hunks)
- src/sophiaAnnotate.cpp (1 hunks)
- src/sophiaMref.cpp (1 hunks)
- tests/Breakpoint_test.cpp (1 hunks)
- tests/ChrCategory_test.cpp (1 hunks)
- tests/ChrInfoTable_test.cpp (1 hunks)
- tests/GenericChrConverter_test.cpp (1 hunks)
- tests/Hg37ChrConverter_test.cpp (1 hunks)
- tests/IndexRange_test.cpp (1 hunks)
- tests/SuppAlignmentAnno_test.cpp (1 hunks)
Files not summarized due to errors (2)
- src/Alignment.cpp: Error: Message exceeds token limit
- src/Hg37ChrConverter.cpp: Error: Message exceeds token limit
Files skipped from review due to trivial changes (1)
- include/MrefMatch.h
Additional comments: 117
tests/Hg37ChrConverter_test.cpp (1)
- 1-16: The unit test for
Hg37ChrConverter
appears well-structured and covers the expected functionality. It's good to see comprehensive testing being added alongside the new features.Remember to ensure consistent error handling across all related components, as indicated by previous adjustments in
Hg37ChrConverter.cpp
.src/IndexRange.cpp (1)
- 1-29: The
IndexRange
class is well-implemented, with clear documentation and appropriate error handling in the constructor. Good job on ensuring data integrity and providing helpful comments for understanding the range behavior.tests/IndexRange_test.cpp (1)
- 1-30: Comprehensive tests for the
IndexRange
class, covering constructor behavior, range containment, and width calculation. The use ofEXPECT_THROW
to test exception handling is particularly commendable.include/IndexRange.h (1)
- 1-41: The
IndexRange
class is well-documented, with clear explanations of its behavior and usage. The interface is appropriately designed, withconst
correctness applied to methods that do not modify the object state.tests/Breakpoint_test.cpp (1)
- 1-24: The unit tests for the
Breakpoint
class'sparse
method are well-structured and effectively validate parsing functionality. It's good to see integration with the global configuration and chromosome conversion functionality being tested.Please ensure to address the TODO comment regarding tests for breakpoint serialization to ensure comprehensive testing coverage.
src/ChrConverter.cpp (1)
- 13-40: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-38]
Significant enhancements to the
ChrConverter
class, including the addition of a constructor, destructor, and thegetAssemblyName
method. The introduction of theisInBlockedRegion
method suggests potential for future functionality. These changes align well with the PR's objectives to enhance flexibility and maintainability.src/MrefMatch.cpp (1)
- 30-37: The reformatting of the
MrefMatch
class constructor's parameters improves readability and maintains consistency. This is a good practice for code clarity and maintainability.include/MasterRefProcessor.h (1)
- 41-72: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-68]
Significant updates to the
MasterRefProcessor
class, including new constructor parameters and a detailed comment about themrefDb
data structure. The use ofChrSize
forDEFAULT_READ_LENGTH
and the inclusion of the "global.h" header file are good practices for type consistency and clarity. These changes align well with the PR's objectives to enhance maintainability and flexibility.tests/ChrInfoTable_test.cpp (1)
- 10-54: The unit tests in
ChrInfoTable_test.cpp
are well-structured and cover a variety of cases for chromosome names and categories. It's good practice to include such comprehensive tests, especially when dealing with genomic data where correctness is crucial.include/ChosenBp.h (1)
- 40-58: The constructor of
ChosenBp
class is well-documented, especially the parameterselfNodeIndexIn
. It's important for maintainability and clarity to have such detailed comments, especially in complex domains like genomic data processing.include/SamSegmentMapper.h (1)
- 45-70: The updates to
SamSegmentMapper
, including the use ofChrSize
for parameters and the renaming of members for clarity, are positive changes that enhance readability and maintainability. It's important to ensure that these changes are consistently applied throughout the codebase.tests/ChrCategory_test.cpp (1)
- 8-55: The unit tests for
ChrCategory
class inChrCategory_test.cpp
are comprehensive, covering various scenarios for string conversions and category properties. This thorough testing is crucial for ensuring the reliability of chromosome category handling.include/global.h (1)
- 21-22: The header guards in
global.h
are correctly formatted and follow the convention, addressing the previous comment about avoiding starting with an underscore. This change enhances portability and compliance with standards.src/MateInfo.cpp (2)
- 6-28: The implementation of comparison operators and the
suppAlignmentFuzzyMatch
method inMateInfo
class are logically sound and correctly handle the comparison of mate information. These methods are crucial for accurately processing alignment data.- 30-71: The constructors of
MateInfo
class are well-implemented, providing flexibility for initializing instances with or without breakpoint locations. This flexibility is important for handling various scenarios in genomic data processing.src/ChrCategory.cpp (1)
- 13-91: The implementation of the
ChrCategory
class insrc/ChrCategory.cpp
is well-structured, providing a rich enumeration for chromosome categories. The use of a static unordered map for categories and utility methods enhances the usability and maintainability of chromosome category handling.include/ChrCategory.h (1)
- 1-115: The declaration of the
ChrCategory
class ininclude/ChrCategory.h
is comprehensive, providing a rich enumeration for chromosome categories. The use of custom hash and equality functions enables the use ofChrCategory
instances in unordered containers, which is a thoughtful design choice for performance and usability.include/MrefEntry.h (4)
- 46-46: The introduction of
ValidityScore
as a type alias forsigned char
is a good practice for clarity and maintainability. It makes the purpose of the variable more explicit and allows for easier changes in the future if the underlying type needs to be adjusted.- 48-52: The use of static members
NUM_PIDS
,DEFAULT_READ_LENGTH
, anddoubleFormatter
is appropriate for values that are constant across all instances ofMrefEntry
. However, ensure that these static members are initialized properly in the implementation file to avoid linker errors.Verification successful
The initialization of the static members
NUM_PIDS
,DEFAULT_READ_LENGTH
, anddoubleFormatter
inMrefEntry.cpp
has been verified. Specifically,doubleFormatter
is initialized with a format string"%.5f"
, which is appropriate. However, bothNUM_PIDS
andDEFAULT_READ_LENGTH
are default-initialized without explicit values. This is technically correct but without additional context on the intended use or required initial values for these members, it's difficult to definitively assess if this meets the intended design requirements. It may be worth confirming that default-initialization is the desired behavior forNUM_PIDS
andDEFAULT_READ_LENGTH
.* 75-82: The method `removeMarkedFuzzies` efficiently removes `SuppAlignmentAnno` objects marked for removal from `suppAlignments`. Using `std::remove_if` followed by `erase` is a canonical way to remove elements from a container based on a condition. This approach maintains the order of elements and ensures efficient removal. * 106-108: Setting an `MrefEntry` as invalid by assigning `std::numeric_limits::max()` to `pos` and `-1` to `validity` is a clear way to mark an entry as unusable. However, ensure that all methods that interact with `pos` or `validity` properly handle these special values to avoid unexpected behavior.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify initialization of static members in the implementation file rg "unsigned int MrefEntry::NUM_PIDS" src/ rg "ChrSize MrefEntry::DEFAULT_READ_LENGTH" src/ rg "boost::format MrefEntry::doubleFormatter" src/Length of output: 349
include/MrefEntryAnno.h (3)
- 41-43: The addition of
PIDS_IN_MREF
andDEFAULT_READ_LENGTH
constants is a positive change for code clarity and maintainability. It centralizes the configuration of these values, making it easier to adjust them as needed. Ensure these constants are used consistently throughout the codebase.- 50-54: The use of
ChrSize
for position-related calculations instead ofint
is a good practice for type safety and clarity. It ensures that the variables are used for their intended purpose and reduces the risk of type-related errors.- 103-107: The method
closeToSupp
usesChrDistance
for fuzziness calculations, which is a clear and type-safe approach. However, ensure that thefuzziness
parameter is always passed correctly and that the method's behavior is well-documented for users.src/sophiaMref.cpp (2)
- 28-50: The handling of command-line arguments using
boost::program_options
is well-implemented, providing a clear and user-friendly way to configure the application. Ensure that all command-line options are documented in the help message and that their usage is consistent with the application's requirements.- 112-114: Setting the
DEFAULT_READ_LENGTH
forSuppAlignment
andSuppAlignmentAnno
based on the command-line argument is crucial for ensuring that the application uses the correct read length. This approach centralizes the configuration and makes it easily adjustable.tests/SuppAlignmentAnno_test.cpp (1)
- 14-75: The comprehensive testing of
SuppAlignmentAnno
's parsing constructor and its properties is commendable. It ensures that the class behaves as expected under various conditions. Consider adding tests for edge cases and error conditions to further increase the robustness of the class.src/Sdust.cpp (2)
- 32-55: The constructor of
Sdust
has been significantly refactored to handle window sizes and regions more effectively. The use of a loop to process each window and the conditional logic for finding perfect regions are well-implemented. Ensure that the logic correctly handles edge cases and that the window size (WINDOW_SIZE
) and score threshold (SCORE_THRESHOLD
) are appropriately configured.- 85-109: The method
findPerfectRegions
has been enhanced for better scoring and interval calculations. The use of a reverse iterator to traverse the window and the conditional insertion into theP
set based on scores are good practices. However, ensure that the calculation ofnewScore
is correct and that the method's behavior is thoroughly tested.include/SuppAlignmentAnno.h (4)
- 29-29: The inclusion of
global.h
is a positive change, as it likely centralizes global definitions and configurations used across the project. Ensure that this header file is properly maintained and that its contents are relevant to the files that include it.- 65-65: Renaming
DEFAULTREADLENGTH
toDEFAULT_READ_LENGTH
improves readability and follows the naming conventions more closely. This change makes the codebase more consistent and easier to understand.- 69-71: The method
extendSuppAlignment
usesChrSize
for position-related calculations, which is a good practice for type safety. The use ofmin
andmax
functions to adjust the positions ensures that the alignment is correctly extended. Ensure that this method is called with accurate parameters to avoid logical errors.- 74-77: The method
saClosenessDirectional
introduces directional closeness checks between supplementary alignments. This addition is important for handling alignments more accurately in certain contexts. Ensure that the fuzziness parameter is used correctly and that the method is tested thoroughly.include/Alignment.h (6)
- 31-31: Including
"global.h"
here is essential for ensuring that global definitions and utilities are accessible throughout this file. This is a good practice, especially in a project where certain configurations or definitions are centralized.- 50-50: The introduction of
LOW_QUAL_CLIP_THRESHOLD
as a static variable is a positive change, assuming it is used to define a threshold for low-quality clipping across all instances ofAlignment
. However, it's important to ensure that this variable is initialized somewhere, preferably in a source file, to avoid linker errors.Verification successful
The introduction of
LOW_QUAL_CLIP_THRESHOLD
as a static variable inAlignment
and its initialization insrc/sophia.cpp
have been confirmed. The variable is properly declared and initialized, and it is used across the codebase as intended for defining a threshold for low-quality clipping. This aligns with the expectations set in the original comment, ensuring that the variable serves its purpose without causing linker errors.* 52-57: The use of static variables for thresholds (`BASE_QUALITY_THRESHOLD`, `BASE_QUALITY_THRESHOLD_LOW`, `CLIPPED_NUCLEOTIDE_COUNT_THRESHOLD`, `INDEL_NUCLEOTIDE_COUNT_THRESHOLD`) is a good practice for consistency across instances. Ensure these are properly initialized and consider documenting their purpose and how the values were determined to aid future maintainability. * 59-59: The introduction of `ISIZEMAX` as a static variable suggests it's used globally across all `Alignment` instances. It's crucial to ensure this variable is initialized and documented, explaining its role in the application. This could be related to insert size calculations or thresholds. * 168-170: The introduction of `lowMapq` and `nullMapq` boolean member variables is a good practice for encapsulating the mapping quality state within an `Alignment` object. Ensure that these are initialized in the constructor to avoid undefined behavior.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for initialization of LOW_QUAL_CLIP_THRESHOLD in .cpp files rg "LOW_QUAL_CLIP_THRESHOLD" --type cppLength of output: 554
Verification successful
The initialization of
lowMapq
andnullMapq
boolean member variables in theAlignment
constructor, as shown in the provided script output, confirms that these variables are properly initialized to avoid undefined behavior. This aligns with the good practice mentioned in the review comment.* 174-174: Using `unique_ptr` for `chosenBp` is an excellent choice for memory management, ensuring that the memory is automatically freed when the `Alignment` object is destroyed. This helps prevent memory leaks in the application.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for constructor initialization of lowMapq and nullMapq in .cpp files rg "Alignment::Alignment" --type cppLength of output: 81
Script:
#!/bin/bash # Display the constructor implementation in Alignment.cpp rg "Alignment::Alignment" -A 10 --type cpp src/Alignment.cppLength of output: 333
include/ChrConverter.h (6)
- 50-50: The constructor
ChrConverter(const std::string &assemblyNameIn)
introduces a way to instantiateChrConverter
objects with a specific assembly name. This is a good practice for flexibility and maintainability, allowing the class to adapt to different genome assemblies.- 52-52: The presence of a virtual destructor in
ChrConverter
is crucial for a class intended to be subclassed. This ensures that the destructor of derived classes is called correctly when an object is deleted through a pointer to the base class, preventing resource leaks.- 54-54:
getAssemblyName
provides a way to retrieve the assembly name associated with aChrConverter
instance. This is useful for debugging and logging purposes, as well as for any functionality that may behave differently based on the assembly.- 57-112: The introduction of pure virtual functions for chromosome index mapping, type identification, and handling compressed master-ref chromosomes significantly enhances the
ChrConverter
class's flexibility and utility. This design allows for easy extension to support different genome assemblies and chromosome naming conventions. Ensure that derived classes implement these methods correctly and consider adding unit tests to verify their behavior across various scenarios.- 121-121: The method
isInBlockedRegion
is declared but not defined as pure virtual, suggesting it has a default implementation. Ensure that this implementation is provided and correctly identifies blocked regions, which could be critical for filtering out alignments to certain genomic regions.Verification successful
The review comment suggested verifying whether the
isInBlockedRegion
method, declared ininclude/ChrConverter.h
, has a default implementation or if it's defined as pure virtual. The verification process revealed that:
- The
isInBlockedRegion
method is indeed declared in the base classChrConverter
as a virtual function, implying it can be overridden by derived classes.- The
Hg37ChrConverter
class, a derived class, provides a specific implementation of this method, as seen insrc/Hg37ChrConverter.cpp
.- The base class
ChrConverter
does provide a default implementation of theisInBlockedRegion
method insrc/ChrConverter.cpp
, which simply returnsfalse
. This means that, by default, no regions are considered blocked unless overridden by a derived class.Given this information, the method
isInBlockedRegion
is correctly declared as virtual (not pure virtual) in the base class and has a default implementation that returnsfalse
. Derived classes, such asHg37ChrConverter
, are free to override this method to provide specific logic for identifying blocked regions. Therefore, the review comment is addressed, and the implementation aligns with the expectations for handling blocked regions in genomic alignments.* 148-152: The method `parseChrAndReturnIndex` is a powerful addition, allowing for flexible parsing of chromosome identifiers from various formats. This is particularly useful when dealing with different data sources or file formats. Ensure that the implementation handles edge cases and errors gracefully, possibly through comprehensive unit testing.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for the implementation of isInBlockedRegion in .cpp files rg "isInBlockedRegion" --type cppLength of output: 659
Script:
#!/bin/bash # Check if ChrConverter.cpp contains a default implementation of isInBlockedRegion cat src/ChrConverter.cppLength of output: 1394
Makefile (5)
- 10-13: The compiler flags include
-Wsign-conversion
to warn about conversions between signed and unsigned types. This is a good practice for catching potential issues early. However, ensure that this does not introduce too many warnings that could clutter the build output and make it harder to spot actual issues.- 15-21: The conditional logic for static linking (
ifeq ($(static),true)
) is clear and well-implemented. This allows for easy toggling of static vs. dynamic linking based on thestatic
variable. Just ensure that theboost_lib_dir
variable is correctly set when static linking is enabled, as it's crucial for locating the Boost libraries.Verification successful
The new context provided from the shell script output reveals that
boost_lib_dir
is indeed referenced within the repository, specifically in theREADME.md
and theMakefile
. TheREADME.md
file provides clear instructions on how to set theboost_lib_dir
variable, including examples of its usage in make commands for different scenarios (building, developing, testing). This indicates that the mechanism for settingboost_lib_dir
is documented and intended to be passed as an argument tomake
commands.Given this information, the review comment about ensuring that the
boost_lib_dir
variable is correctly set when static linking is enabled is verified as accurate. The documentation inREADME.md
supports the review comment by providing explicit guidance on how to setboost_lib_dir
for static linking.Therefore, the review comment is consistent with the codebase and the provided documentation.
* 23-32: The development flags (`ifeq ($(develop),true)`) include `-O0` and `-ggdb3` for easier debugging and `-DDEBUG` for enabling debug-specific code paths. This setup is beneficial for development builds. Just be cautious about the potential performance impact of `-O0` in larger projects and consider using higher optimization levels for release builds. * 60-61: Retrieving `strtk.hpp` using `wget` directly from a GitHub raw content URL ensures that the latest version is used. However, this approach might lead to inconsistencies or build failures if the file is updated or moved. Consider using a versioned dependency management system or including a specific version of the file in the repository. * 64-65: Similar to the comment on lines 60-61, directly fetching `rapidcsv.h` using `wget` can lead to potential build inconsistencies. It's recommended to manage external dependencies in a more controlled manner to ensure build reproducibility.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that boost_lib_dir is set and points to a valid directory if [ -n "$boost_lib_dir" ] && [ -d "$boost_lib_dir" ]; then echo "boost_lib_dir is set and valid." else echo "Please ensure boost_lib_dir is set to the Boost library directory for static linking." fiLength of output: 196
Script:
#!/bin/bash # Search for references to boost_lib_dir in the repository rg "boost_lib_dir"Length of output: 404
include/Hg37ChrConverter.h (3)
- 19-20: The header guards are correctly formatted and do not start with an underscore, following the best practices to avoid conflicts with reserved identifiers. This aligns with the user's previous action to update header guards as suggested.
- 30-32: The comment describing the
Hg37ChrConverter
class provides clear context on its purpose, which is to encapsulate the implementation details specific to the hg37 assembly. This is a good practice for maintainability and understanding the class's role within the larger codebase.- 151-156: The method
closeToSupp
uses a hardcoded multiplier (2.5
) for theDEFAULT_READ_LENGTH
when determining fuzziness for fuzzy secondary alignments. While this may be appropriate for the current use case, consider making this multiplier configurable or documenting the rationale behind this specific value to improve code clarity and flexibility.include/GenericChrConverter.h (3)
- 19-20: The header guards are correctly formatted and do not start with an underscore, following the best practices to avoid conflicts with reserved identifiers. This ensures that the header file is included only once during compilation.
- 49-52: The use of
boost::unordered::unordered_map
forChrToIndexMap
andCompressedMrefChrToIndexMap
is noted. While Boost provides robust and efficient implementations, consider if the standardstd::unordered_map
could suffice for your use case. This could potentially reduce the dependency on Boost unless specific Boost features are required, simplifying the build process and reducing compile times.- 97-101: The constructor
GenericChrConverter(std::string assemblyName, ChrInfoTable chrInfo)
is well-designed, allowing for flexible initialization with different types of contig/chromosome names and sizes. This design supports extensibility for different genome assemblies. Ensure that theassemblyName
parameter is used appropriately within the class to distinguish between different assemblies if necessary.include/BreakpointReduced.h (4)
- 54-55: The constructor
BreakpointReduced(const Breakpoint &tmpBp, int lineIndexIn, bool hasOverhangIn)
is well-documented, indicating its purpose and the parameters it accepts. This clarity is beneficial for understanding the class's instantiation process. Ensure that the parameters are used correctly within the constructor to initialize the class members appropriately.- 95-95: The method
getChrIndex()
correctly returns the chromosome index as aChrIndex
type, aligning with the changes made to the return type. This enhances type safety and clarity in the code. Ensure that all usages of this method are updated to handle theChrIndex
type correctly.- 151-156: The method
closeToSupp
uses a hardcoded multiplier (2.5
) for theDEFAULT_READ_LENGTH
when determining fuzziness for fuzzy secondary alignments. While this may be appropriate for the current use case, consider making this multiplier configurable or documenting the rationale behind this specific value to improve code clarity and flexibility.- 165-185: The method
distanceToSupp
calculates the distance to a supplementary alignment, considering whether the alignment is fuzzy. This method is crucial for determining the proximity of breakpoints to supplementary alignments. Ensure that the logic correctly handles all cases, especially the edge cases around fuzzy alignments. The use of1000000
as a special value for different chromosomes might benefit from being defined as a constant for clarity.tests/GenericChrConverter_test.cpp (12)
- 10-19: The test case
GenericChrConverterTest_chrNameToIndex
correctly verifies the conversion from chromosome names to their respective indices. However, it's recommended to also include negative test cases to ensure that the converter handles invalid chromosome names appropriately.- 22-25: The test case
GenericChrConverter_assemblyName
effectively checks if the assembly name is correctly retrieved as "hg38". This is a straightforward validation and seems to cover the intended functionality well.- 28-31: The test case
GenericChrConverter_nChromosomes
properly validates the total number of chromosomes. Including a comment to explain the significance of the number 3367 could enhance readability and maintainability.- 34-43: The test case
GenericChrConverterTest_indexToChrName
accurately tests the conversion from indices to chromosome names. It's comprehensive, covering a range of chromosome types. No improvements needed here.- 58-61: The test case
GenericChrConverterTest_nChromosomesCompressedMref
checks the number of chromosomes in the compressed Mref, which is a specific scenario. It's good to have such focused tests, but adding a brief comment explaining the context or significance of the number 3365 would be helpful.- 64-78: The test case
GenericChrConverterTest_is_category
thoroughly tests various chromosome categories. This is a comprehensive test that covers a wide range of scenarios, ensuring the categorization logic is correctly implemented.- 80-89: The test case
GenericChrConverter_isCompressedMrefIndex
effectively checks the logic for determining if an index belongs to the compressed Mref. It's clear and covers both positive and negative scenarios.- 92-100: The test case
GenericChrConverter_compressedMrefIndexToIndex
validates the conversion from compressed Mref indices to global indices. Including a brief explanation for the logic behind the specific index values used in the test could enhance clarity.- 103-115: The test case
GenericChrConverter_indexToCompressedMrefIndex
checks the conversion from global indices to compressed Mref indices. It's well-structured and covers important scenarios, including error handling.- 118-123: The test case
GenericChrConverterTest_ParseSimpleStrings
demonstrates parsing functionality for simple chromosome strings. It's a straightforward test that validates an essential part of the chromosome conversion process.- 126-133: The test case
GenericChrConverter_chrSizeCompressedMref
checks chromosome sizes within the compressed Mref. It would be beneficial to include a source or reference for the specific size values used in the test to ensure accuracy and maintainability.- 136-152: The test case
GenericChrConverterTest_ParseBreakPointStrings
effectively tests the parsing of complex breakpoint strings. It's comprehensive and covers various formats, ensuring robust parsing functionality.src/DeFuzzier.cpp (4)
- 34-47: The method
deFuzzyDb
processes a vector ofBreakpointReduced
objects to handle fuzzy supplementary alignments. It iterates through breakpoints and their supplementary alignments, applyingprocessFuzzySa
andremoveMarkedFuzzies
to clean up fuzzy alignments. This method is well-structured and follows good practices in handling the fuzziness in alignments. However, ensure that theprocessFuzzySa
method is thoroughly tested, especially for edge cases where alignments might be close to theMAX_DISTANCE
threshold.- 49-60: The
processFuzzySa
method processes a single supplementary alignment annotation (SuppAlignmentAnno
) for fuzziness by performing forward and backward database sweeps (dbSweep
) and then selecting the best supplementary alignment (selectBestSa
). The logic appears sound, but it's crucial to verify the order of forward and backward sweeps, especially in lines 53-58, to ensure it aligns with the intended logic for handling alignments with encountered matches (M
) differently.- 127-151: The method
deFuzzyDb
for processingMrefEntry
objects includes logic for handling fuzzy or strict fuzzy supplementary alignments. It's important to ensure that the conditions for continuing or processing alignments (lines 133-137) are correctly implemented and tested. Additionally, the method's handling of invalid entries (lines 142-144) based on validity score and supplementary alignments presence is a good practice for maintaining data integrity.- 153-170: In the
processFuzzySa
method forMrefEntry
objects, the inclusion offileIndices
as a parameter fordbSweep
calls (lines 163-167) introduces a more complex handling of supplementary alignments based on file indices. This approach allows for a nuanced processing of alignments, potentially improving the accuracy of fuzzy alignment handling. However, it's essential to validate the logic and performance implications of this added complexity, especially in scenarios with a large number of file indices.include/Breakpoint.h (2)
- 46-47: The constructor of the
Breakpoint
class has been updated to acceptChrIndex
andChrSize
types forchrIndexIn
andposIn
, respectively. This change enhances type safety and clarity, aligning with the objective of using specific data types for chromosome indices and positions. Ensure that all usages of theBreakpoint
constructor throughout the codebase have been updated to pass the correct types.- 54-62: The introduction of constants such as
PERMISSIBLE_MISMATCHES
,MAX_PERMISSIBLE_SOFTCLIPS
, andBP_SUPPORT_THRESHOLD
improves the readability and maintainability of the code by replacing magic numbers with descriptive names. This change follows best practices for code clarity. Verify that these constants are used consistently across the codebase wherever applicable.src/MrefEntry.cpp (3)
- 52-142: The
addEntry
method in theMrefEntry
class has been significantly updated to include more detailed logic for handling breakpoints and supplementary alignments. The method now incorporates artifact ratio calculations and validity checks based on various conditions. This comprehensive approach improves the accuracy of entry processing. However, the complexity of the method has increased, so it's crucial to ensure that it is well-documented and that unit tests cover the new logic extensively.- 144-161: The
mergeMrefEntries
method combines twoMrefEntry
objects, merging their properties and supplementary alignments. This method is crucial for consolidating entries from different sources. Ensure that the merging logic correctly handles overlapping or duplicate supplementary alignments, especially when updating thevalidity
score. Consider adding unit tests to verify the correct behavior of this method under various scenarios.- 221-233: The addition of the
searchFuzzySa
method enhances theMrefEntry
class's ability to search for a matching fuzzy supplementary alignment. This method is a valuable addition for improving the handling of fuzzy alignments. Ensure that the logic for determining closeness (saClosenessDirectional
) is accurately implemented and tested, particularly in handling edge cases where alignments are near the fuzziness threshold.src/MasterRefProcessor.cpp (4)
- 39-45: The constructor's documentation mentions side effects related to file reading and writing, which is generally discouraged. Consider separating file I/O operations from the constructor to improve code clarity and maintainability.
- 58-74: Preallocating memory for
mrefDb
based on the genome size is a good performance optimization. However, ensure that the system running this code has sufficient memory resources to handle this allocation, especially for large genomes like hg38.- 135-148: The logging of file processing time and breakpoints is useful for debugging and performance monitoring. However, consider making the logging level configurable to avoid verbose output in production environments.
- 247-261: The
processBp
method correctly updates themrefDb
with new entries. Ensure that the method's performance is optimized for large datasets by profiling and optimizing any bottlenecks.src/sophiaAnnotate.cpp (3)
- 27-43: The main function is well-organized with clear separation of command-line parsing and application logic. Consider adding more comments to explain the purpose of each block for better readability.
- 120-121: The error message "No mref file given, exiting" could be more informative by suggesting how to provide the mref file or pointing to documentation for further assistance.
- 132-136: The error message "number of PIDS in the MREF not given, exiting" could be improved by providing more context or guidance on how to specify the number of PIDs in the MREF, enhancing user experience.
src/SuppAlignmentAnno.cpp (2)
- 158-160: Setting
fuzzy
totrue
whensupport + secondarySupport == 0
is a significant change in logic. Ensure this behavior is intended and documented, as it could affect downstream processing of alignments.Please verify that this change in the logic for determining
fuzzy
status aligns with the intended behavior and update documentation accordingly.
- 251-251: The existing comment from a previous review about the incorrect assignment of
defaultReadLengthControl
has been addressed in this version of the code. However, it's important to ensure that similar issues are not present elsewhere in the codebase.The correction for
defaultReadLengthControl
is noted and appreciated. Please ensure similar diligence in other parts of the code to prevent similar issues.src/SuppAlignment.cpp (2)
- 69-106: The static factory method
create
provides a clear and concise way to instantiateSuppAlignment
objects. However, the method's logic is complex and could benefit from additional comments explaining the rationale behind certain decisions, such as the handling oforiginIndexIn
and the calculation ofstrictFuzzy
.Consider adding more detailed comments within the
create
method to explain the logic and decisions made, enhancing maintainability and readability.
- 493-504: Given the user's previous clarification that changes to the
saCloseness
method were only related to code layout and not logic, it's important to ensure that the refactoring has not inadvertently introduced any logic changes.The layout changes in the
saCloseness
method appear to be correctly implemented without altering the underlying logic. Please ensure comprehensive testing to confirm this.src/SamSegmentMapper.cpp (6)
- 33-45: The constructor of
SamSegmentMapper
has been updated to useChrSize
forDISCORDANT_LEFT_RANGE
,DISCORDANT_RIGHT_RANGE
,minPos
, andmaxPos
. This change aligns with the PR objectives to useChrSize
instead ofint
for certain parameters, enhancing type safety and clarity.- 48-72: In the
parseSamStream
method, the logic to skip certain alignments based onchrConverter
conditions has been added. This change improves the method's ability to filter out irrelevant alignments, enhancing performance and correctness. However, ensure that the conditions used for skipping alignments are well-documented and tested to avoid unintended filtering.Verification successful
Given the context and the need to verify the correctness of the new skipping logic in the
parseSamStream
method, we will proceed by looking for documentation or comments related to the use ofchrConverter.isCompressedMref()
for filtering alignments. Additionally, we will search for unit tests that cover this method, focusing on the new logic. This approach will help ensure that the changes do not unintentionally exclude valid alignments.
The verification process has confirmed that the new skipping logic in the
parseSamStream
method, which utilizeschrConverter.isCompressedMref
to filter alignments, is based on well-understood and tested functionality. The documentation and tests related toisCompressedMref
provide confidence that this method accurately identifies the intended chromosomes for inclusion or exclusion, aligning with the goal of enhancing performance and correctness without unintentionally filtering out valid alignments.However, the search did not reveal specific unit tests for the
parseSamStream
method itself, suggesting a potential gap in test coverage that could be addressed in future development efforts.* 76-91: The `switchChromosome` method includes additional logic for proper pair compensation mode and clearing pools. This enhancement is crucial for handling chromosome switches accurately, especially in scenarios involving proper pair compensation. It's important to verify that the clearing of pools and the handling of proper pair compensation mode do not introduce any unintended side effects. * 97-141: The `printBps` method now uses `ChrSize` for alignment positions and includes additional coverage calculations. This change enhances the method's accuracy and compatibility with the updated type system. However, it's essential to ensure that the additional coverage calculations are correct and do not impact performance negatively. * 177-407: The `incrementCoverages` method has been modified to handle coverage profiles based on alignment positions and types more accurately. This includes different handling for soft and hard alignments based on read types. It's crucial to ensure that these changes accurately reflect the intended logic and do not introduce any regressions in coverage calculation. * 444-480: The `assignBps` method now handles soft and hard alignments differently based on read types. This change is aligned with the PR objectives and enhances the method's ability to accurately assign breakpoints based on alignment characteristics. It's important to verify that this new handling is correctly implemented and thoroughly tested.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the new skipping logic in parseSamStream does not unintentionally exclude valid alignments. # This script is a placeholder for the actual verification process, which should include reviewing documentation and running unit tests. echo "Review documentation and run unit tests to verify new alignment skipping logic."Length of output: 168
Script:
#!/bin/bash # Search for documentation or comments related to chrConverter.isCompressedMref() echo "Searching for documentation or comments related to chrConverter.isCompressedMref():" rg "isCompressedMref" --context 5 # Search for unit tests covering the parseSamStream method echo -e "\nSearching for unit tests covering the parseSamStream method:" rg "parseSamStream" --files-with-matches | grep -i "test"Length of output: 18447
src/Alignment.cpp (9)
- 41-49: Static member variables
LOW_QUAL_CLIP_THRESHOLD
,BASE_QUALITY_THRESHOLD
,BASE_QUALITY_THRESHOLD_LOW
,CLIPPED_NUCLEOTIDE_COUNT_THRESHOLD
,INDEL_NUCLEOTIDE_COUNT_THRESHOLD
, andISIZEMAX
are correctly defined outside the class definition. This follows the C++ standard for static member initialization and addresses the previous comment regardingISIZEMAX
.- 666-725: The
generateSuppAlignments
method has been refactored to improve readability and maintainability, addressing the previous comment about its complexity. The use of vectors to store iterators for supplementary alignment (SA) tag parsing and the separation of logic for handling outlier mate distances are positive changes. However, ensure that magic numbers and conditions are well-documented or replaced with named constants where applicable, especially in the parsing of theSA
tag and the assessment of outlier mate distance.Ensure that all magic numbers and conditions within the
generateSuppAlignments
method are well-documented or replaced with named constants for clarity.
- 110-123: The parsing of the SAM flag to set
supplementary
,fwdStrand
, andinvertedMate
variables is correctly implemented using bitwise operations on the flag. This is a concise and efficient way to extract flag information. However, consider adding comments to explain the significance of the specific bits being checked (e.g., bit 11 for supplementary, bit 4 for forward strand of the primary alignment, bit 5 for forward strand of the mate) for readers unfamiliar with the SAM format.Consider adding comments explaining the significance of specific bits in the SAM flag for clarity.
- 242-271: The
isEventCandidate
method efficiently determines whether an alignment is an event candidate based on the last CIGAR code or the presence of specific CIGAR operations. This method is well-implemented and follows best practices for parsing and evaluating CIGAR strings. However, adding a brief comment explaining the rationale behind considering these specific CIGAR operations as indicators of an event candidate could enhance code readability.Consider adding a brief comment explaining why specific CIGAR operations indicate an event candidate.
- 273-336: The
createCigarChunks
method processes the CIGAR string to create chunks representing different operations. This method is crucial for accurately interpreting the alignment and identifying potential breakpoints. The implementation appears to be correct and efficient. However, ensure that the handling of edge cases, such as unusual or malformed CIGAR strings, is robust and that appropriate error handling or validation is in place.- 397-417: The
qualityCheckCascade
method implements a series of quality checks to determine thereadType
. This method is a good example of breaking down complex logic into smaller, more manageable parts. However, ensure that each quality check method called within this cascade (e.g.,clipCountCheck
,uniqueSuppCheck
) is well-documented and tested, as these checks are critical for the tool's accuracy.- 520-553: The
overhangMedianQuality
method calculates the median quality of overhangs, which is important for assessing the reliability of alignments. The method's implementation is correct, and the use of a vector to store base qualities is efficient. To enhance code clarity, consider adding comments explaining the significance of the median quality calculation and how it impacts the assessment of alignments.Consider adding comments explaining the significance of the median quality calculation in the
overhangMedianQuality
method.
- 728-743: The
printOverhang
method generates a string representation of the overhang associated with the chosen breakpoint. This method is correctly implemented and efficiently constructs the string. However, adding a brief comment explaining the format of the generated string (e.g., the significance of the pipe symbol and the appended number in parentheses) could enhance readability for future maintainers.Consider adding a brief comment explaining the format of the string generated by the
printOverhang
method.
- 796-829: The
fullMedianQuality
andgetMedian
methods are correctly implemented for calculating the median quality of a sequence. These methods are essential for assessing the quality of alignments and overhangs. The use of templates forgetMedian
enhances its versatility. Ensure that these methods are thoroughly tested with various input sequences to validate their correctness and robustness.src/Hg37ChrConverter.cpp (1)
- 293-302: The explanation regarding the change from
-2
toINVALID (1003)
for representing invalid indices is helpful. However, consider further clarifying the implications of this change for downstream code, especially regarding the unsigned integer representation and compile-time checks.src/AnnotationProcessor.cpp (9)
- 42-42: Renaming
ABRIDGEDOUTPUT
toABRIDGED_OUTPUT
improves readability and follows the naming convention for constants.- 48-48: The introduction of
GERMLINE_DB_LIMIT
as a constructor parameter and its subsequent use is a good practice for making the class more configurable.- 56-64: Refactoring the initialization of
tumorResults
andcontrolResults
to use dynamic sizing based onnCompressedMrefChromosomes
is a significant improvement for flexibility and maintainability.- 73-89: The logic for processing tumor results has been updated to use the
ChrConverter
fromGlobalAppConfig
. This centralization is a positive change for maintainability and consistency across the application.- 107-107: Using
DeFuzzier
with a dynamic threshold based ondefaultReadLengthTumorIn
is a good practice for making the fuzziness threshold adaptable to different read lengths.- 127-150: The constructor overload for handling control results introduces similar improvements as seen with tumor results, including dynamic sizing and usage of
ChrConverter
. This consistency in handling both tumor and control results is commendable.- 156-169: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [159-257]
The logic for processing control results has been updated similarly to tumor results, including the use of
ChrConverter
and handling of overhangs. This consistency and centralization of chromosome conversion logic are beneficial.
- 321-331: The
searchMatches
method's refactoring to useChrConverter
fromGlobalAppConfig
for chromosome index conversion is a good example of centralizing configuration and utility functions.- 337-398: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [340-443]
The
searchSa
method's logic has been refined, including pre-checks for creating unknown match SVs and handling of supplementary alignments. These changes likely improve the accuracy and efficiency of SV matching.
Currently, SOPHIA only is optimized for the hg37 assembly of the human genome. | ||
|
||
> **NOTE**: We are preparing hg38 support. See the [Changes](#changes) section for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The note about preparing hg38 support might be misleading since the PR aims to introduce hg38 support. Consider updating this note to reflect that hg38 support is now available, rather than in preparation.
- > **NOTE**: We are preparing hg38 support. See the [Changes](#changes) section for details.
+ > **NOTE**: hg38 support is now available. See the [Changes](#changes) section for details.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Currently, SOPHIA only is optimized for the hg37 assembly of the human genome. | |
> **NOTE**: We are preparing hg38 support. See the [Changes](#changes) section for details. | |
Currently, SOPHIA only is optimized for the hg37 assembly of the human genome. | |
> **NOTE**: hg38 support is now available. See the [Changes](#changes) section for details. |
"505", "506", "507", "508", "509", | ||
"510", "511", "512", "513", "514", | ||
"515", "516", "517", "518", "519", | ||
"520", "521", "522", "523", "524", | ||
"525", "526", "527", "528", "529", | ||
"530", "531", "532", "533", "534", | ||
"535", "536", "537", "538", "539", | ||
"540", "541", "542", "543", "544", | ||
"545", "546", "547", "548", "549", | ||
"550", "551", "552", "553", "554", | ||
"555", "556", "557", "558", "559", | ||
"560", "561", "562", "563", "564", | ||
"565", "566", "567", "568", "569", | ||
"570", "571", "572", "573", "574", | ||
"575", "576", "577", "578", "579", | ||
"580", "581", "582", "583", "584", | ||
"585", "586", "587", "588", "589", | ||
"590", "591", "592", "593", "594", | ||
"595", "596", "597", "598", "599", | ||
"600", "601", "602", "603", "604", | ||
"605", "606", "607", "608", "609", | ||
"610", "611", "612", "613", "614", | ||
"615", "616", "617", "618", "619", | ||
"620", "621", "622", "623", "624", | ||
"625", "626", "627", "628", "629", | ||
"630", "631", "632", "633", "634", | ||
"635", "636", "637", "638", "639", | ||
"640", "641", "642", "643", "644", | ||
"645", "646", "647", "648", "649", | ||
"650", "651", "652", "653", "654", | ||
"655", "656", "657", "658", "659", | ||
"660", "661", "662", "663", "664", | ||
"665", "666", "667", "668", "669", | ||
"670", "671", "672", "673", "674", | ||
"675", "676", "677", "678", "679", | ||
"680", "681", "682", "683", "684", | ||
"685", "686", "687", "688", "689", | ||
"690", "691", "692", "693", "694", | ||
"695", "696", "697", "698", "699", | ||
"700", "701", "702", "703", "704", | ||
"705", "706", "707", "708", "709", | ||
"710", "711", "712", "713", "714", | ||
"715", "716", "717", "718", "719", | ||
"720", "721", "722", "723", "724", | ||
"725", "726", "727", "728", "729", | ||
"730", "731", "732", "733", "734", | ||
"735", "736", "737", "738", "739", | ||
"740", "741", "742", "743", "744", | ||
"745", "746", "747", "748", "749", | ||
"750", "751", "752", "753", "754", | ||
"755", "756", "757", "758", "759", | ||
"760", "761", "762", "763", "764", | ||
"765", "766", "767", "768", "769", | ||
"770", "771", "772", "773", "774", | ||
"775", "776", "777", "778", "779", | ||
"780", "781", "782", "783", "784", | ||
"785", "786", "787", "788", "789", | ||
"790", "791", "792", "793", "794", | ||
"795", "796", "797", "798", "799", | ||
"800", "801", "802", "803", "804", | ||
"805", "806", "807", "808", "809", | ||
"810", "811", "812", "813", "814", | ||
"815", "816", "817", "818", "819", | ||
"820", "821", "822", "823", "824", | ||
"825", "826", "827", "828", "829", | ||
"830", "831", "832", "833", "834", | ||
"835", "836", "837", "838", "839", | ||
"840", "841", "842", "843", "844", | ||
"845", "846", "847", "848", "849", | ||
"850", "851", "852", "853", "854", | ||
"855", "856", "857", "858", "859", | ||
"860", "861", "862", "863", "864", | ||
"865", "866", "867", "868", "869", | ||
"870", "871", "872", "873", "874", | ||
"875", "876", "877", "878", "879", | ||
"880", "881", "882", "883", "884", | ||
"885", "886", "887", "888", "889", | ||
"890", "891", "892", "893", "894", | ||
"895", "896", "897", "898", "899", | ||
"900", "901", "902", "903", "904", | ||
"905", "906", "907", "908", "909", | ||
"910", "911", "912", "913", "914", | ||
"915", "916", "917", "918", "919", | ||
"920", "921", "922", "923", "924", | ||
"925", "926", "927", "928", "929", | ||
"930", "931", "932", "933", "934", | ||
"935", "936", "937", "938", "939", | ||
"940", "941", "942", "943", "944", | ||
"945", "946", "947", "948", "949", | ||
"950", "951", "952", "953", "954", | ||
"955", "956", "957", "958", "959", | ||
"960", "961", "962", "963", "964", | ||
"965", "966", "967", "968", "969", | ||
"970", "971", "972", "973", "974", | ||
"975", "976", "977", "978", "979", | ||
"980", "981", "982", "983", "984", | ||
"985", "986", "987", "988", "989", | ||
"990", "991", "992", "993", "994", | ||
"995", "996", "997", "998", "hs37d5", | ||
"NC_007605", "MT", "phiX174", "INVALID"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The static mapping of chromosome names to indices (indexToChrName
and compressedMrefIndexToChrName
) and chromosome sizes (chrSizesCompressedMref
) are hardcoded. While this is acceptable for constants, consider externalizing such data for maintainability and flexibility, especially if these might change or need to be reused.
static const std::vector<ChrSize> chrSizesCompressedMref { | ||
249250622, 243199374, 198022431, 191154277, 180915261, 171115068, | ||
159138664, 146364023, 141213432, 135534748, 135006517, 133851896, | ||
115169879, 107349541, 102531393, 90354754, 81195211, 78077249, | ||
59128984, 63025521, 48129896, 51304567, 155270561, 59373567, | ||
106434, 547497, 189790, 191470, 182897, 38915, | ||
37176, 90086, 169875, 187036, 36149, 40104, | ||
37499, 81311, 174589, 41002, 4263, 92690, | ||
159170, 27683, 166567, 186859, 164240, 137719, | ||
172546, 172295, 172150, 161148, 179199, 161803, | ||
155398, 186862, 180456, 179694, 211174, 15009, | ||
128375, 129121, 19914, 43692, 27387, 40653, | ||
45942, 40532, 34475, 41935, 45868, 39940, | ||
33825, 41934, 42153, 43524, 43342, 39930, | ||
36652, 38155, 36423, 39787, 38503, 35477944, | ||
171824}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded chromosome sizes in chrSizesCompressedMref
are noted to differ from the 1K genomes reference. It's crucial to document the source and reasoning for these discrepancies to ensure clarity and correctness for future maintainers and users.
Hg37ChrConverter::Hg37ChrConverter(const std::vector<ChrName> &indexToChrName, | ||
const std::vector<ChrName> &compressedMrefIndexToChrName, | ||
const std::vector<ChrSize> &chrSizesCompressedMref, | ||
const std::vector<CompressedMrefIndex> &indexToCompressedMrefIndex) : | ||
ChrConverter("classic_hg37"), | ||
_indexToChrName {indexToChrName}, | ||
_compressedMrefIndexToChrName {compressedMrefIndexToChrName}, | ||
_chrSizesCompressedMref {chrSizesCompressedMref}, | ||
_indexToCompressedMrefIndex {indexToCompressedMrefIndex}, | ||
_compressedMrefIndexToIndex {_buildCompressedMrefIndexToIndex( | ||
compressedMrefIndexToChrName.size(), | ||
indexToCompressedMrefIndex)}{ | ||
if (indexToChrName.size() != indexToCompressedMrefIndex.size()) | ||
throw_with_trace(std::invalid_argument( | ||
"indexToChrName and indexToCompressedMrefIndex must have the same size. " | ||
"Found sizes: indexToChrName=" + std::to_string(indexToChrName.size()) + | ||
", indexToCompressedMrefIndex=" + std::to_string(indexToCompressedMrefIndex.size()))); | ||
if (compressedMrefIndexToChrName.size() != chrSizesCompressedMref.size()) | ||
throw_with_trace(std::invalid_argument( | ||
"compressedMrefIndexToChrName and chrSizesCompressedMref must have the same size. " | ||
"Found sizes: compressedMrefIndexToChrName=" + std::to_string(compressedMrefIndexToChrName.size()) + | ||
", chrSizesCompressedMref=" + std::to_string(chrSizesCompressedMref.size()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor Hg37ChrConverter
performs size checks to ensure consistency between related data structures. This is good practice. However, consider adding more detailed error messages to specify which sizes are mismatched, enhancing the debugging process for future developers.
/* This is parsing code. It takes a position in a character stream, and translates the | ||
following character(s) into index positions (see ChrConverter::indexToChrName). It is slightly | ||
modified from the original implementation by Umut Toprak. | ||
|
||
If the first position is a digit, read up to the next stopChar. | ||
|
||
* (\d+)$ -> $1 | ||
|
||
If the first position is *not* a digit return indices according to the following rules: | ||
|
||
* h -> 999 | ||
* X -> 40 | ||
* Y -> 41 | ||
* MT -> 1001 | ||
* G?(\d+)\. -> $1 | ||
* N -> 1000 | ||
* p -> 1002 | ||
|
||
NOTE: Most of the matches are eager matches, which means the algorithm does not check for | ||
whether the end iterator or the stopChar is actually reached or whether it follows | ||
any expected pattern! The actual stopChar is not actually checked in these cases. | ||
|
||
All identifiers not matching any of these rules will throw an exception (domain_error). | ||
|
||
IMPORTANT: The hg37 parser here ignores the stopCharExt, but instead remains with the legacy | ||
behavior. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parsing logic in parseChrAndReturnIndex
is complex and handles various chromosome naming conventions. Given the complexity and potential for errors, consider adding unit tests specifically targeting this function to ensure its correctness across a wide range of inputs.
Would you like me to help by adding unit tests for this parsing logic?
include/Breakpoint.h
Outdated
bool closeToSupp(const SuppAlignment &compIn, ChrDistance fuzziness) const { | ||
if (chrIndex == compIn.getChrIndex()) { | ||
if (compIn.isFuzzy()) { | ||
fuzziness = ChrDistance(trunc(2.5 * static_cast<int>(DEFAULT_READ_LENGTH))); | ||
return (ChrDistance(static_cast<int>(pos)) - fuzziness) <= | ||
(ChrDistance(static_cast<int>(compIn.getExtendedPos())) + fuzziness) && | ||
(ChrDistance(static_cast<int>(compIn.getPos())) - fuzziness) <= | ||
(ChrDistance(static_cast<int>(pos)) + fuzziness); | ||
} else { | ||
return ChrDistance(abs(static_cast<int>(pos) - static_cast<int>(compIn.getPos()))) <= fuzziness; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method closeToSupp
checks if the breakpoint is close to a supplementary alignment within a specified fuzziness. The logic correctly handles both fuzzy and non-fuzzy alignments. However, the magic number 2.5
used in the calculation of fuzziness for fuzzy alignments (line 138) should be defined as a constant or explained via a comment for better clarity and maintainability.
+ // The factor 2.5 is used to adjust fuzziness for fuzzy alignments, based on empirical observations.
138 fuzziness = ChrDistance(trunc(2.5 * static_cast<int>(DEFAULT_READ_LENGTH)));
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
bool closeToSupp(const SuppAlignment &compIn, ChrDistance fuzziness) const { | |
if (chrIndex == compIn.getChrIndex()) { | |
if (compIn.isFuzzy()) { | |
fuzziness = ChrDistance(trunc(2.5 * static_cast<int>(DEFAULT_READ_LENGTH))); | |
return (ChrDistance(static_cast<int>(pos)) - fuzziness) <= | |
(ChrDistance(static_cast<int>(compIn.getExtendedPos())) + fuzziness) && | |
(ChrDistance(static_cast<int>(compIn.getPos())) - fuzziness) <= | |
(ChrDistance(static_cast<int>(pos)) + fuzziness); | |
} else { | |
return ChrDistance(abs(static_cast<int>(pos) - static_cast<int>(compIn.getPos()))) <= fuzziness; | |
} | |
bool closeToSupp(const SuppAlignment &compIn, ChrDistance fuzziness) const { | |
if (chrIndex == compIn.getChrIndex()) { | |
if (compIn.isFuzzy()) { | |
// The factor 2.5 is used to adjust fuzziness for fuzzy alignments, based on empirical observations. | |
fuzziness = ChrDistance(trunc(2.5 * static_cast<int>(DEFAULT_READ_LENGTH))); | |
return (ChrDistance(static_cast<int>(pos)) - fuzziness) <= | |
(ChrDistance(static_cast<int>(compIn.getExtendedPos())) + fuzziness) && | |
(ChrDistance(static_cast<int>(compIn.getPos())) - fuzziness) <= | |
(ChrDistance(static_cast<int>(pos)) + fuzziness); | |
} else { | |
return ChrDistance(abs(static_cast<int>(pos) - static_cast<int>(compIn.getPos()))) <= fuzziness; | |
} |
include/MrefEntry.h
Outdated
ChrSize getPos() const { | ||
if (!isValid()) { | ||
throw_with_trace(logic_error("MrefEntry is invalid")); | ||
} | ||
return pos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of exception throwing in getPos()
for invalid MrefEntry
instances is a robust error handling strategy. It prevents undefined behavior by ensuring that the position is only accessed when the entry is valid. Consider adding documentation to the method to inform users about this behavior.
+ /**
+ * @brief Gets the position of the MrefEntry.
+ * @throws logic_error if the MrefEntry is invalid.
+ * @return The position of the MrefEntry.
+ */
ChrSize getPos() const;
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
ChrSize getPos() const { | |
if (!isValid()) { | |
throw_with_trace(logic_error("MrefEntry is invalid")); | |
} | |
return pos; | |
/** | |
* @brief Gets the position of the MrefEntry. | |
* @throws logic_error if the MrefEntry is invalid. | |
* @return The position of the MrefEntry. | |
*/ | |
ChrSize getPos() const { | |
if (!isValid()) { | |
throw_with_trace(logic_error("MrefEntry is invalid")); | |
} | |
return pos; |
bool MrefEntry::saMatcher(SuppAlignmentAnno* saPtr) { | ||
if (saPtr->isToRemove() | ||
|| saPtr->isSuspicious() | ||
|| (saPtr->getExpectedDiscordants() > 0 | ||
&& !(saPtr->getMateSupport() / (saPtr->getExpectedDiscordants() + 0.0) > 0.1))) { | ||
return true; | ||
} | ||
for (auto &sa : suppAlignments) { | ||
if (sa.saCloseness(*saPtr, 100)) { | ||
if (sa.isFuzzy()) { | ||
if (saPtr->isFuzzy()) { | ||
sa.extendSuppAlignment(saPtr->getPos(), saPtr->getExtendedPos()); | ||
} else { | ||
sa.removeFuzziness(*saPtr); | ||
} | ||
} | ||
sa.addSupportingIndices(saPtr->getSupportingIndices()); | ||
sa.mergeMrefSa(*saPtr); | ||
saPtr->setToRemove(true); | ||
return true; | ||
} | ||
} | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The saMatcher
method checks if a supplementary alignment should be merged with existing ones based on closeness and other criteria. While the method's logic is sound, the hardcoded value 100
used as a threshold in saCloseness
(line 243) should be defined as a constant or configurable parameter for better flexibility and clarity.
- if (sa.saCloseness(*saPtr, 100)) {
+ // The threshold for determining closeness between supplementary alignments.
+ const int SA_CLOSENESS_THRESHOLD = 100;
+ if (sa.saCloseness(*saPtr, SA_CLOSENESS_THRESHOLD)) {
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
bool MrefEntry::saMatcher(SuppAlignmentAnno* saPtr) { | |
if (saPtr->isToRemove() | |
|| saPtr->isSuspicious() | |
|| (saPtr->getExpectedDiscordants() > 0 | |
&& !(saPtr->getMateSupport() / (saPtr->getExpectedDiscordants() + 0.0) > 0.1))) { | |
return true; | |
} | |
for (auto &sa : suppAlignments) { | |
if (sa.saCloseness(*saPtr, 100)) { | |
if (sa.isFuzzy()) { | |
if (saPtr->isFuzzy()) { | |
sa.extendSuppAlignment(saPtr->getPos(), saPtr->getExtendedPos()); | |
} else { | |
sa.removeFuzziness(*saPtr); | |
} | |
} | |
sa.addSupportingIndices(saPtr->getSupportingIndices()); | |
sa.mergeMrefSa(*saPtr); | |
saPtr->setToRemove(true); | |
return true; | |
} | |
} | |
return false; | |
} | |
bool MrefEntry::saMatcher(SuppAlignmentAnno* saPtr) { | |
if (saPtr->isToRemove() | |
|| saPtr->isSuspicious() | |
|| (saPtr->getExpectedDiscordants() > 0 | |
&& !(saPtr->getMateSupport() / (saPtr->getExpectedDiscordants() + 0.0) > 0.1))) { | |
return true; | |
} | |
for (auto &sa : suppAlignments) { | |
// The threshold for determining closeness between supplementary alignments. | |
const int SA_CLOSENESS_THRESHOLD = 100; | |
if (sa.saCloseness(*saPtr, SA_CLOSENESS_THRESHOLD)) { | |
if (sa.isFuzzy()) { | |
if (saPtr->isFuzzy()) { | |
sa.extendSuppAlignment(saPtr->getPos(), saPtr->getExtendedPos()); | |
} else { | |
sa.removeFuzziness(*saPtr); | |
} | |
} | |
sa.addSupportingIndices(saPtr->getSupportingIndices()); | |
sa.mergeMrefSa(*saPtr); | |
saPtr->setToRemove(true); | |
return true; | |
} | |
} | |
return false; | |
} |
src/sophiaMref.cpp
Outdated
std::optional<std::string> assemblyNameOpt { }; | ||
if (inputVariables.count("assemblyname")) { | ||
assemblyNameOpt = inputVariables["assemblyname"].as<string>(); | ||
} | ||
setApplicationConfig(assemblyNameOpt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conditional setting of assemblyNameOpt
based on the presence of the "assemblyname" argument is a good practice. It allows for optional configuration while providing a default value. Consider adding validation for the assembly name to ensure it matches expected values.
if (inputVariables.count("assemblyname")) {
assemblyNameOpt = inputVariables["assemblyname"].as<string>();
+ // Validate assembly name
+ if (!isValidAssemblyName(assemblyNameOpt.value())) {
+ cerr << "Invalid assembly name provided." << endl;
+ return 1;
+ }
}
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
std::optional<std::string> assemblyNameOpt { }; | |
if (inputVariables.count("assemblyname")) { | |
assemblyNameOpt = inputVariables["assemblyname"].as<string>(); | |
} | |
setApplicationConfig(assemblyNameOpt); | |
std::optional<std::string> assemblyNameOpt { }; | |
if (inputVariables.count("assemblyname")) { | |
assemblyNameOpt = inputVariables["assemblyname"].as<string>(); | |
// Validate assembly name | |
if (!isValidAssemblyName(assemblyNameOpt.value())) { | |
cerr << "Invalid assembly name provided." << endl; | |
return 1; | |
} | |
} | |
setApplicationConfig(assemblyNameOpt); |
src/Sdust.cpp
Outdated
void | ||
Sdust::shiftWindow(int t) { | ||
if (t < 0) { | ||
throw_with_trace(std::logic_error("Sdust::shiftWindow index variable t < 0")); | ||
} | ||
if (w.size() >= WINDOW_SIZE - 2) { | ||
auto s = w.front(); | ||
w.pop_front(); | ||
removeTripletInfo(rW, cW, s); | ||
if (L > static_cast<int>(w.size())) { | ||
--L; | ||
removeTripletInfo(rV, cV, s); | ||
} | ||
} | ||
w.push_back(t); | ||
++L; | ||
addTripletInfo(rW, cW, t); | ||
addTripletInfo(rV, cV, t); | ||
if ((cV[static_cast<unsigned int>(t)] * 10) > (SCORE_THRESHOLD * 2)) { | ||
int s{0}; | ||
do { | ||
int idx = static_cast<int>(w.size()) - static_cast<int>(L); | ||
if (idx < 0) { | ||
throw_with_trace(std::logic_error("Sdust::shiftWindow index (w.size() - L) < 0")); | ||
} | ||
s = w[static_cast<unsigned int>(idx)]; | ||
removeTripletInfo(rV, cV, s); | ||
--L; | ||
} while (s != t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shiftWindow
function has been refined for better window manipulation and error handling. The checks for negative indices and the loop for adjusting the window based on the score threshold are important for ensuring correctness. Consider adding more comments to explain the logic, especially for the loop that adjusts the window size.
+ // Adjust the window size based on the score threshold
do {
int idx = static_cast<int>(w.size()) - static_cast<int>(L);
// Error handling for negative index
if (idx < 0) {
throw_with_trace(std::logic_error("Sdust::shiftWindow index (w.size() - L) < 0"));
}
s = w[static_cast<unsigned int>(idx)];
removeTripletInfo(rV, cV, s);
--L;
} while (s != t);
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
void | |
Sdust::shiftWindow(int t) { | |
if (t < 0) { | |
throw_with_trace(std::logic_error("Sdust::shiftWindow index variable t < 0")); | |
} | |
if (w.size() >= WINDOW_SIZE - 2) { | |
auto s = w.front(); | |
w.pop_front(); | |
removeTripletInfo(rW, cW, s); | |
if (L > static_cast<int>(w.size())) { | |
--L; | |
removeTripletInfo(rV, cV, s); | |
} | |
} | |
w.push_back(t); | |
++L; | |
addTripletInfo(rW, cW, t); | |
addTripletInfo(rV, cV, t); | |
if ((cV[static_cast<unsigned int>(t)] * 10) > (SCORE_THRESHOLD * 2)) { | |
int s{0}; | |
do { | |
int idx = static_cast<int>(w.size()) - static_cast<int>(L); | |
if (idx < 0) { | |
throw_with_trace(std::logic_error("Sdust::shiftWindow index (w.size() - L) < 0")); | |
} | |
s = w[static_cast<unsigned int>(idx)]; | |
removeTripletInfo(rV, cV, s); | |
--L; | |
} while (s != t); | |
void | |
Sdust::shiftWindow(int t) { | |
if (t < 0) { | |
throw_with_trace(std::logic_error("Sdust::shiftWindow index variable t < 0")); | |
} | |
if (w.size() >= WINDOW_SIZE - 2) { | |
auto s = w.front(); | |
w.pop_front(); | |
removeTripletInfo(rW, cW, s); | |
if (L > static_cast<int>(w.size())) { | |
--L; | |
removeTripletInfo(rV, cV, s); | |
} | |
} | |
w.push_back(t); | |
++L; | |
addTripletInfo(rW, cW, t); | |
addTripletInfo(rV, cV, t); | |
if ((cV[static_cast<unsigned int>(t)] * 10) > (SCORE_THRESHOLD * 2)) { | |
int s{0}; | |
// Adjust the window size based on the score threshold | |
do { | |
int idx = static_cast<int>(w.size()) - static_cast<int>(L); | |
// Error handling for negative index | |
if (idx < 0) { | |
throw_with_trace(std::logic_error("Sdust::shiftWindow index (w.size() - L) < 0")); | |
} | |
s = w[static_cast<unsigned int>(idx)]; | |
removeTripletInfo(rV, cV, s); | |
--L; | |
} while (s != t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 12
Configuration used: CodeRabbit UI
Files selected for processing (30)
- include/Alignment.h (2 hunks)
- include/AnnotationProcessor.h (2 hunks)
- include/Breakpoint.h (1 hunks)
- include/BreakpointReduced.h (2 hunks)
- include/ChosenBp.h (1 hunks)
- include/DeFuzzier.h (1 hunks)
- include/GermlineMatch.h (1 hunks)
- include/MasterRefProcessor.h (2 hunks)
- include/MrefEntry.h (1 hunks)
- include/MrefEntryAnno.h (1 hunks)
- include/MrefMatch.h (1 hunks)
- include/SamSegmentMapper.h (2 hunks)
- include/Sdust.h (1 hunks)
- include/SuppAlignment.h (1 hunks)
- include/SuppAlignmentAnno.h (2 hunks)
- include/SvEvent.h (2 hunks)
- src/Alignment.cpp (1 hunks)
- src/AnnotationProcessor.cpp (3 hunks)
- src/BreakpointReduced.cpp (1 hunks)
- src/ChosenBp.cpp (1 hunks)
- src/DeFuzzier.cpp (1 hunks)
- src/GermlineMatch.cpp (1 hunks)
- src/GlobalAppConfig.cpp (1 hunks)
- src/MasterRefProcessor.cpp (1 hunks)
- src/MrefEntry.cpp (1 hunks)
- src/MrefEntryAnno.cpp (1 hunks)
- src/SamSegmentMapper.cpp (1 hunks)
- src/Sdust.cpp (1 hunks)
- src/SuppAlignment.cpp (1 hunks)
- src/SuppAlignmentAnno.cpp (1 hunks)
Files not summarized due to errors (2)
- src/Alignment.cpp: Error: Message exceeds token limit
- src/AnnotationProcessor.cpp: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (9)
- include/BreakpointReduced.h
- include/ChosenBp.h
- include/MasterRefProcessor.h
- include/MrefEntryAnno.h
- include/MrefMatch.h
- include/SamSegmentMapper.h
- src/DeFuzzier.cpp
- src/MrefEntry.cpp
- src/Sdust.cpp
Additional comments: 41
src/GermlineMatch.cpp (1)
- 29-37: The removal of
using namespace std;
and explicit use ofstd::
forvector
andpair
in theGermlineMatch
constructor is a good practice for namespace clarity and avoiding potential name conflicts. Well done on this improvement.include/GermlineMatch.h (1)
- 30-54: The explicit use of
std::
namespace forvector
andpair
in theGermlineMatch
class is a positive change, enhancing code clarity and avoiding potential conflicts with names in the global namespace. This aligns with best practices for C++ development.src/GlobalAppConfig.cpp (2)
- 41-41: The
init
method has been correctly marked asstatic
following the singleton pattern, allowing it to be called without an instance ofGlobalAppConfig
. This change aligns with the singleton design pattern best practices.- 52-52: The
getInstance
method has been correctly marked asstatic
, consistent with the singleton pattern. This ensures that it can be called without an instance ofGlobalAppConfig
, facilitating global access to the singleton instance.src/ChosenBp.cpp (2)
- 30-30: Adding the
BP_SUPPORT_THRESHOLD
static member to theChosenBp
class is a good practice for defining constants. However, ensure that this member is initialized with a meaningful value somewhere in the codebase to avoid undefined behavior.Verification successful
The initialization of
ChosenBp::BP_SUPPORT_THRESHOLD
insrc/sophia.cpp
addresses the concern raised in the review comment about ensuring the member is initialized with a meaningful value. This confirms that the code change adheres to good practices regarding the initialization of static members.* 32-58: The refactoring of the `addChildNode` and `addSupplementaryAlignments` methods improves the clarity and functionality of the `ChosenBp` class. Specifically, the handling of supplementary alignments in `addSupplementaryAlignments` is more robust and clear. Good job on these improvements.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for initialization of BP_SUPPORT_THRESHOLD in the codebase. rg "ChosenBp::BP_SUPPORT_THRESHOLD" --type cppLength of output: 172
include/Sdust.h (2)
- 49-51: Renaming constants
SCORETHRESHOLD
toSCORE_THRESHOLD
andWINDOWSIZE
toWINDOW_SIZE
improves readability and adheres to the common naming conventions in C++. This is a positive change.- 92-102: The modification in the comparison logic of
operator<
in thePerfectInterval
struct makes the comparison more intuitive and correct. This change ensures that intervals are ordered first by their start index and then by their end index, which is a logical approach for interval sorting.include/DeFuzzier.h (2)
- 42-43: Updating the constructor parameters to use
ChrSize
andbool
types enhances type safety and clarity. This change ensures that the parameters are more descriptive and align with the intended use within the class.- 77-79: Renaming constants
MAXDISTANCE
andMREFMODE
toMAX_DISTANCE
andMREF_MODE
respectively improves readability and adheres to common naming conventions. This is a good practice for constant naming.src/MrefEntryAnno.cpp (1)
- 37-89: The modifications in the
MrefEntryAnno
constructor, including the use ofChrSize
for type safety and the addition of exception handling, significantly improve the robustness and clarity of the code. The renaming of constants and the header file also align with best practices. Good job on these changes.include/MrefEntry.h (1)
- 44-109: The comprehensive changes in the
MrefEntry
class, including the addition of aValidityScore
type alias, renaming of static members for clarity, and adjustments to member functions for improved error handling and functionality, significantly enhance the class's design and usability. These modifications align with best practices for C++ development and improve the maintainability of the code.include/AnnotationProcessor.h (1)
- 43-142: The
AnnotationProcessor
class has undergone significant changes, including updates to data types, renaming of variables, and adjustments to function signatures. These modifications enhance code clarity and standardize naming conventions. However, specific attention should be given to the following areas for potential improvements or considerations:
Data Types and Naming Conventions: The changes in data types and naming conventions are positive, contributing to improved code readability and maintainability. Ensure that these changes are consistently applied across the entire codebase to avoid discrepancies.
Function Signatures: The adjustments to function signatures, such as parameter types and names, are beneficial. It's crucial to verify that all calls to these functions throughout the codebase have been updated accordingly to match the new signatures.
Member Variables: The renaming and type changes of member variables enhance understanding of their purpose and usage. Double-check that all accesses to these variables reflect the updated names and types.
Overall Structure: The class structure appears well-organized, with a clear separation of public and private sections. This organization aids in understanding the class's responsibilities and how it interacts with other components.
In summary, the changes to the
AnnotationProcessor
class are commendable, contributing to a more robust and understandable codebase. It's essential to ensure that these changes are consistently applied and integrated throughout the project to fully realize their benefits.include/SuppAlignmentAnno.h (1)
- 35-198: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [38-215]
The modifications in the
SuppAlignmentAnno
class, including the addition of#include "global.h"
, changingint
types toChrSize
andChrIndex
, renamingDEFAULTREADLENGTH
toDEFAULT_READ_LENGTH
, adding documentation comments, and updating function signatures and implementations, are significant improvements. These changes enhance the code's readability, maintainability, and adherence to best practices. Key points to consider:
Inclusion of
global.h
: This inclusion likely provides necessary type definitions (e.g.,ChrSize
andChrIndex
) and constants used within the class. Ensure that this header file's inclusion does not introduce any circular dependencies.Data Type Changes: The transition from
int
to more descriptive types likeChrSize
andChrIndex
is a commendable practice. It improves type safety and code readability by making the data's purpose more explicit.Renaming of Constants: Renaming
DEFAULTREADLENGTH
toDEFAULT_READ_LENGTH
improves readability by adhering to naming conventions. Verify that all references to this constant throughout the codebase have been updated accordingly.Documentation Comments: The addition of documentation comments is beneficial for understanding the class's purpose and usage. Ensure that these comments are accurate and kept up-to-date with any future changes to the class.
Function Signature and Implementation Updates: The updates to function signatures and implementations, such as
extendSuppAlignment
andsaCloseness
, are likely to improve the class's functionality and flexibility. Review these changes carefully to ensure they meet the intended design goals and do not introduce unintended side effects.Overall, the changes to the
SuppAlignmentAnno
class represent a significant improvement in code quality. It's essential to ensure that these changes are thoroughly tested and integrated with the rest of the codebase to maximize their benefits.include/Alignment.h (1)
- 41-215: The
Alignment
class has seen substantial changes, including variable renaming, data type changes, the addition of new member functions, and updates to static variables. These modifications contribute to the class's clarity, functionality, and adherence to best practices. Key areas to focus on include:
Variable Renaming and Data Type Changes: The renaming of variables and updates to their data types enhance the code's readability and maintainability. Ensure consistency in these changes across the entire codebase.
Addition of New Member Functions: The introduction of new member functions likely adds new capabilities or improves existing functionalities. Review these additions for correctness and ensure they align with the class's responsibilities.
Updates to Static Variables: Changes to static variables, such as naming conventions and values, should be carefully evaluated to ensure they do not adversely affect the class's behavior or its interactions with other components.
Overall Class Structure: The class appears well-organized, with a clear distinction between public and private sections. This organization facilitates understanding the class's role within the larger codebase.
In summary, the updates to the
Alignment
class are commendable, contributing to a more robust, readable, and maintainable codebase. It's crucial to ensure that these changes are consistently applied and thoroughly tested to fully realize their benefits.src/BreakpointReduced.cpp (1)
- 35-217: The changes in
BreakpointReduced.cpp
include updating variable names, adding a new include, modifying class member declarations, adjusting function signatures, and updating method implementations. These modifications aim to enhance clarity, adhere to naming conventions, and improve the class's functionality. Considerations for further improvement or verification include:
Inclusion of
GlobalAppConfig.h
: The addition of this include suggests a dependency on global configuration settings. Verify that this dependency is managed appropriately and does not introduce tight coupling between the class and global configuration.Variable Name Updates and Data Type Changes: The renaming of variables and changes to their data types improve readability and expressiveness. Ensure that these changes are consistently applied throughout the codebase.
Function Signature Adjustments: The modifications to function signatures, such as parameter types and names, enhance the class's interface. Confirm that all calls to these functions have been updated to match the new signatures.
Method Implementations: The updates to method implementations, such as
complexRearrangementMateRatioRescue
andprintOverhang
, likely improve the class's functionality. Review these changes carefully to ensure they meet the intended design goals and do not introduce unintended side effects.Overall, the changes to the
BreakpointReduced
class represent a significant improvement in code quality. It's essential to ensure that these changes are thoroughly tested and integrated with the rest of the codebase to maximize their benefits.include/SvEvent.h (1)
- 43-314: The changes in the
SvEvent.h
file, including renaming enum values and variables related to artifact and clonality statuses, adjusting static variables and constants naming conventions, adding new member variables and methods to theSvEvent
class, and modifying method signatures and logic, significantly enhance the class's functionality and clarity. Key points to consider:
Enum and Variable Renaming: The renaming of enum values and variables improves the code's readability and makes the artifact and clonality statuses more understandable. Ensure that these changes are consistently reflected in the codebase where these enums and variables are used.
Static Variables and Constants: The adjustments to naming conventions and values of static variables and constants are positive. Verify that these changes do not adversely affect the class's behavior or its interactions with other components.
Addition of New Member Variables and Methods: The introduction of new member variables and methods likely adds new capabilities or improves existing functionalities. Review these additions for correctness and ensure they align with the class's responsibilities.
Method Signature and Logic Modifications: Changes to method signatures and logic, such as
assessOverhangQualityCompensation
anddetermineEventTypeAndSize
, are likely to improve the class's functionality. Review these changes carefully to ensure they meet the intended design goals and do not introduce unintended side effects.In summary, the updates to the
SvEvent
class are commendable, contributing to a more robust, readable, and maintainable codebase. It's crucial to ensure that these changes are consistently applied and thoroughly tested to fully realize their benefits.include/SuppAlignment.h (11)
- 28-30: Including
"global.h"
,"ChrConverter.h"
, and"GlobalAppConfig.h"
at the beginning of the file ensures that the necessary dependencies are available for theSuppAlignment
class. This is a good practice, especially when the class relies on types or utilities defined in these headers.- 50-61: The
create
static method provides a clear and concise way to instantiateSuppAlignment
objects with all necessary parameters. This approach enhances readability and maintainability by encapsulating the construction logic within a single method. However, it's important to ensure that all parameters are validated before use to prevent invalid states.Consider adding parameter validation within the
create
method to ensure that all inputs meet expected criteria, such as non-negative values for indices and sizes.
- 65-65: The
parseSaSupport
static method is designed to parse supplementary alignment information from the breakpoints BED format. This method's presence indicates an effort to separate parsing logic from the main class functionality, which is a good practice for maintainability and testability.- 71-81: The
parseSamSaTag
method is another example of separating parsing logic, this time for SAM format SA:Z: tags. This method's detailed parameter list reflects the complexity of parsing such tags. It's crucial to ensure that the method handles edge cases and invalid inputs gracefully to avoid unexpected behavior.Consider adding comprehensive error handling and input validation to the
parseSamSaTag
method to ensure robustness, especially given the complexity of SAM format parsing.
- 94-96: The
extendSuppAlignment
method allows for extending the position range of a supplementary alignment. This functionality is essential for handling alignments that span multiple regions. The use ofstd::min
andstd::max
ensures that the extended position range is correctly calculated.- 99-102: The
saCloseness
andsaDistHomologyRescueCloseness
methods are designed to determine the closeness of supplementary alignments, potentially with adjustments for homology rescue. These methods likely play a critical role in alignment processing. It's important to ensure that the logic for calculating closeness is accurate and efficient.Verify the correctness and efficiency of the closeness calculation logic in
saCloseness
andsaDistHomologyRescueCloseness
methods, especially in the context of homology rescue scenarios.
- 106-112: The
removeFuzziness
method directly modifies the position and extended position based on anotherSuppAlignment
object, and potentially marks the alignment as distant. This method's functionality suggests it's used to refine alignment data. Care should be taken to ensure that this method is called in appropriate contexts to avoid unintended modifications to alignment data.Ensure that the
removeFuzziness
method is used judiciously within the codebase to prevent unintended side effects on alignment data.
- 135-150: The methods for adding supporting indices (
addSupportingIndices
andaddSecondarySupportIndices
) and the methodfinalizeSupportingIndices
(which is declared but not defined here) suggest that the class manages a complex structure of supporting alignment indices. It's important for these methods to maintain the integrity of the supporting indices data structure, especially when merging or finalizing data.Verify the implementation of
finalizeSupportingIndices
to ensure it correctly finalizes the supporting indices data structure, considering the complexity introduced by the methods for adding supporting indices.
- 196-208: The
mrefSaTransform
andmrefSaConsensus
methods indicate functionality related to transforming and reaching a consensus on supplementary alignments in the context of a master reference. These methods likely play a crucial role in the alignment processing pipeline. It's essential to ensure that the logic for transforming and reaching consensus is accurate and aligns with the intended use cases.Review the logic and use cases for the
mrefSaTransform
andmrefSaConsensus
methods to ensure they accurately perform transformations and reach consensus on supplementary alignments in the context of a master reference.
- 210-219: The
mergeSa
method is designed to merge supplementary alignment data from anotherSuppAlignment
object. This method's implementation involves comparing and potentially updating support and discordant metrics. Given the complexity of merging alignment data, it's crucial to ensure that this method handles all edge cases correctly and maintains the integrity of the alignment data.Carefully review the
mergeSa
method to ensure it correctly handles all edge cases and maintains the integrity of the supplementary alignment data during the merge process.
- 225-242: Similar to the
mergeSa
method, themergeMrefSa
method merges data from anotherSuppAlignment
object, with additional considerations for semi-suspicious alignments. The complexity of merging data while accounting for semi-suspicious alignments requires careful attention to detail to ensure that the method accurately reflects the intended logic and maintains data integrity.Ensure that the
mergeMrefSa
method accurately merges supplementary alignment data, including considerations for semi-suspicious alignments, and maintains the integrity of the data.src/SamSegmentMapper.cpp (3)
- 33-45: The constructor of
SamSegmentMapper
has been updated to useChrSize
forDISCORDANT_LEFT_RANGE
andDISCORDANT_RIGHT_RANGE
. This change improves type safety and clarity by ensuring that these variables are explicitly sized according to their usage in the context of chromosome sizes. However, consider documenting the rationale behind the multipliers3
and2.51
used for calculating these ranges, as magic numbers can reduce code readability.- 49-73: In the
parseSamStream
method, the logic to skip certain alignments based onchrConverter
conditions has been introduced. This change enhances the method's functionality by allowing it to filter out alignments that do not meet specific criteria. Ensure that this new filtering logic aligns with the overall objectives of theSamSegmentMapper
class and does not inadvertently exclude relevant data. Additionally, consider adding unit tests to verify the correct behavior of this updated method under various scenarios.- 77-92: The
switchChromosome
method now includes additional logic for proper pair compensation mode and clearing pools. This enhancement is crucial for handling transitions between chromosomes correctly. However, ensure that the clearing of pools (discordantAlignmentsPool
,discordantAlignmentCandidatesPool
, anddiscordantLowQualAlignmentsPool
) does not lead to the loss of important data that might still be relevant for the analysis of subsequent chromosomes.src/Alignment.cpp (5)
- 39-47: Static member variables
LOW_QUAL_CLIP_THRESHOLD
,BASE_QUALITY_THRESHOLD
,BASE_QUALITY_THRESHOLD_LOW
,CLIPPED_NUCLEOTIDE_COUNT_THRESHOLD
,INDEL_NUCLEOTIDE_COUNT_THRESHOLD
, andISIZEMAX
are correctly defined outside the class definition. This is the proper way to initialize static member variables in C++.- 48-88: The constructor of the
Alignment
class correctly initializes member variables and parses the chromosome index using theChrConverter
from theGlobalAppConfig
singleton. This centralization ofChrConverter
usage aligns with the PR objectives to make the class more flexible and reduce widespread injection. Error handling through exceptions is also appropriately implemented.- 222-235: The
mappingQualityCheck
method correctly sets thereadType
to 7 if the mapping quality is below 13, indicating a low mapping quality. This logic is clear and aligns with the method's purpose. However, ensure that the magic number 13 is well-documented or defined as a constant to improve code readability.Consider defining the magic number 13 as a constant with a descriptive name to improve code readability.
- 243-268: The
isEventCandidate
method correctly identifies if an alignment is a candidate for further processing based on its CIGAR string. The method's logic is sound, and the implementation is straightforward. This method enhances the tool's functionality by enabling more precise handling of alignments.- 395-415: The
qualityCheckCascade
method performs a series of quality checks on the alignment, setting thereadType
based on the results. This method is a good example of decomposing complex logic into smaller, more manageable parts. However, the magic numbers used in the checks (e.g.,LOW_QUAL_CLIP_THRESHOLD
,BASE_QUALITY_THRESHOLD
) should be well-documented or defined as constants if not already done so.src/AnnotationProcessor.cpp (5)
- 28-28: The inclusion of "GlobalAppConfig.h" aligns with the PR objectives to centralize the
ChrConverter
within theGlobalAppConfig
singleton. This is a good practice for managing global configurations and shared resources.- 40-40: The addition of the static member
ABRIDGED_OUTPUT
introduces a useful configuration option for controlling output verbosity. It's recommended to document its purpose and usage to improve code maintainability.- 42-121: The modifications to the
AnnotationProcessor
constructor to utilize theGlobalAppConfig
singleton for accessing theChrConverter
align with the PR objectives of centralizing configuration management. Ensure that the use of the singleton pattern does not introduce unnecessary coupling or hinder testability.- 37-772: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [78-1122]
The modifications and additions to the logic for handling overhangs, searching matches, and filtering results are significant and touch on critical functionality. It's recommended to implement a thorough testing strategy to ensure these changes do not introduce regressions or unintended behavior.
- 1225-1274: The modifications to the
printUnresolvedRareOverhangs
method contribute to the tool's functionality by handling rare overhangs more effectively. Ensure that the output generated by this method is clear and informative for the end-users.
double SuppAlignmentAnno::ISIZEMAX{}; | ||
|
||
ChrSize SuppAlignmentAnno::DEFAULT_READ_LENGTH{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static member variables ISIZEMAX
and DEFAULT_READ_LENGTH
are declared but not initialized with any meaningful values. This could lead to unexpected behavior if these values are used before being explicitly set. Consider initializing these variables with default values or ensuring they are set before use.
const ChrConverter &chrConverter = GlobalAppConfig::getInstance().getChrConverter(); | ||
|
||
// Parse chromosome name. The chromosome name will be separated from the position information | ||
// by a colon ':' character, but as the chromosome name itself may also contain colons, we | ||
// need to anchor first character after the position which is either a `|` or a `(`, and then | ||
// track back to the *last* colon. | ||
try { | ||
chrIndex = chrConverter.parseChrAndReturnIndex( | ||
next(saStrIn.cbegin(), index), | ||
saStrIn.cend(), | ||
':', | ||
STOP_CHARS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of GlobalAppConfig::getInstance().getChrConverter()
to obtain a ChrConverter
instance is repeated multiple times throughout the file. This could be optimized by storing the reference in a local variable at the beginning of the function to reduce redundant calls.
+ const ChrConverter &chrConverter = GlobalAppConfig::getInstance().getChrConverter();
Then replace subsequent calls with chrConverter
.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
const ChrConverter &chrConverter = GlobalAppConfig::getInstance().getChrConverter(); | |
// Parse chromosome name. The chromosome name will be separated from the position information | |
// by a colon ':' character, but as the chromosome name itself may also contain colons, we | |
// need to anchor first character after the position which is either a `|` or a `(`, and then | |
// track back to the *last* colon. | |
try { | |
chrIndex = chrConverter.parseChrAndReturnIndex( | |
next(saStrIn.cbegin(), index), | |
saStrIn.cend(), | |
':', | |
STOP_CHARS); | |
const ChrConverter &chrConverter = GlobalAppConfig::getInstance().getChrConverter(); | |
// Parse chromosome name. The chromosome name will be separated from the position information | |
// by a colon ':' character, but as the chromosome name itself may also contain colons, we | |
// need to anchor first character after the position which is either a `|` or a `(`, and then | |
// track back to the *last* colon. | |
try { | |
chrIndex = chrConverter.parseChrAndReturnIndex( | |
next(saStrIn.cbegin(), index), | |
saStrIn.cend(), | |
':', | |
STOP_CHARS); |
strictFuzzyCandidate{true}, | ||
distant{saAnnoIn.isDistant()}, | ||
suspicious{saAnnoIn.isSuspicious()}, | ||
semiSuspicious{saAnnoIn.isSemiSuspicious()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assignment of defaultReadLengthControl
uses the value from defaultreadlengthtumor
instead of defaultreadlengthcontrol
. This seems to be a copy-paste error and should be corrected to ensure the control read length is correctly assigned.
- defaultReadLengthControl = inputVariables["defaultreadlengthtumor"].as<unsigned int>();
+ defaultReadLengthControl = inputVariables["defaultreadlengthcontrol"].as<unsigned int>();
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
semiSuspicious{saAnnoIn.isSemiSuspicious()}, | |
semiSuspicious{saAnnoIn.isSemiSuspicious()}, |
// Default constructor. | ||
SuppAlignment::SuppAlignment() : | ||
matchFuzziness { 5 * DEFAULT_READ_LENGTH }, | ||
chrIndex { 0 }, | ||
pos { 0 }, | ||
extendedPos { 0 }, | ||
mapq { 0 }, | ||
supportingIndices { }, | ||
supportingIndicesSecondary { }, | ||
distinctReads { 1 }, | ||
support { 0 }, | ||
secondarySupport { 0 }, | ||
mateSupport { 0 }, | ||
expectedDiscordants { 0 }, | ||
encounteredM { false }, | ||
toRemove { false }, | ||
inverted { false }, | ||
fuzzy { false }, | ||
strictFuzzy { false }, | ||
distant { false }, | ||
lowMapqSource { false }, | ||
nullMapqSource { false }, | ||
suspicious { false }, | ||
semiSuspicious { false }, | ||
properPairErrorProne { false }, | ||
primary { true } {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default constructor initializes several member variables but does not initialize matchFuzziness
with a value dependent on DEFAULT_READ_LENGTH
. This could lead to uninitialized or unexpected behavior if DEFAULT_READ_LENGTH
changes. Ensure that matchFuzziness
is correctly initialized, especially if DEFAULT_READ_LENGTH
is expected to change or be set dynamically.
SamSegmentMapper::printBps(ChrSize alignmentStart) { | ||
for (auto &bp : breakpointsCurrent) { | ||
if (!bp.second.isCovFinalized() && ((bp.first) + 1 < alignmentStart)) { | ||
unsigned long posDiff = static_cast<unsigned long>(bp.first - minPos); | ||
if (bp.first != minPos) { | ||
bp.second.setLeftCoverage( | ||
coverageProfiles[static_cast<unsigned long>(bp.first - minPos - 1)].getCoverage()); | ||
} else { | ||
bp.second.setLeftCoverage(0); | ||
} | ||
bp.second.setRightCoverage( | ||
coverageProfiles[posDiff].getCoverage()); | ||
bp.second.setNormalSpans( | ||
coverageProfiles[posDiff].getNormalSpans()); | ||
bp.second.setLowQualSpansSoft( | ||
coverageProfiles[posDiff].getLowQualSpansSoft()); | ||
bp.second.setLowQualSpansHard( | ||
coverageProfiles[posDiff].getLowQualSpansHard()); | ||
bp.second.setUnpairedBreaksSoft( | ||
coverageProfiles[posDiff].getNormalBpsSoft()); | ||
bp.second.setUnpairedBreaksHard( | ||
coverageProfiles[posDiff].getNormalBpsHard()); | ||
bp.second.setBreaksShortIndel( | ||
coverageProfiles[posDiff].getNormalBpsShortIndel()); | ||
bp.second.setLowQualBreaksSoft( | ||
coverageProfiles[posDiff].getLowQualBpsSoft()); | ||
bp.second.setLowQualBreaksHard( | ||
coverageProfiles[posDiff].getLowQualBpsHard()); | ||
bp.second.setCovFinalized(true); | ||
} | ||
bp.second.setRightCoverage(coverageProfiles[posDiff].getCoverage()); | ||
bp.second.setNormalSpans( | ||
coverageProfiles[posDiff].getNormalSpans()); | ||
bp.second.setLowQualSpansSoft( | ||
coverageProfiles[posDiff].getLowQualSpansSoft()); | ||
bp.second.setLowQualSpansHard( | ||
coverageProfiles[posDiff].getLowQualSpansHard()); | ||
bp.second.setUnpairedBreaksSoft( | ||
coverageProfiles[posDiff].getNormalBpsSoft()); | ||
bp.second.setUnpairedBreaksHard( | ||
coverageProfiles[posDiff].getNormalBpsHard()); | ||
bp.second.setBreaksShortIndel( | ||
coverageProfiles[posDiff].getNormalBpsShortIndel()); | ||
bp.second.setLowQualBreaksSoft( | ||
coverageProfiles[posDiff].getLowQualBpsSoft()); | ||
bp.second.setLowQualBreaksHard( | ||
coverageProfiles[posDiff].getLowQualBpsHard()); | ||
bp.second.setCovFinalized(true); | ||
} | ||
} | ||
if (minPos != -1) { | ||
while (minPos + 2 + DISCORDANTLEFTRANGE < alignmentStart) { | ||
if (minPos != maxPos) { | ||
coverageProfiles.pop_front(); | ||
++minPos; | ||
if (minPos != std::numeric_limits<ChrSize>::max()) { | ||
while (minPos + 2 + DISCORDANT_LEFT_RANGE < alignmentStart) { | ||
if (minPos != maxPos) { | ||
coverageProfiles.pop_front(); | ||
++minPos; | ||
} else { | ||
coverageProfiles.clear(); | ||
minPos = std::numeric_limits<ChrSize>::max(); | ||
maxPos = std::numeric_limits<ChrSize>::min(); | ||
break; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The printBps
method has been updated to use ChrSize
for alignment positions and includes additional coverage calculations. This change enhances the method's accuracy and consistency with the rest of the codebase. However, the method appears to be doing more than just printing breakpoints (e.g., updating coverage information), which might violate the Single Responsibility Principle (SRP). Consider refactoring this method to separate concerns more clearly, potentially splitting it into smaller, more focused methods.
} | ||
if (rescueCandidate) { | ||
// eventCandidate, lowMapqCheckFailed, !supplementary, rescueCandidate | ||
readType = 1; | ||
} | ||
qualChecked = true; | ||
} | ||
if (rescueCandidate) { | ||
readType = 1; | ||
} | ||
qualChecked = true; | ||
} | ||
if (readType < 5) { | ||
// Note that here readType is used as ordinal. It might be a score, ...? | ||
qualityCheckCascade(); | ||
} | ||
} else if (readType == 7) { | ||
// !eventCandidate, mapq != 0 && mapq < 13 | ||
readType = 5; | ||
} | ||
if (readType < 5) { | ||
qualityCheckCascade(); | ||
} | ||
} else if (readType == 7) { | ||
readType = 5; | ||
} | ||
switch (readType) { | ||
case 0: | ||
case 3: | ||
case 5: | ||
assessOutlierMateDistance(); | ||
if (distantMate == 1 && readType != 5) { | ||
readType = 4; | ||
} | ||
break; | ||
default: | ||
break; | ||
} | ||
for (auto mpos_cit = samLine.cbegin() + 1 + samChunkPositions[6]; | ||
mpos_cit != samLine.cbegin() + samChunkPositions[7]; ++mpos_cit) { | ||
matePos = matePos * 10 + (*mpos_cit - '0'); | ||
} | ||
if (samLine[1 + samChunkPositions[5]] == '=') { | ||
mateChrIndex = chrIndex; | ||
} else { | ||
mateChrIndex = ChrConverter::readChromosomeIndex( | ||
next(samLine.cbegin(), 1 + samChunkPositions[5]), '\t'); | ||
} | ||
} | ||
|
||
void | ||
Alignment::mappingQualityCheck() { | ||
if (samLine[1 + samChunkPositions[3]] != | ||
'0') { // mapq 0 is treated as a special case, where number of SAs and | ||
// base qualities will be the sole determinants of read quality | ||
nullMapq = false; | ||
switch (samChunkPositions[4] - samChunkPositions[3]) { | ||
case 2: // this checks if the mapq is a single-digit number | ||
readType = 7; | ||
lowMapq = true; | ||
break; | ||
|
||
switch (readType) { | ||
case 0: | ||
case 3: | ||
if (samLine[1 + samChunkPositions[3]] == '1' && | ||
((samLine[2 + samChunkPositions[3]] - '0') < 3)) { | ||
// this checks if the mapq is a two-digit number, and if the | ||
// first digit is a "1", that the second digit is less than 3 | ||
readType = 7; | ||
lowMapq = true; | ||
case 5: | ||
assessOutlierMateDistance(); | ||
if (distantMate == 1 && readType != 5) { | ||
readType = 4; | ||
} | ||
break; | ||
default: | ||
break; | ||
} | ||
|
||
for (auto mpos_cit = samLine.cbegin() + 1 + static_cast<long>(samTabPositions[6]); | ||
mpos_cit != samLine.cbegin() + static_cast<long>(samTabPositions[7]); ++mpos_cit) { | ||
matePos = matePos * 10 + ChrSize(*mpos_cit - '0'); | ||
} | ||
if (samLine[1 + samTabPositions[5]] == '=') { | ||
mateChrIndex = chrIndex; | ||
} else { | ||
try { | ||
mateChrIndex = GlobalAppConfig::getInstance().getChrConverter(). | ||
parseChrAndReturnIndex( | ||
next(samLine.cbegin(), 1 + static_cast<long>(samTabPositions[5])), | ||
samLine.cend(), | ||
'\t'); | ||
} catch (const DomainError &e) { | ||
throw e << error_info_string( | ||
"from = " + std::string(next(samLine.cbegin(), | ||
1 + static_cast<long>(samTabPositions[5])), | ||
samLine.cend())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The continueConstruction
method performs additional setup after the initial construction of an Alignment
object. It includes parsing the SAM line for various alignment properties and setting the readType
based on certain conditions. The method is complex and handles multiple aspects of the alignment. Consider breaking down this method into smaller, more focused methods to improve readability and maintainability. For example, parsing flags, calculating positions, and handling supplementary alignments could be extracted into separate methods.
Consider further refactoring by extracting logical blocks of code into separate methods to improve readability and maintainability.
void | ||
Alignment::createCigarChunks() { | ||
auto encounteredM = false; | ||
auto cumulativeNucleotideCount = 0, | ||
currentNucleotideCount = 0, | ||
indelAdjustment = 0, | ||
leftClipAdjustment = 0, | ||
rightClipAdjustment = 0; | ||
for (auto cigarString_cit = samLine.cbegin() + 1 + static_cast<long>(samTabPositions[4]); | ||
cigarString_cit != samLine.cbegin() + static_cast<long>(samTabPositions[5]); | ||
++cigarString_cit) { | ||
if (isdigit(*cigarString_cit)) { | ||
currentNucleotideCount = | ||
currentNucleotideCount * 10 + (*cigarString_cit - '0'); | ||
} else { | ||
switch (*cigarString_cit) { | ||
case 'M': | ||
encounteredM = true; | ||
cumulativeNucleotideCount += currentNucleotideCount; | ||
break; | ||
case 'S': | ||
cigarChunks.emplace_back( | ||
*cigarString_cit, | ||
encounteredM, | ||
cumulativeNucleotideCount + indelAdjustment - leftClipAdjustment, | ||
currentNucleotideCount, | ||
indelAdjustment - leftClipAdjustment); | ||
cumulativeNucleotideCount += currentNucleotideCount; | ||
if (!encounteredM) { | ||
leftClipAdjustment = currentNucleotideCount; | ||
} else { | ||
rightClipAdjustment = currentNucleotideCount; | ||
} | ||
break; | ||
case 'H': | ||
cigarChunks.emplace_back( | ||
*cigarString_cit, encounteredM, | ||
cumulativeNucleotideCount + indelAdjustment - leftClipAdjustment, | ||
currentNucleotideCount); | ||
break; | ||
case 'I': | ||
cigarChunks.emplace_back( | ||
*cigarString_cit, | ||
encounteredM, | ||
cumulativeNucleotideCount + indelAdjustment - leftClipAdjustment, | ||
currentNucleotideCount); | ||
cumulativeNucleotideCount += currentNucleotideCount; | ||
indelAdjustment -= currentNucleotideCount; | ||
break; | ||
case 'D': | ||
cigarChunks.emplace_back( | ||
*cigarString_cit, | ||
encounteredM, | ||
cumulativeNucleotideCount + indelAdjustment - leftClipAdjustment, | ||
currentNucleotideCount); | ||
indelAdjustment += currentNucleotideCount; | ||
break; | ||
default: | ||
break; | ||
} | ||
currentNucleotideCount = 0; | ||
} | ||
} | ||
endPos += ChrSize(indelAdjustment - leftClipAdjustment - rightClipAdjustment); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The createCigarChunks
method parses the CIGAR string and creates chunks representing different types of operations. This method is crucial for the tool's functionality, as it enables detailed analysis of alignments. The implementation is logical and follows the expected steps for parsing CIGAR strings. Consider adding unit tests covering various CIGAR strings to ensure the correctness of this method under different scenarios.
Would you like assistance in creating comprehensive unit tests for the createCigarChunks
method?
void | ||
Alignment::assignBreakpointsAndOverhangs() { | ||
for (const auto &chunk : cigarChunks) { | ||
switch (chunk.chunkType) { | ||
case 'S': | ||
cigarChunks.emplace_back(*cigarString_cit, encounteredM, | ||
cumulativeNucleotideCount + | ||
indelAdjustment - | ||
leftClipAdjustment, | ||
currentNucleotideCount, | ||
indelAdjustment - leftClipAdjustment); | ||
cumulativeNucleotideCount += currentNucleotideCount; | ||
if (!encounteredM) { | ||
leftClipAdjustment = currentNucleotideCount; | ||
readBreakpointTypes.push_back(chunk.chunkType); | ||
readBreakpointSizes.push_back(chunk.length); | ||
readBreakpointsEncounteredM.push_back(chunk.encounteredM); | ||
if (chunk.encounteredM) { | ||
readBreakpoints.push_back(endPos); | ||
readOverhangCoords.emplace_back( | ||
chunk.encounteredM, | ||
endPos, | ||
static_cast<long>(chunk.startPosOnRead) - static_cast<long>(chunk.indelAdjustment), | ||
chunk.length); | ||
} else { | ||
rightClipAdjustment = currentNucleotideCount; | ||
readBreakpoints.push_back(startPos); | ||
readOverhangCoords.emplace_back( | ||
chunk.encounteredM, | ||
startPos, | ||
static_cast<long>(chunk.startPosOnRead) - static_cast<long>(chunk.indelAdjustment), | ||
chunk.length); | ||
} | ||
break; | ||
case 'H': | ||
cigarChunks.emplace_back(*cigarString_cit, encounteredM, | ||
cumulativeNucleotideCount + | ||
indelAdjustment - | ||
leftClipAdjustment, | ||
currentNucleotideCount); | ||
if (chunk.encounteredM) { | ||
readBreakpoints.push_back(endPos); | ||
} else { | ||
readBreakpoints.push_back(startPos); | ||
} | ||
readBreakpointSizes.push_back(chunk.length); | ||
readBreakpointTypes.push_back(chunk.chunkType); | ||
readBreakpointsEncounteredM.push_back(chunk.encounteredM); | ||
break; | ||
case 'I': | ||
cigarChunks.emplace_back(*cigarString_cit, encounteredM, | ||
cumulativeNucleotideCount + | ||
indelAdjustment - | ||
leftClipAdjustment, | ||
currentNucleotideCount); | ||
cumulativeNucleotideCount += currentNucleotideCount; | ||
indelAdjustment -= currentNucleotideCount; | ||
readBreakpoints.push_back(startPos + ChrSize(chunk.startPosOnRead)); | ||
readBreakpointTypes.push_back(chunk.chunkType); | ||
readBreakpointSizes.push_back(chunk.length); | ||
readBreakpointsEncounteredM.push_back(chunk.encounteredM); | ||
break; | ||
case 'D': | ||
cigarChunks.emplace_back(*cigarString_cit, encounteredM, | ||
cumulativeNucleotideCount + | ||
indelAdjustment - | ||
leftClipAdjustment, | ||
currentNucleotideCount); | ||
indelAdjustment += currentNucleotideCount; | ||
readBreakpoints.push_back(startPos + ChrSize(chunk.startPosOnRead)); | ||
readBreakpointTypes.push_back(chunk.chunkType); | ||
readBreakpointSizes.push_back(chunk.length); | ||
readBreakpointsEncounteredM.push_back(chunk.encounteredM); | ||
readBreakpoints.push_back(startPos + | ||
chunk.startPosOnRead + | ||
chunk.length); | ||
readBreakpointSizes.push_back(-1); | ||
readBreakpointTypes.push_back('#'); | ||
readBreakpointsEncounteredM.push_back(chunk.encounteredM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assignBreakpointsAndOverhangs
method correctly interprets the CIGAR chunks to assign breakpoints and overhangs, which are essential for identifying structural variations. The method's logic is sound, and the use of vectors to store related information is appropriate. To further improve code quality, consider encapsulating the logic for handling each CIGAR operation type into separate helper methods or classes, which could enhance modularity and testability.
Consider encapsulating the logic for handling each CIGAR operation type into separate helper methods or classes to enhance modularity and testability.
readType = 1; | ||
return; | ||
} | ||
if (!highQualSa && ((sCounts == 1 && lowQualSacounts == 2) || | ||
(hCounts == 1 && lowQualSacounts == 2) || | ||
(sCounts == 2 && lowQualSacounts == 4))) { | ||
return false; | ||
break; | ||
case 'H': | ||
if (static_cast<ChrSize>(chunk.length) >= CLIPPED_NUCLEOTIDE_COUNT_THRESHOLD) { | ||
hardLongClip = true; | ||
} | ||
block = 0; | ||
mapq = 0; | ||
break; | ||
case 'I': | ||
case 'D': | ||
indelStatus = true; | ||
break; | ||
default: | ||
break; | ||
} | ||
} | ||
} | ||
return true; | ||
} | ||
|
||
double | ||
Alignment::overhangMedianQuality(const CigarChunk &cigarChunk) const { | ||
vector<int> overhangPerBaseQuality{}; | ||
if (!cigarChunk.encounteredM) { | ||
auto startCit = next(samLine.cbegin(), 1 + samChunkPositions[9] + | ||
cigarChunk.startPosOnRead - | ||
cigarChunk.indelAdjustment); | ||
auto endCit = | ||
next(samLine.cbegin(), | ||
1 + samChunkPositions[9] + cigarChunk.startPosOnRead - | ||
cigarChunk.indelAdjustment + cigarChunk.length); | ||
fullMedianQuality(startCit, endCit, overhangPerBaseQuality); | ||
} else { | ||
string::const_reverse_iterator startCrit{ | ||
next(samLine.cbegin(), | ||
1 + samChunkPositions[9] + cigarChunk.startPosOnRead - | ||
cigarChunk.indelAdjustment + cigarChunk.length)}; | ||
string::const_reverse_iterator endCrit{ | ||
next(samLine.cbegin(), 1 + samChunkPositions[9] + | ||
cigarChunk.startPosOnRead - | ||
cigarChunk.indelAdjustment)}; | ||
fullMedianQuality(startCrit, endCrit, overhangPerBaseQuality); | ||
} | ||
if (overhangPerBaseQuality.empty()) { | ||
return -1.0; | ||
} else { | ||
return getMedian(overhangPerBaseQuality.begin(), | ||
overhangPerBaseQuality.end()); | ||
} | ||
} | ||
|
||
void | ||
Alignment::assessReadType() { | ||
/* 0 for non-split, | ||
* 1 for softSplit, | ||
* 2 for hardSplit, | ||
* 3 for indel, | ||
* 4 for distant mate, | ||
* 5 for low quality overhang | ||
* 6 for low quality hardClipped | ||
* (precedence: | ||
* decoy mate> | ||
* low qual > | ||
* soft clips > | ||
* hard clips > | ||
* distant mate > | ||
* indels > | ||
* normal) | ||
*/ | ||
auto hardLongClip = false, indelStatus = false; | ||
for (const auto &chunk : cigarChunks) { | ||
switch (chunk.chunkType) { | ||
case 'S': | ||
if (chunk.length >= CLIPPEDNUCLEOTIDECOUNTTHRESHOLD) { | ||
readType = 1; | ||
return; | ||
} | ||
break; | ||
case 'H': | ||
if (chunk.length >= CLIPPEDNUCLEOTIDECOUNTTHRESHOLD) { | ||
hardLongClip = true; | ||
} | ||
break; | ||
case 'I': | ||
case 'D': | ||
indelStatus = true; | ||
break; | ||
default: | ||
break; | ||
if (hardLongClip) { | ||
readType = 2; | ||
} else if (indelStatus) { | ||
readType = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assessReadType
method categorizes alignments based on the presence of soft clips, hard clips, and indels. This method is crucial for the tool's functionality. The implementation is straightforward and follows logical steps to determine the readType
. To ensure accuracy, consider adding comprehensive unit tests covering various combinations of CIGAR operations and their expected outcomes in terms of readType
.
Would you like assistance in creating comprehensive unit tests for the assessReadType
method?
bool | ||
Alignment::assessOutlierMateDistance() { | ||
switch (distantMate) { | ||
case -1: | ||
return false; | ||
case 1: | ||
return true; | ||
} else { | ||
auto isize_cit = samLine.cbegin() + 1 + samChunkPositions[7]; | ||
if (*isize_cit == '-') { | ||
++isize_cit; | ||
} | ||
auto isize = 0; | ||
for (; isize_cit != samLine.cbegin() + samChunkPositions[8]; | ||
++isize_cit) { | ||
isize = isize * 10 + (*isize_cit - '0'); | ||
} | ||
if (isize > ISIZEMAX) { | ||
default: | ||
if (*(samLine.cbegin() + 1 + static_cast<long>(samTabPositions[5])) != '=') { | ||
distantMate = 1; | ||
return true; | ||
} else { | ||
auto isize_cit = samLine.cbegin() + 1 + static_cast<long>(samTabPositions[7]); | ||
if (*isize_cit == '-') { | ||
++isize_cit; | ||
} | ||
auto isize = 0; | ||
for (; isize_cit != samLine.cbegin() + static_cast<long>(samTabPositions[8]); | ||
++isize_cit) { | ||
isize = isize * 10 + (*isize_cit - '0'); | ||
} | ||
if (isize > ISIZEMAX) { | ||
distantMate = 1; | ||
return true; | ||
} | ||
} | ||
distantMate = -1; | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tip
Codebase Verification
The search for ISIZEMAX
across the codebase revealed multiple occurrences, indicating that ISIZEMAX
is indeed defined and used in various parts of the code. Specifically, it is declared as a static double in several classes (Alignment
, SuppAlignment
, SuppAlignmentAnno
) and is used to determine if a mate distance is considered distant. The value of ISIZEMAX
can be set based on insert size parameters or a default value of 2000.0, as seen in sophia.cpp
. This suggests that there is some level of documentation and rationale behind the choice of ISIZEMAX
values, particularly in how it is calculated based on insert size parameters or set to a default value.
However, the search for unit tests specifically targeting the assessOutlierMateDistance
method did not yield any results. This indicates that there might not be any unit tests directly testing this method, or if there are, they might not be named in a way that directly references the method's name.
Given the findings:
- The threshold for considering a mate distance as an outlier (
ISIZEMAX
) is defined and has some level of documentation in the code, particularly in how it is set based on insert size parameters or a default value. - There appears to be a lack of unit tests specifically targeting the
assessOutlierMateDistance
method, based on the search results.
Considering the original review comment's focus on ensuring that the threshold (ISIZEMAX
) is appropriately set and documented, and the suggestion to add unit tests for the assessOutlierMateDistance
method, the findings partially verify the comment. The threshold is indeed set and documented, but the suggestion to add unit tests remains relevant due to the lack of specific tests found.
Analysis chain
The assessOutlierMateDistance
method assesses whether the mate distance is an outlier, which is important for identifying potential structural variations. The method's logic appears to be correct, but ensure that the threshold for considering a mate distance as an outlier (ISIZEMAX
) is appropriately set and documented. Additionally, consider adding unit tests specifically targeting this method to validate its correctness under various scenarios.
Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the threshold and documentation for considering a mate distance as an outlier in assessOutlierMateDistance.
# This script is a placeholder for manual review and testing.
echo "Review threshold and documentation for outlier mate distance in assessOutlierMateDistance method."
Length of output: 204
Script:
#!/bin/bash
# Search for the definition and documentation of ISIZEMAX
rg "ISIZEMAX" --context 5
# Look for unit tests related to the assessOutlierMateDistance method
rg "assessOutlierMateDistance" --files-with-matches tests/
Length of output: 7670
Commit e65cfbf produces the same Next tests:
After that we have established, that all client code of the Then we can check After that we can use "hg38", which is also implemented with the |
--assemblyname
option, defaulting to SOPHIA 35.0.0's chromosome converter implementation "classic_hg37" when omitted.make
as build systemRelease_*
directories with old build-scripts removedmake static=true boost_lib_dir=/path/to/boost/lib
make develop=true
sophiaMref
(no build documentation before)testRunner
for running unit testsREADME.md
file that is worth its name and contains first documentation about the usage of the SOPHIA binaries and input and output files..editorconfig
file, andclang-format
configurationsophiaAnnotate
the default value for clonalitylofreq was advertised in the usage information as 10, but the actual value was 5. Now, the previously used values (5) is also advertised as default.sophiaMref
by avoiding repeated vector-reallocation up to target size 304 GB.Closes #4, #5, #6, #13
Summary by CodeRabbit