Major refactoring to have a map-based hg38 index/converter.
* Added Makefile to replace build scripts.

Signed-off-by: Philip R. Kensche <p.kensche@dkfz.de>
vinjana committed Dec 15, 2023
1 parent f4f4dce commit cbcf2dd
Showing 61 changed files with 4,706 additions and 1,125 deletions.
811 changes: 811 additions & 0 deletions GRCh38.contigInformation.selection.tsv

Large diffs are not rendered by default.

3,368 changes: 3,368 additions & 0 deletions GRCh38.contigInformation.tsv

Large diffs are not rendered by default.

104 changes: 104 additions & 0 deletions Makefile
@@ -0,0 +1,104 @@
# Compiler
CXX = x86_64-conda_cos6-linux-gnu-g++

INCLUDE_DIR = ./include
BUILD_DIR = ./build
SRC_DIR = ./src


INCLUDE_FLAGS= -Iinclude -I$$CONDA_PREFIX/include
LIBDIR_FLAGS = -L$$CONDA_PREFIX/lib

# Compiler flags
# -c is passed only by the object-file rule, so CXXFLAGS can also be used when linking.
CXXFLAGS = -O3 -Wall -Wextra -flto -fmessage-length=0 -Wno-attributes
# Link statically only on request, e.g. `make STATIC=true`.
ifeq ($(STATIC),true)
CXXFLAGS := -static -static-libgcc -static-libstdc++ $(CXXFLAGS)
endif
CXXFLAGS := $(LIBDIR_FLAGS) -std=c++17 $(INCLUDE_FLAGS) $(CXXFLAGS)

# Source files
SOURCES = $(wildcard src/*.cpp)

# Object files are written to the build directory, with .o instead of .cpp.
OBJECTS = $(patsubst src/%.cpp,$(BUILD_DIR)/%.o,$(SOURCES))

# Binaries
BINARIES = sophia sophiaAnnotate sophiaMref

# Default rule
all: $(BINARIES)

$(BUILD_DIR):
	mkdir -p $@

# Rule for object files
$(BUILD_DIR)/%.o: %.cpp | $(BUILD_DIR)
	$(CXX) $(INCLUDE_FLAGS) $(LIBDIR_FLAGS) $(CXXFLAGS) -c $< -o $@

# Download strtk.hpp only if it is not present yet.
include/strtk.hpp:
	wget -c https://github.com/ArashPartow/strtk/raw/master/strtk.hpp -O include/strtk.hpp

.PHONY: download_strtk
download_strtk: include/strtk.hpp

vpath %.h include
vpath %.cpp src

# Rule for sophia
sophia: $(BUILD_DIR)/Alignment.o \
$(BUILD_DIR)/Breakpoint.o \
$(BUILD_DIR)/ChosenBp.o \
$(BUILD_DIR)/ChrConverter.o \
$(BUILD_DIR)/Hg37ChrConverter.o \
$(BUILD_DIR)/Hg38ChrConverter.o \
$(BUILD_DIR)/SamSegmentMapper.o \
$(BUILD_DIR)/Sdust.o \
$(BUILD_DIR)/SuppAlignment.o \
$(BUILD_DIR)/HelperFunctions.o \
$(BUILD_DIR)/GlobalAppConfig.o \
$(BUILD_DIR)/sophia.o \
| download_strtk
	$(CXX) $(LIBDIR_FLAGS) -o $@ $^ -lboost_program_options

# Rule for sophiaAnnotate
sophiaAnnotate: $(BUILD_DIR)/AnnotationProcessor.o \
$(BUILD_DIR)/Breakpoint.o \
$(BUILD_DIR)/BreakpointReduced.o \
$(BUILD_DIR)/ChrConverter.o \
$(BUILD_DIR)/Hg37ChrConverter.o \
$(BUILD_DIR)/Hg38ChrConverter.o \
$(BUILD_DIR)/DeFuzzier.o \
$(BUILD_DIR)/GermlineMatch.o \
$(BUILD_DIR)/MrefEntry.o \
$(BUILD_DIR)/MrefEntryAnno.o \
$(BUILD_DIR)/MrefMatch.o \
$(BUILD_DIR)/SuppAlignment.o \
$(BUILD_DIR)/SuppAlignmentAnno.o \
$(BUILD_DIR)/SvEvent.o \
$(BUILD_DIR)/HelperFunctions.o \
$(BUILD_DIR)/GlobalAppConfig.o \
$(BUILD_DIR)/sophiaAnnotate.o \
| download_strtk
	$(CXX) $(LIBDIR_FLAGS) $(filter-out -c,$(CXXFLAGS)) -o $@ $^ -lz -lboost_system -lboost_iostreams

# Rule for sophiaMref
sophiaMref: $(BUILD_DIR)/GlobalAppConfig.o \
$(BUILD_DIR)/ChrConverter.o \
$(BUILD_DIR)/Hg37ChrConverter.o \
$(BUILD_DIR)/Hg38ChrConverter.o \
$(BUILD_DIR)/HelperFunctions.o \
$(BUILD_DIR)/SuppAlignment.o \
$(BUILD_DIR)/SuppAlignmentAnno.o \
$(BUILD_DIR)/MrefEntry.o \
$(BUILD_DIR)/MrefEntryAnno.o \
$(BUILD_DIR)/MrefMatch.o \
$(BUILD_DIR)/MasterRefProcessor.o \
$(BUILD_DIR)/Breakpoint.o \
$(BUILD_DIR)/BreakpointReduced.o \
$(BUILD_DIR)/GermlineMatch.o \
$(BUILD_DIR)/DeFuzzier.o \
$(BUILD_DIR)/sophiaMref.o \
| download_strtk
	$(CXX) $(LIBDIR_FLAGS) $(filter-out -c,$(CXXFLAGS)) -o $@ $^ -lz -lboost_system -lboost_iostreams -lboost_program_options

# Rule for clean
.PHONY: clean
clean:
	rm -f $(OBJECTS) $(BINARIES)
31 changes: 19 additions & 12 deletions README.md
@@ -25,40 +25,47 @@ You can cite Sophia as follows:
Umut Toprak (2019).
DOI 10.11588/heidok.000274296

### Tools

* `sophia` - The main tool for SV calling. It takes a BAM file as input and outputs a list of SVs in mref format.
* `sophiaAnnotate` - Tool for annotating SVs with gene information. It reads in an mref file created by `sophiaMref` and annotates the SVs in the input file with gene information.
* `sophiaMref` - The `sophiaMref` tool processes a list of gzipped control bed files and generates a reference that can be used by `sophiaAnnotate` for annotating structural variants with gene information.

For instructions on command-line parameters, invoke the tool with `--help`.

## Runtime Dependencies

The only dependency is Boost 1.70.0 (currently). E.g. you can do
The only dependency is Boost 1.82.0 (currently). E.g. you can do

```bash
conda create -n sophia boost=1.70.0
conda create -n sophia boost=1.82.0
```

## Building

### Build-time Dependencies

* g++ >= 7
* Boost 1.70.0

### Dynamic Build

With Conda you can do

```bash
conda create -n sophia gxx_linux-64=8 boost=1.70.0
conda create -n sophia gxx_linux-64=8 boost=1.82.0
```

to create an environment to build the `sophia` and `sophiaAnnotate` binaries.
to create an environment to build the SOPHIA binaries.

To build you need to do

```bash
source activate sophia
cd Release_sophia
build-sophia.sh

cd Release_sophiaMref
./build-sophiaMref.sh

cd ../Release_sophia
./build-sophia.sh

cd ../Release_sophiaAnnotate
build-sophiaAnnotate.sh
./build-sophiaAnnotate.sh
```

Note that the build-scripts are for when you manage your dependencies with Conda.
59 changes: 30 additions & 29 deletions include/ChrConverter.h
@@ -26,51 +26,52 @@

namespace sophia {

using namespace std;
// These two are only to make the code clearer, but are not type checked. There are no opaque
// or strongly type-checked typedefs in C++17.
typedef size_t ChrIndex;
typedef size_t CompressedMrefIndex;

/** ChrConverter contains information on the names of chromosomes in an assembly. */
class ChrConverter {
protected:

/** The constructor should be used to initialize the fields from subclasses. It does
additional checks of the dimensions of the input vectors. */
ChrConverter(const vector<string>& indexToChr,
const vector<string>& indexToChrCompressedMref,
const vector<int>& chrSizesCompressedMref,
const vector<int>& indexConverter);
/** ChrConverter manages information on chromosome names, sizes, and index positions in
data arrays.
This probably needs a redesign. The current state is only an intermediate step away
from the former, completely procedural implementation, which leaked implementation
details heavily into the calling code and was highly tuned but very inflexible. */
class ChrConverter {

public:

virtual ~ChrConverter();

/** The name of the assembly. */
static const string assembly_name;
static const std::string assemblyName;

/** Mapping indices to chromosome names. */
const vector<string> indexToChr;
/** Number of chromosomes. */
virtual int nChromosomes() const = 0;

/** Mapping indices to chromosome names for compressed mref files. */
const vector<string> indexToChrCompressedMref;
/** Number of compressed mref chromosomes. */
virtual int nChromosomesCompressedMref() const = 0;

/** Chromosome sizes in base pairs. */
const vector<int> chrSizesCompressedMref;
/** Map an index position to a chromosome name. */
virtual std::string indexToChrName(ChrIndex index) const = 0;

/** Mapping chromosome names to indices. */
const vector<int> indexConverter;
/** Map an index position to a chromosome name for compressed mref files. */
virtual std::string indexToChrNameCompressedMref(CompressedMrefIndex index) const = 0;

/** Parse chromosome index. It takes a position in a character stream, and translates the
following character(s) into index positions (using ChrConverter::indexToChr). */
virtual int readChromosomeIndex(string::const_iterator startIt, char stopChar) const = 0;
/** Map the compressed mref index to the uncompressed mref index. */
virtual ChrIndex compressedMrefIndexToIndex(CompressedMrefIndex index) const = 0;

size_t n_chromosomes() {
return indexToChr.size();
};
/** Map compressed mref index to chromosome size. */
virtual int chrSizeCompressedMref(CompressedMrefIndex index) const = 0;

size_t n_chromosomes_compressed_mref() {
return indexToChrCompressedMref.size();
};
/** Map a chromosome name to an index position. */
virtual ChrIndex chrNameToIndex(std::string chrName) const = 0;

/** Parse chromosome index. It takes a position in a character stream, and translates the
following character(s) into index positions (using ChrConverter::indexToChr).
If the name cannot be parsed, throws a domain_error exception. */
virtual ChrIndex parseChrAndReturnIndex(std::string::const_iterator startIt,
char stopChar) const = 0;

};
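
The interface above can be served by a pair of lookup tables. The following is a minimal sketch of a map-based converter, not the `Hg38ChrConverter` from this commit: the class name `MapChrConverter`, its constructor, and the sample contig names are assumptions made for illustration. It shows how `chrNameToIndex` and `parseChrAndReturnIndex` might be built on a `std::unordered_map`, and how an unknown name leads to the `domain_error` that the interface documents.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the ChrIndex typedef declared in ChrConverter.h.
using ChrIndex = std::size_t;

// Hypothetical map-based converter (illustration only).
class MapChrConverter {
    std::vector<std::string> indexToName_;                  // index -> name
    std::unordered_map<std::string, ChrIndex> nameToIndex_; // name -> index

  public:
    explicit MapChrConverter(std::vector<std::string> names)
        : indexToName_(std::move(names)) {
        for (ChrIndex i = 0; i < indexToName_.size(); ++i) {
            // emplace() reports a failed insertion if the name is already present.
            if (!nameToIndex_.emplace(indexToName_[i], i).second) {
                throw std::invalid_argument("Duplicate chromosome name: " +
                                            indexToName_[i]);
            }
        }
    }

    int nChromosomes() const { return static_cast<int>(indexToName_.size()); }

    // Throws std::out_of_range for an invalid index.
    std::string indexToChrName(ChrIndex index) const {
        return indexToName_.at(index);
    }

    // Throws std::domain_error for an unknown name.
    ChrIndex chrNameToIndex(const std::string &chrName) const {
        const auto it = nameToIndex_.find(chrName);
        if (it == nameToIndex_.end()) {
            throw std::domain_error("Unknown chromosome name: " + chrName);
        }
        return it->second;
    }

    // Collects characters from startIt up to (not including) stopChar and
    // looks the name up. Precondition: stopChar occurs at or after startIt,
    // because the interface carries no end iterator.
    ChrIndex parseChrAndReturnIndex(std::string::const_iterator startIt,
                                    char stopChar) const {
        std::string name;
        while (*startIt != stopChar) {
            name.push_back(*startIt);
            ++startIt;
        }
        return chrNameToIndex(name);
    }
};

// Usage: parse the chromosome column of a tab-separated line.
inline void exampleUsage() {
    const MapChrConverter converter({"chr1", "chr2", "chrUn_KI270742v1"});
    const std::string line = "chr2\t12345";
    assert(converter.parseChrAndReturnIndex(line.cbegin(), '\t') == 1);
    assert(converter.indexToChrName(2) == "chrUn_KI270742v1");
}
```

Compared with a hard-coded index array, the map accepts arbitrary contig names at the cost of a hash lookup per parsed name.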

6 changes: 3 additions & 3 deletions include/GlobalAppConfig.h
@@ -38,12 +38,12 @@ namespace sophia {

protected:

GlobalAppConfig(unique_ptr<ChrConverter const> chrConverter);
GlobalAppConfig(std::unique_ptr<ChrConverter const> chrConverter);

~GlobalAppConfig();

/** The chromosome converter. */
const unique_ptr<ChrConverter const> chrConverter;
const std::unique_ptr<ChrConverter const> chrConverter;

public:

@@ -56,7 +56,7 @@ namespace sophia {
void operator=(const GlobalAppConfig &) = delete;

/** Factory method. */
static GlobalAppConfig &init(unique_ptr<ChrConverter const> chrConverter);
static GlobalAppConfig &init(std::unique_ptr<ChrConverter const> chrConverter);

/** Getter. */
static const GlobalAppConfig &getInstance();
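
The declarations in this hunk follow an init-once singleton pattern: `init` takes ownership of the converter and `getInstance` hands out read-only access. The sketch below shows one way such a pattern could be implemented. It is not the code from `GlobalAppConfig.cpp`; the names `AppConfig`, `Converter`, `DummyHg38`, and `getConverter`, as well as the error handling, are assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, reduced converter interface (stand-in for ChrConverter).
struct Converter {
    virtual ~Converter() = default;
    virtual std::string assembly() const = 0;
};

struct DummyHg38 : Converter {
    std::string assembly() const override { return "hg38"; }
};

// Hypothetical init-once configuration holder.
class AppConfig {
    static std::unique_ptr<AppConfig> instance_;
    const std::unique_ptr<Converter const> converter_;

    explicit AppConfig(std::unique_ptr<Converter const> converter)
        : converter_(std::move(converter)) {
        assert(converter_ != nullptr); // init() guarantees this invariant
    }

  public:
    AppConfig(const AppConfig &) = delete;
    void operator=(const AppConfig &) = delete;

    // Factory method: may be called exactly once (assumed policy).
    static AppConfig &init(std::unique_ptr<Converter const> converter) {
        if (instance_) {
            throw std::logic_error("AppConfig is already initialized");
        }
        if (!converter) {
            throw std::invalid_argument("converter must not be null");
        }
        // make_unique cannot reach the private constructor, hence plain new.
        instance_.reset(new AppConfig(std::move(converter)));
        return *instance_;
    }

    // Getter: only valid after init().
    static const AppConfig &getInstance() {
        if (!instance_) {
            throw std::logic_error("AppConfig is not initialized");
        }
        return *instance_;
    }

    const Converter &getConverter() const { return *converter_; }
};

std::unique_ptr<AppConfig> AppConfig::instance_ = nullptr;

// Usage: initialize once at program start, then read the configuration.
inline std::string demoAssembly() {
    AppConfig::init(std::make_unique<DummyHg38>());
    return AppConfig::getInstance().getConverter().assembly();
}
```

Passing the converter as `std::unique_ptr<Converter const>` makes the ownership transfer explicit and keeps the converter immutable for the lifetime of the program.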
51 changes: 47 additions & 4 deletions include/Hg37ChrConverter.h
@@ -26,16 +26,59 @@

namespace sophia {

using namespace std;

/** Hard-coded chromosome converter for hg37. This tries to encapsulate the implementation
details of the original version. */
class Hg37ChrConverter: public ChrConverter {
protected:

/** The constructor does additional checks of the dimensions of the input vectors. */
Hg37ChrConverter(const std::vector<std::string>& indexToChr,
const std::vector<std::string>& indexToChrCompressedMref,
const std::vector<CompressedMrefIndex>& chrSizesCompressedMref,
const std::vector<ChrIndex>& indexConverter);

/** Mapping indices to chromosome names. */
const std::vector<std::string> indexToChr;

/** Mapping indices to chromosome names for compressed mref indices. */
const std::vector<std::string> indexToChrCompressedMref;

/** Chromosome sizes in base pairs, only for compressed mref chromosomes. */
const std::vector<CompressedMrefIndex> chrSizesCompressedMref;

/** Mapping compressed mref indices to chromosome indices. */
const std::vector<ChrIndex> indexConverter;

public:

static const string assembly_name;
static const std::string assemblyName;

Hg37ChrConverter();

int readChromosomeIndex(string::const_iterator startIt, char stopChar) const;
/** Return the number of chromosomes. */
int nChromosomes() const;

/** Number of compressed mref chromosomes. */
int nChromosomesCompressedMref() const;

/** Map an index position to a chromosome name. */
std::string indexToChrName(ChrIndex index) const;

/** Map an index position to a chromosome name for compressed mref files. */
std::string indexToChrNameCompressedMref(CompressedMrefIndex index) const;

/** Map the compressed mref index to the uncompressed mref index. */
ChrIndex compressedMrefIndexToIndex(CompressedMrefIndex index) const;

/** Map compressed mref index to chromosome size. */
int chrSizeCompressedMref(CompressedMrefIndex index) const;

/** Map a chromosome name to an index position for compressed mref files. */
CompressedMrefIndex chrNameToIndexCompressedMref(std::string chrName) const;

/** Parse chromosome name given an iterator (start) and termination character.
Validate against the pre-declared chromosome names. */
ChrIndex parseChrAndReturnIndex(std::string::const_iterator startIt, char stopChar) const;

};
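
The class above works with two index spaces, and the comment in `ChrConverter.h` notes that the `ChrIndex` and `CompressedMrefIndex` typedefs are not type checked. As a sketch of one possible alternative, the example below wraps each index in its own struct, so that mixing them up becomes a compile error, and validates the table dimensions in the constructor. The names `FullIndex`, `CompressedIndex`, and `IndexTables`, the specific checks, and the toy chromosome sizes are assumptions made for illustration, not the checks performed by `Hg37ChrConverter`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical strongly typed index wrappers: distinct types, so one cannot
// be passed where the other is expected.
struct FullIndex { std::size_t value; };
struct CompressedIndex { std::size_t value; };

// Hypothetical holder of the four lookup tables.
class IndexTables {
    std::vector<std::string> indexToChr_;           // by FullIndex
    std::vector<std::string> indexToChrCompressed_; // by CompressedIndex
    std::vector<int> chrSizesCompressed_;           // by CompressedIndex
    std::vector<FullIndex> indexConverter_;         // by CompressedIndex

  public:
    IndexTables(std::vector<std::string> indexToChr,
                std::vector<std::string> indexToChrCompressed,
                std::vector<int> chrSizesCompressed,
                std::vector<FullIndex> indexConverter)
        : indexToChr_(std::move(indexToChr)),
          indexToChrCompressed_(std::move(indexToChrCompressed)),
          chrSizesCompressed_(std::move(chrSizesCompressed)),
          indexConverter_(std::move(indexConverter)) {
        // All tables indexed by the compressed index must have equal length.
        const std::size_t n = indexToChrCompressed_.size();
        if (chrSizesCompressed_.size() != n || indexConverter_.size() != n) {
            throw std::invalid_argument("Compressed mref tables differ in length");
        }
        // Every conversion target must be a valid full index.
        for (const FullIndex &target : indexConverter_) {
            if (target.value >= indexToChr_.size()) {
                throw std::invalid_argument("Index converter target out of range");
            }
        }
    }

    FullIndex compressedMrefIndexToIndex(CompressedIndex index) const {
        return indexConverter_.at(index.value);
    }

    int chrSizeCompressedMref(CompressedIndex index) const {
        return chrSizesCompressed_.at(index.value);
    }

    const std::string &indexToChrName(FullIndex index) const {
        return indexToChr_.at(index.value);
    }
};

// Usage with toy data (the sizes are made up, not real chromosome lengths).
inline void exampleUsage() {
    const IndexTables tables({"1", "2", "MT", "X"}, {"1", "2", "X"},
                             {1000, 900, 800},
                             {FullIndex{0}, FullIndex{1}, FullIndex{3}});
    const FullIndex full = tables.compressedMrefIndexToIndex(CompressedIndex{2});
    assert(full.value == 3);
    assert(tables.indexToChrName(full) == "X");
    // tables.indexToChrName(CompressedIndex{2}); // would not compile
}
```

The wrappers add no runtime cost, but every call site has to state which index space a value belongs to.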
