Dag io custom format (#83)
- dagbin
  - Added the dagbin binary DAG file format as an alternative to protobuf.
  - Updated input/output file and format options to include dagbin (formats can now also be inferred from the file extension).
- maintenance
  - Moved `sample_dag.hpp` to `test_common_dag.hpp`, which now contains frequently used DAG utilities in addition to sample DAGs.
  - Added the `run_larch_usher` function to `test_common.hpp`.
  - Added the `data/_ignore` folder and redirected all test files to that location.
  - Updated all timing code to use the `Benchmark` class.
  - Updated CLI options for larch-usher, merge, and dag2dot to be more consistent across tools.
  - Updated the README to clarify changes to the CLI.
  - Added options to larch-usher to set the random seed and number of threads.
davidrich27 authored Jun 13, 2024
1 parent f6fbb9f commit f2c5388
Showing 39 changed files with 1,761 additions and 410 deletions.
5 changes: 3 additions & 2 deletions CMakeLists.txt
@@ -242,9 +242,10 @@ larch_executable(larch-test
test/test_count_trees.cpp
test/test_dag_completion.cpp
test/test_dag_trimming.cpp
test/test_fileio_dagbin.cpp
test/test_fileio_protobuf.cpp
test/test_larch_usher.cpp
test/test_lca.cpp
test/test_loading.cpp
test/test_map.cpp
test/test_mat_conversion.cpp
test/test_matOptimize.cpp
@@ -258,7 +259,7 @@ larch_executable(larch-test
test/test_subtree_weight.cpp
test/test_weight_accum.cpp
test/test_weight_counter.cpp
test/test_write_parsimony.cpp
test/test_write_parsimony_protobuf.cpp
)
target_compile_options(larch-test PRIVATE ${STRICT_WARNINGS})

58 changes: 58 additions & 0 deletions Dockerfile
@@ -0,0 +1,58 @@
FROM ubuntu:22.04
# FROM continuumio/miniconda3:latest

# OPTIONS
ARG NUM_THREADS="8"

RUN apt -y update \
&& apt -y upgrade

# Install required/recommended programs
RUN apt -y install --no-install-recommends \
wget \
ssh \
git \
vim \
nano \
perl \
black \
clang-format \
clang-tidy \
less
RUN apt -y install --no-install-recommends \
cmake \
protobuf-compiler \
automake \
autoconf \
libtool \
nasm \
yasm
RUN apt -y install --no-install-recommends \
wget \
git \
ca-certificates \
make \
g++ \
mpi-default-dev \
libboost-dev \
libboost-program-options-dev \
libboost-filesystem-dev \
libboost-date-time-dev \
libboost-iostreams-dev

# Copy repo
WORKDIR /app
COPY . /app

# Install larch
WORKDIR /app
RUN rm -rf /app/build
WORKDIR /app/build
RUN cmake -DCMAKE_BUILD_TYPE=Debug ..
RUN make -j${NUM_THREADS}

# Working directory
WORKDIR /data

# Start a bash shell when the container launches
CMD ["/bin/bash"]
49 changes: 32 additions & 17 deletions README.md
@@ -51,7 +51,7 @@ conda env create -f environment.yml
Building
--------

There are 4 executables that are built automatically as part of the larch package and provide various methods for exploring tree space and manipulating DAGs/trees:
There are 4 executables that are built automatically as part of the larch package and provide various methods for exploring tree space and manipulating DAGs/trees:
- `larch-test` is the suite of tests used to validate the various routines.
- `larch-usher` takes an input tree/DAG and explores tree space through SPR moves.
- `merge` utility is used to manipulate (e.g. combine, prune) DAGs/trees.
@@ -95,6 +95,16 @@ larch-test options:
- `+tag` includes tests with a given tag.
- For example, `-tag "slow"` excludes tests that require a long runtime to complete.

### file formats

All tools in this suite support several file formats for loading and storing MATs and MADAGs. When passing filepaths as arguments, the file format can be specified explicitly with the `--input-format`/`--output-format` options. Alternatively, the program infers the file format when the filepath has a recognized file extension.

File format options:
- `MADAG dagbin` Supported as input and output. `*.dagbin` is the recognized extension.
- `MADAG protobuf` Supported as input and output. `*.pb_dag` is the recognized extension, or using `*.pb` WITHOUT a `--MAT-refseq-file` option.
- `MAT protobuf` Supported as input only. `*.pb_tree` is the recognized extension, or using `*.pb` WITH a `--MAT-refseq-file` option.
- `MADAG json` Supported as input only. `*.json_dag` or `*.json` is the recognized extension.
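
The extension rules listed above could be implemented along the following lines. This is a minimal sketch, not larch's actual code: the `FileFormat` enum, `EndsWith`, and `InferFileFormat` are hypothetical names, and stripping a trailing `.gz` before matching is an assumption based on the `.pb.gz` paths used in the README examples.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum for illustration; larch's real types may differ.
enum class FileFormat { Dagbin, MadagProtobuf, MatProtobuf, MadagJson };

// Returns true if `text` ends with `suffix`.
inline bool EndsWith(std::string_view text, std::string_view suffix) {
  return text.size() >= suffix.size() &&
         text.substr(text.size() - suffix.size()) == suffix;
}

// Infers the format from the file extension. `has_refseq` stands in for
// whether a `--MAT-refseq-file` option was given, which disambiguates `*.pb`.
// Returns std::nullopt when the extension is not recognized.
inline std::optional<FileFormat> InferFileFormat(std::string_view path,
                                                 bool has_refseq) {
  // Assumption: gzip-compressed inputs are matched on the inner extension.
  if (EndsWith(path, ".gz")) path.remove_suffix(3);

  if (EndsWith(path, ".dagbin")) return FileFormat::Dagbin;
  if (EndsWith(path, ".pb_dag")) return FileFormat::MadagProtobuf;
  if (EndsWith(path, ".pb_tree")) return FileFormat::MatProtobuf;
  if (EndsWith(path, ".json_dag") || EndsWith(path, ".json")) {
    return FileFormat::MadagJson;
  }
  if (EndsWith(path, ".pb")) {
    // A bare `*.pb` is a MAT only when a reference sequence is supplied.
    return has_refseq ? FileFormat::MatProtobuf : FileFormat::MadagProtobuf;
  }
  return std::nullopt;
}
```

Under these assumptions, `InferFileFormat("tree.pb", true)` yields `FileFormat::MatProtobuf`, while the same path without a reference sequence yields `FileFormat::MadagProtobuf`. An unrecognized extension returns an empty optional, at which point a caller would need an explicit `--input-format` value.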

### larch-usher

From the `larch/build/` directory:
@@ -104,12 +114,12 @@ From the `larch/build/` directory:
This command runs 10 iterations of larch-usher on the provided tree, and writes the final result to the file `output_dag.pb`.

larch-usher options:
- `-i,--input` [REQUIRED] The name of the input tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON).
- `-o,--output` [REQUIRED] The file path to write the resulting DAG to.
- `-i,--input` [REQUIRED] Filepath to the input tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
- `-o,--output` [REQUIRED] Filepath to the output tree/DAG (accepted file formats are: MADAG protobuf, Dagbin).
- `-c,--count` [Default: 1] Number of larch-usher iterations to run.
- `-r,--MAT-refseq-file` [REQUIRED if provided input file is a MAT protobuf] Reference sequence file.
- `-v,--VCF-input-file` VCF file containing ambiguous sequence data.
- `-l,--logpath` [Default: `optimization_log`] Filepath to write log to.
- `-r,--MAT-refseq-file` [REQUIRED if provided input file is a MAT protobuf] Filepath to json reference sequence.
- `-v,--VCF-input-file` Filepath to VCF containing ambiguous sequence data.
- `-l,--logpath` [Default: `optimization_log`] Filepath to write summary log.
- `-s,--switch-subtrees` [Default: never] Switch to optimizing subtrees after the specified number of iterations.
- `--min-subtree-clade-size` [Default: 100] The minimum number of leaves in a subtree sampled for optimization (ignored without option `-s`).
- `--max-subtree-clade-size` [Default: 1000] The maximum number of leaves in a subtree sampled for optimization (ignored without option `-s`).
@@ -122,36 +132,41 @@ larch-usher options:
- `--trim` [Default: do not trim] Trim optimized dag to contain only parsimony-optimal trees before writing to protobuf.
- `--keep-fragment-uncollapsed` [Default: collapse] Do not collapse empty (non-mutation-bearing) edges in the optimization tree.
- `--quiet` [Default: write intermediate files] Do not write intermediate protobuf file at each iteration.
- `--input-format` [Default: format inferred by file extension] Specify the format of the input file. Options are: (`dagbin`, `pb`, `dag-pb`, `tree-pb`, `json`, `dag-json`)
- `--output-format` [Default: format inferred by file extension] Specify the format of the output file. Options are: (`dagbin`, `pb`, `dag-pb`)

### merge

From the `larch/build/` directory:
```shell
./merge -i ../data/testcase/tree1.pb.gz -i ../data/testcase/tree2.pb.gz -d -o merged_trees.pb
./merge -i ../data/testcase/tree_1.pb.gz -i ../data/testcase/tree_2.pb.gz -d -o merged_trees.pb
```
This executable takes a list of protobuf files and merges the resulting DAGs together into one.

merge options:
- `-i,--input` Input protobuf files.
- `-o,--output` [Default: `merged.pb`] Save the output to filename.
- `-r,--refseq` [REQUIRED if input protobufs are MAT protobuf format] Read reference sequence from file.
- `-d,--dag` Input files are MADAG protobuf format.
- `-t,--trim` Trim output (default trimming method is trim to best parsimony).
- `--rf` Trim output to minimize RF distance to the provided protobuf(Ignored if `-t` flag is not provided).
- `-i,--input` Filepath to the input Tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
- `-o,--output` [Default: `merged.dagbin`] Filepath to the output Tree/DAG (accepted file formats are: MADAG protobuf, Dagbin).
- `-r,--MAT-refseq-file` [REQUIRED if input protobufs are MAT protobuf format] Filepath to json reference sequence.
- `-t,--trim` Trim output (Default trimming method is trim to best parsimony).
- `--rf` Trim output to minimize RF distance to the provided DAG file (Ignored if `-t` flag is not provided).
- `-s,--sample` Write a sampled single tree from DAG to file, rather than the whole DAG.
- `--input-format` [Default: format inferred by file extension] Specify the format of the input file(s). Options are: (`dagbin`, `pb`, `dag-pb`, `tree-pb`, `json`, `dag-json`)
- `--output-format` [Default: format inferred by file extension] Specify the format of the output file. Options are: (`dagbin`, `pb`, `dag-pb`)
- `--rf-format` [Default: format inferred by file extension] Specify the format of the RF file. Options are: (`dagbin`, `pb`, `dag-pb`, `tree-pb`, `json`, `dag-json`)

### dag2dot

From the `larch/build/` directory:
```shell
./dag2dot -d ../data/testcase/full_dag.pb
./dag2dot -i ../data/testcase/full_dag.pb
```
This command writes the provided DAG in dot format to stdout.

dag2dot options:
- `-t,--tree-pb` Input MAT protobuf filename.
- `-d,--dag-pb` Input DAG protobuf filename.
- `-j,--dag-json` Input DAG json filename.
- `-i,--input` Filepath to the input Tree/DAG (accepted file formats are: MADAG protobuf, MAT protobuf, JSON, Dagbin).
- `-o,--output` [Default: DOT written to stdout] Filepath to the output DOT file.
- `--input-format` [Default: format inferred by file extension] Specify the format of the input file. Options are: (`dagbin`, `pb`, `dag-pb`, `tree-pb`, `json`, `dag-json`)
- `--dag/--tree` [REQUIRED if file extension is *.pb] Specify whether input file is a DAG or a Tree.


Third-party
4 changes: 4 additions & 0 deletions data/_ignore/.gitignore
@@ -0,0 +1,4 @@
# Ignore all files in this directory
*
# But do not ignore this .gitignore file
!/.gitignore
Binary file added data/big_test/big_test.pb.gz
Binary file not shown.
1 change: 0 additions & 1 deletion data/data

This file was deleted.

120 changes: 111 additions & 9 deletions include/larch/benchmark.hpp
@@ -5,42 +5,144 @@
class Benchmark {
public:
using TimePoint = decltype(std::chrono::high_resolution_clock::now());
using Us = std::chrono::microseconds;
using Ms = std::chrono::milliseconds;
using S = std::chrono::seconds;

inline Benchmark(bool start_on_init = true);

inline void start();
inline void stop();

template <typename time_scale>
inline auto lap();
inline auto lapUs();
inline auto lapMs();
inline auto lapS();

template <typename time_scale>
inline std::string lapFormat();
inline std::string lapFormatUs();
inline std::string lapFormatMs();
inline std::string lapFormatS();

template <typename time_scale>
inline auto duration() const;
inline auto durationUs() const;
inline auto durationMs() const;
inline auto durationS() const;

template <typename time_scale>
inline std::string durationFormat() const;
inline std::string durationFormatUs() const;
inline std::string durationFormatMs() const;
inline std::string durationFormatS() const;

template <typename time_scale>
inline static std::string format(long int ticks);
inline static std::string formatUs(long int ticks);
inline static std::string formatMs(long int ticks);
inline static std::string formatS(long int ticks);

private:
TimePoint start_;
TimePoint stop_;
};

///////////////////////////////////////////////////////////////////////////////

Benchmark::Benchmark(bool start_on_init) {
if (start_on_init) {
start();
}
}

void Benchmark::start() { start_ = std::chrono::high_resolution_clock::now(); }

void Benchmark::stop() { stop_ = std::chrono::high_resolution_clock::now(); }

auto Benchmark::durationUs() const {
return std::chrono::duration_cast<std::chrono::microseconds>(stop_ - start_).count();
template <typename time_scale>
auto Benchmark::duration() const {
return std::chrono::duration_cast<time_scale>(stop_ - start_).count();
}

auto Benchmark::durationUs() const { return duration<std::chrono::microseconds>(); }

auto Benchmark::durationMs() const { return duration<std::chrono::milliseconds>(); }

auto Benchmark::durationS() const { return duration<std::chrono::seconds>(); }

template <typename time_scale>
std::string Benchmark::durationFormat() const {
return format<time_scale>(duration<time_scale>());
}

std::string Benchmark::durationFormatUs() const {
return durationFormat<std::chrono::microseconds>();
}

auto Benchmark::durationMs() const {
return std::chrono::duration_cast<std::chrono::milliseconds>(stop_ - start_).count();
std::string Benchmark::durationFormatMs() const {
return durationFormat<std::chrono::milliseconds>();
}

auto Benchmark::durationS() const {
return std::chrono::duration_cast<std::chrono::seconds>(stop_ - start_).count();
std::string Benchmark::durationFormatS() const {
return durationFormat<std::chrono::seconds>();
}

auto Benchmark::lapMs() {
template <typename time_scale>
auto Benchmark::lap() {
stop();
auto result = durationMs();
auto result = duration<time_scale>();
start_ = stop_ = std::chrono::high_resolution_clock::now();
return result;
}
}

auto Benchmark::lapUs() { return lap<std::chrono::microseconds>(); }

auto Benchmark::lapMs() { return lap<std::chrono::milliseconds>(); }

auto Benchmark::lapS() { return lap<std::chrono::seconds>(); }

template <typename time_scale>
std::string Benchmark::lapFormat() {
return format<time_scale>(lap<time_scale>());
}

std::string Benchmark::lapFormatUs() { return lapFormat<std::chrono::microseconds>(); }

std::string Benchmark::lapFormatMs() { return lapFormat<std::chrono::milliseconds>(); }

std::string Benchmark::lapFormatS() { return lapFormat<std::chrono::seconds>(); }

template <typename time_scale>
std::string Benchmark::format(long int ticks) {
long int ticks_per_second;
if constexpr (std::is_same_v<time_scale, std::chrono::seconds>) {
ticks_per_second = 1;
} else if constexpr (std::is_same_v<time_scale, std::chrono::milliseconds>) {
ticks_per_second = 1000;
} else if constexpr (std::is_same_v<time_scale, std::chrono::microseconds>) {
ticks_per_second = 1000000;
} else {
static_assert(!std::is_same_v<time_scale, time_scale>, "ERROR: Unsupported type.");
}

std::stringstream ss;
auto min = ticks / (60 * ticks_per_second);
auto sec = static_cast<double>(ticks % (60 * ticks_per_second)) /
static_cast<double>(ticks_per_second);
ss << min << "m" << sec << "s";
return ss.str();
}

std::string Benchmark::formatUs(long int ticks) {
return format<std::chrono::microseconds>(ticks);
}

std::string Benchmark::formatMs(long int ticks) {
return format<std::chrono::milliseconds>(ticks);
}

std::string Benchmark::formatS(long int ticks) {
return format<std::chrono::seconds>(ticks);
}
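
The minutes/seconds formatting done by `Benchmark::format` can be tried in isolation with a sketch like the one below. It is a standalone illustration, not the code from this commit: `FormatTicks` is a hypothetical free function, and it derives the ticks-per-second factor from `TimeScale::period` instead of the `if constexpr` chain used above.

```cpp
#include <chrono>
#include <sstream>
#include <string>

// Hypothetical standalone helper mirroring the "<min>m<sec>s" output format.
// TimeScale is a std::chrono duration type such as std::chrono::milliseconds.
template <typename TimeScale>
std::string FormatTicks(long ticks) {
  // period is std::ratio<num, den> seconds per tick, so den / num is the
  // number of ticks per second (1, 1000, 1000000 for s, ms, us).
  constexpr long ticks_per_second =
      TimeScale::period::den / TimeScale::period::num;
  static_assert(ticks_per_second >= 1,
                "Only scales of one second or finer are supported.");

  const long min = ticks / (60 * ticks_per_second);
  const double sec = static_cast<double>(ticks % (60 * ticks_per_second)) /
                     static_cast<double>(ticks_per_second);

  std::stringstream ss;
  ss << min << "m" << sec << "s";
  return ss.str();
}
```

With this sketch, `FormatTicks<std::chrono::milliseconds>(61500)` produces `1m1.5s`. Reading the factor from the duration's period rejects coarser scales such as `std::chrono::minutes` at compile time, at the cost of also accepting fine scales (for example nanoseconds) that the committed `format` deliberately rejects.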