Skip to content

Commit

Permalink
Merge tag 'v0.2.0' into ci
Browse files Browse the repository at this point in the history
v0.2.0
  • Loading branch information
nromashchenko committed Feb 7, 2024
2 parents e97bf9f + a630778 commit ed82b29
Show file tree
Hide file tree
Showing 6 changed files with 122 additions and 82 deletions.
27 changes: 7 additions & 20 deletions CHANGELOG.txt
Original file line number Diff line number Diff line change
@@ -1,23 +1,10 @@
============================================================================================================
v0.1.3
- Changed the database name convention (omega parameter)
- PhyML parser rolled back to slow and naive solution due to breaking updates in last PhyML versions
v0.2.0

============================================================================================================
v0.1.2
- The database name convention has changed to be consistent to the latest versions of RAPPASv1 (java)
- The default omega value now is 1.5
- Updated core to v0.1.7, which supports ambiguous bases

============================================================================================================
v0.1.1
The pre-publication release intended to be the reference version for the further improvements. Previous releases are left mostly for history and are not supposed to be used.

- New construction algorithm: now the construction is done by completing two steps.
In the first one, hashmaps of group nodes are built independently and stored on disk in ${workdir}/hashmaps.
After, they are merged into a database hashmap.
- AR Probability matrix is destroyed after the first stage of the algorithm.
It makes the peak RAM consumption approximately to be max(proba matrix size, final database size).

============================================================================================================
v0.1.0
First released version.
- Supported partial loading of databases with phylo-k-mer filtering (--mu, --max-ram)
- Reintroduced parallelism (--threads)
- Reworked LWR formula so that it is normalized over all branches of the tree
- Various improvements and fixes
- Removed the old changelog since it was related to the functionality that had migrated to IPK
55 changes: 51 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,11 @@
# EPIK: Evolutionary Placement with Informative K-mers


[![build](https://github.com/phylo42/EPIK/actions/workflows/build.yml/badge.svg)](https://github.com/phylo42/EPIK/actions/workflows/build.yml)
<a>
<img src="https://img.shields.io/badge/softwipe-7.6-green" />
</a>
EPIK is a program for rapid alignment-free phylogenetic placement, the successor of [RAPPAS](https://github.com/phylo42/RAPPAS).

## Installation
Expand Down Expand Up @@ -26,16 +33,56 @@ cmake ..
make -j4
```

### Install
You can use `epik.py` from the directory where it was built or install it system-wide or for a single user to make `epik.py` visible from any directory.

For a system-wide installation (requires elevated permissions):
```
sudo cmake --install .
```

Alternatively, to install for the current user, choose a directory where you want to install the tool. For instance, you might choose `/home/$USER/opt` or any other directory that you prefer. Replace `DIRECTORY` in the commands below with your chosen directory path:

```
cmake --install . --prefix DIRECTORY
export PATH=DIRECTORY/bin:$PATH
```
Remember to export the `DIRECTORY/bin` to your `PATH`. You can do this manually each time or add the export command to your shell initialization scripts (e.g., `.bashrc`).


## Usage


### Phylogenetic placement
To place queries to a phylogenetic tree, you need to first preprocess it with IPK and make a phylo-k-mer database (see [here](https://github.com/phylo42/IPK) for detail). Queries should be in non-compressed fasta format. An example of placement command (see below for possible parameters values):
```
python epik.py place -i DATABASE -s [nucl|amino] -o OUTPUT_DIR INPUT_FASTA
epik.py place -i DATABASE -s [nucl|amino] -o OUTPUT_DIR INPUT_FASTA
```
See `python epik.py place --help` for more information.
If EPIK is not installed, run `./epik.py` from the EPIK directory instead.

### Parameters

| Option | Meaning | Default |
|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| -i | The path to the phylo-k-mer database to use for placement. | |
| -s | States, `nucl` for DNA and `amino` for proteins | nucl |
| --omega | The user-defined threshold. Can be set higher than the one used when database was created. (If you are not sure, ignore this parameter.) | 1.5 |
| --mu | The proportion of the database to keep when filtering. Mutually exclusive with `--max-ram`. Should be a value in (0.0, 1.0] | 1.0 |
| --max-ram | The maximum amount of memory used to keep the database content. Mutually exclusive with `--mu`. Sets an approximate limit to EPIK's RAM consumption. | |
| --threads | Number of parallel threads used for placement. EPIK should be compiled with OpenMP support enabled, i.e. `EPIK_OMP=ON`. (If you compile as we recommend, it is enabled) | 1 |

Also, see `epik.py place --help` for information.


### Building databases
## Other

### Code quality

Code quality evaluation with [softwipe](https://github.com/adrianzap/softwipe) [1]:
```
softwipe --cmake --cpp -x third-party,i2l/third-party,i2l/tests/catch2,i2l/examples --no-execution .
```

To compute phylo-k-mer databases, use [IPK](https://github.com/phylo42/IPK).

## References
[1] A. Zapletal, D. Höhler, C. Sinz, A. Stamatakis (2021) The SoftWipe tool and benchmark for assessing coding standards adherence of scientific software Sci Rep 11, 10015 (2021). https://doi.org/10.1038/s41598-021-89495-8
11 changes: 9 additions & 2 deletions epik.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,11 @@
import subprocess


__version__ = "0.2.0"


@click.group()
@click.version_option(__version__)
def epik():
"""
EPIK: Evolutionary Placement with Informative K-mers
Expand Down Expand Up @@ -46,7 +50,7 @@ def epik():
help="Output directory.")
@click.option('--threads',
type=int,
default=4, show_default=True,
default=1, show_default=True,
help="Number of threads used.")
@click.option('--max-ram',
type=str,
Expand All @@ -57,7 +61,10 @@ def place(database, states, omega, mu, outputdir, threads, max_ram, input_file):
"""
Places .fasta files using the input IPK database.
\tpython epik.py place -s [nucl|amino] -i db.ipk -o output file.fasta [file2.fasta ...]
epik.py place -s [nucl|amino] -i DB.ipk -o output file.fasta [file2.fasta ...]
Examples:
\tepik.py place -i DB.ipk -o temp --max-ram 4G --threads 8 query.fasta
"""
place_queries(database, states, omega, mu, outputdir, threads, max_ram, input_file)
Expand Down
78 changes: 53 additions & 25 deletions epik/src/epik/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
#include <sstream>
#include <iomanip>
#include <cctype>
#include <cmath>
#include <stdexcept>
#include <boost/filesystem.hpp>
#include <cxxopts.hpp>
Expand Down Expand Up @@ -63,29 +64,51 @@ void print_intruction_set()

/// Float-to-humanized string for better output
template<typename T>
std::string humanize(T num)
std::string to_human_readable(T num)
{
std::ostringstream oss;
oss.precision(1);

if (num < 1000.0)
if (num < 1024)
{
oss << std::fixed << num;
}
else if (num < 1000000.0)
{
oss << std::fixed << num / 1000.0 << "K";
}
else if (num < 1000000000.0)
{
oss << std::fixed << num / 1000000.0 << "M";
}
else
{
oss << std::fixed << num / 1000000000.0 << "B";
double value;
std::string suffix;

if (num < 1024 * 1024)
{
value = num / 1024.0;
suffix = "K";
}
else if (num < 1024 * 1024 * 1024)
{
value = num / (1024.0 * 1024.0);
suffix = "M";
}
else
{
value = num / (1024.0 * 1024.0 * 1024.0);
suffix = "B";
}

// Check if the fractional part is zero
double int_part;
double frac_part = std::modf(value, &int_part);
if (frac_part == 0.0)
{
oss << static_cast<long long>(int_part) << suffix;
}
else
{
oss.precision(1);
oss << std::fixed << value << suffix;
}
}

return oss.str();

}

/// Size_t-to-string that translates milliseconds to humanized time
Expand Down Expand Up @@ -128,9 +151,9 @@ std::string humanize_time(size_t milliseconds)
return oss.str();
}

/// Parse the humanized RAM size to a number.
/// I know that the name of this function is unfortunate.
size_t dehumanize_ram(const std::string& max_ram)
/// Parse a human-readable --max-ram value as the number of bytes
/// e.g. 128K, 50M, 4.2Gb etc.
size_t parse_human_readable(const std::string& max_ram)
{
double value;
char unit = 0;
Expand All @@ -140,7 +163,7 @@ size_t dehumanize_ram(const std::string& max_ram)
ss >> value;
if (ss.fail())
{
throw std::runtime_error("Can't parse max_ram parameter: wrong numerical part");
throw std::runtime_error("Could not parse --max-ram parameter: wrong numerical part");
}

// Check if there is a memory unit
Expand All @@ -149,7 +172,7 @@ size_t dehumanize_ram(const std::string& max_ram)
ss >> unit;
if (ss.fail())
{
throw std::runtime_error("Can't parse max_ram parameter: wrong unit");
throw std::runtime_error("Could not parse --max-ram parameter: wrong unit");
}
}

Expand Down Expand Up @@ -230,10 +253,15 @@ int main(int argc, char** argv)
if (parsed_options.count("max-ram"))
{
const auto max_ram_string = parsed_options["max-ram"].as<std::string>();
const auto max_ram = dehumanize_ram(max_ram_string);
max_entries = max_ram / sizeof(i2l::pkdb_value);
const auto max_ram = parse_human_readable(max_ram_string);
max_entries = static_cast<size_t>(max_ram / sizeof(i2l::pkdb_value));

if (max_entries == 0)
{
throw std::runtime_error("Memory limit is too low");
}
std::cout << "Max-RAM provided: will be loaded not more than "
<< humanize(max_entries) << " phylo-k-mers." << std::endl;
<< to_human_readable(max_entries) << " phylo-k-mers." << std::endl;
}

#ifndef EPIK_OMP
Expand All @@ -259,8 +287,8 @@ int main(int argc, char** argv)
<< "\tk: " << db.kmer_size() << std::endl
<< "\tomega: " << db.omega() << std::endl
<< "\tPositions loaded: " << (db.positions_loaded() ? "true" : "false") << std::endl << std::endl;
std::cout << "Loaded " << humanize(db.get_num_entries_loaded())
<< " of " << humanize(db.get_num_entries_total())
std::cout << "Loaded " << to_human_readable(db.get_num_entries_loaded())
<< " of " << to_human_readable(db.get_num_entries_total())
<< " phylo-k-mers. " << std::endl << std::endl;

const auto tree = i2l::io::parse_newick(db.tree());
Expand Down Expand Up @@ -325,7 +353,7 @@ int main(int argc, char** argv)
average_speed += seq_per_second;

// Update progress bar
bar.set_option(option::PrefixText{humanize(seq_per_second) + " seq/s "});
bar.set_option(option::PrefixText{to_human_readable(seq_per_second) + " seq/s "});
bar.set_option(option::PostfixText{std::to_string(num_seq_placed) + " / ?"});
bar.set_progress(reader.bytes_read());

Expand All @@ -339,12 +367,12 @@ int main(int argc, char** argv)

average_speed /= (double)num_iterations;
bar.set_option(option::PrefixText{"Done. "});
bar.set_option(option::PostfixText{std::to_string(num_seq_placed)});
bar.set_option(option::PostfixText{to_human_readable(num_seq_placed)});
bar.set_progress(reader.bytes_read());

std::cout << std::endl << termcolor::bold << termcolor::white
<< "Placed " << num_seq_placed << " sequences.\nAverage speed: "
<< humanize(average_speed) << " seq/s.\n";
<< to_human_readable(average_speed) << " seq/s.\n";
std::cout << "Output: " << jplace_filename << std::endl;

const auto placement_time = std::chrono::duration_cast<std::chrono::milliseconds>(
Expand Down
2 changes: 2 additions & 0 deletions epik/src/epik/place.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,8 @@ std::vector<placement> filter_by_ratio(const std::vector<placement>& placements,

placed_collection placer::place(const std::vector<seq_record>& seq_records, size_t num_threads)
{
(void)num_threads;

/// There may be identical sequences with different headers. We group them
/// by the sequence content to not to place the same sequences more than once
const auto sequence_map = group_by_sequence_content(seq_records);
Expand Down
31 changes: 0 additions & 31 deletions scripts/config.json

This file was deleted.

0 comments on commit ed82b29

Please sign in to comment.