Formalized the NumPy + Numba solution for ordinal mapper #209

Merged 12 commits on Oct 4, 2024
4 changes: 3 additions & 1 deletion .github/workflows/main.yml
@@ -43,7 +43,9 @@ jobs:
run: pycodestyle .

- name: Run unit tests
run: coverage run -m unittest && coverage lcov
run: |
export NUMBA_DISABLE_JIT=1
coverage run -m unittest && coverage lcov

- name: Coveralls
uses: coverallsapp/github-action@v2
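Why the new `NUMBA_DISABLE_JIT=1` export matters: Numba-compiled functions bypass the Python interpreter, so `coverage.py` cannot trace their lines. Disabling JIT for the test run makes `@njit` functions execute as plain Python. A minimal standalone sketch (not part of Woltka) illustrating the switch:

```python
# Minimal sketch: with NUMBA_DISABLE_JIT=1 set before Numba compiles
# anything, @njit-decorated functions run as plain Python, so their
# lines are visible to coverage.py.
import os
os.environ['NUMBA_DISABLE_JIT'] = '1'  # must precede any compilation

from numba import njit

@njit
def add(a, b):
    return a + b

print(add(1, 2))  # 3 -- executed interpreted, traceable by coverage
```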
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,13 @@
# Change Log

## Version 0.1.6-dev
## Version 0.1.7-dev

### Added
- Formally adopted the NumPy + Numba solution in the ordinal mapper. This significantly accelerated the algorithm ([#209](https://github.com/qiyunzhu/woltka/pull/209)).

### Changed
- Changed default output subject coverage (`--outcov`) coordinates to BED-like (0-based, exclusive end). The output can be directly parsed by programs like bedtools. Also added support for GFF-like and other custom formats, as controlled by the parameter `--outcov-fmt` ([#204](https://github.com/qiyunzhu/woltka/pull/204) and [#205](https://github.com/qiyunzhu/woltka/pull/205)).
- Default chunk size is now 1024 for the plain and range mappers, and 2 ** 20 = 1048576 for the ordinal mapper. The latter represents the number of valid query-subject pairs ([#209](https://github.com/qiyunzhu/woltka/pull/209)).


## Version 0.1.6 (2/22/2024)
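For context on the changelog entry above: the speedup pattern is to keep coordinates in typed NumPy arrays and let a single Numba-compiled pass do the matching, instead of manipulating per-read Python objects. A hypothetical sketch of such a kernel — not Woltka's actual ordinal mapper, whose algorithm and data layout differ:

```python
# Hypothetical NumPy + Numba interval sweep (illustrative only).
# Assumes all arrays are sorted ascending and genes do not overlap.
import numpy as np
from numba import njit

@njit(cache=True)
def count_hits(gene_starts, gene_ends, read_pos):
    counts = np.zeros(gene_starts.size, dtype=np.int64)
    j = 0
    for i in range(gene_starts.size):
        while j < read_pos.size and read_pos[j] < gene_starts[i]:
            j += 1  # skip reads entirely left of gene i
        k = j
        while k < read_pos.size and read_pos[k] < gene_ends[i]:
            counts[i] += 1  # read k starts within gene i
            k += 1
    return counts

starts = np.array([10, 50, 90])
ends = np.array([30, 70, 120])
reads = np.array([5, 12, 20, 55, 100, 110])
print(count_hits(starts, ends, reads))  # [2 1 2]
```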
3 changes: 2 additions & 1 deletion ci/conda_requirements.txt
@@ -1,2 +1,3 @@
cython
numpy
numba
biom-format
4 changes: 2 additions & 2 deletions doc/cli.md
@@ -109,8 +109,8 @@ Option | Description

Option | Description
--- | ---
`--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,000 for plain or range mapping, or 1,000,000 for ordinal mapping.
`--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1024.
`--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
`--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1,024.
`--no-exe` | Disable calling external programs (`gzip`, `bzip2` and `xz`) for decompression. Otherwise, Woltka will use them if available for faster processing, or switch back to Python if not.


2 changes: 1 addition & 1 deletion doc/perform.md
@@ -62,7 +62,7 @@ Simple read-gene matching, with Numba [acceleration](install.md#acceleration) |

Two parameters visibly impact Woltka's speed:

- `--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,000 for plain or range mapping, or 1,000,000 for ordinal mapping.
- `--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
- `--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1024.

Their default values were set based on our experience. However, alternative values could improve (or reduce) performance depending on the computer hardware, input file type, and database capacity. If you plan to routinely process large batches of data with the same settings, we recommend doing a few test runs on a small dataset to identify the values that work best for you.
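As a rough illustration of what `--cache` buys — assuming the mechanism is memoization of recent subject-to-classification lookups (a simplification; Woltka's internals may differ):

```python
# Hedged sketch: an LRU cache serving repeated classifications.
# classify() is a hypothetical stand-in for an expensive lookup.
from functools import lru_cache

@lru_cache(maxsize=1024)  # analogous to --cache 1024
def classify(subject):
    return subject.rsplit('_', 1)[0]  # toy "genome of this gene" rule

for s in ['G000006925_1', 'G000006925_1', 'G000007525_2']:
    classify(s)

print(classify.cache_info())  # CacheInfo(hits=1, misses=2, ...)
```

Alignments against a small database tend to repeat subjects often, where a larger cache helps; with a huge database and few repeats, the cache matters less.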
56 changes: 1 addition & 55 deletions woltka/align.py
@@ -8,7 +8,6 @@
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


"""Functions for parsing alignment / mapping files.

Notes
@@ -45,7 +44,7 @@
from functools import lru_cache


def plain_mapper(fh, fmt=None, excl=None, n=1000):
def plain_mapper(fh, fmt=None, excl=None, n=1024):
"""Read an alignment file in chunks and yield query-to-subject(s) maps.

Parameters
@@ -619,59 +618,6 @@ def cigar_to_lens_ord(cigar):
return align, align + offset


def parse_sam_file_pd(fh, n=65536):
"""Parse a SAM file (sam) using Pandas.

Parameters
----------
fh : file handle
SAM file to parse.
n : int, optional
Chunk size.

Yields
------
None

Notes
-----
This is a SAM file parser using Pandas. It is slower than the current
parser. The `read_csv` is fast, but the data frame manipulation slows
down the process. It is here for reference only.
"""
return
# with pd.read_csv(fp, sep='\t',
# header=None,
# comment='@',
# na_values='*',
# usecols=[0, 1, 2, 3, 5],
# names=['qname', 'flag', 'rname', 'pos', 'cigar'],
# dtype={'qname': str,
# 'flag': np.uint16,
# 'rname': str,
# 'pos': int,
# 'cigar': str},
# chunksize=n) as reader:
# for chunk in reader:
# chunk.dropna(subset=['rname'], inplace=True)
# # this is slow, because of the function calls
# chunk['length'], offset = zip(*chunk['cigar'].apply(
# cigar_to_lens))
# chunk['right'] = chunk['pos'] + offset
# # this is slow, because of the function calls
# # chunk['qname'] = chunk[['qname', 'flag']].apply(
# # qname_by_flag, axis=1)
# # this is a faster method
# chunk['qname'] += np.where(
# chunk['qname'].str[-2:].isin(('/1', '/2')), '',
# np.where(np.bitwise_and(chunk['flag'], (1 << 6)), '/1',
# np.where(np.bitwise_and(chunk['flag'], (1 << 7)),
# '/2', '')))
# chunk['score'] = 0
# yield from chunk[['qname', 'rname', 'score', 'length',
# 'pos', 'right']].values


def parse_map_file(fh, *args):
"""Parse a simple mapping file.

4 changes: 2 additions & 2 deletions woltka/biom.py
@@ -164,8 +164,8 @@ def round_biom(table: biom.Table, digits=0):
digits : int, optional
Digits after the decimal point.

Notes
-----
Examples
--------
There is a fully vectorized, much faster alternative:

>>> arr = table.matrix_data.data
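The vectorized alternative in the docstring is truncated in this view. A sketch of how such in-place rounding of the table's sparse data could proceed — an assumption built on the visible first line, not the verbatim docstring:

```python
# Sketch: round a biom.Table's stored values in place, fully vectorized.
# The toy table and digits=0 are illustrative assumptions.
import numpy as np
from biom import Table

table = Table(np.array([[0.4, 1.6], [2.5, 0.0]]), ['O1', 'O2'], ['S1', 'S2'])
digits = 0

mat = table.matrix_data                    # SciPy sparse matrix backing the table
np.around(mat.data, digits, out=mat.data)  # round stored values in place
mat.eliminate_zeros()                      # drop entries that rounded to zero
print(table)
```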