Chunked dedup (#107)
* cKDTree implementation of chunked deduplication, two backends: scipy and sklearn

* traditional cython backend: duplication is not transitive with non-zero max mismatch; this is fixed in the scipy/sklearn backends

* documentation of chunked deduplication

* --carryover for proper dedup of chunks for scipy and sklearn

* cython version revived; parentID added to cython.

* Fix parse "-" for stdin

* multiple refactoring of pairtools stats and dedup

* Up lowest python to 3.7 in travis

* Up lowest python in python-package

* Require python>=3.7

* Some formatting and minor docs changes
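
The transitivity point in the commit message can be illustrated with a minimal sketch. This is not the pairtools implementation: the commit describes cKDTree-based backends, while the sketch below uses a brute-force pairwise comparison and a union-find structure from the standard library only. The function name `transitive_duplicate_clusters`, the `(pos1, pos2)` input layout, and the rule that both sides must be within `max_mismatch` are assumptions made for illustration; chromosomes and strands are ignored.

```python
from itertools import combinations


def transitive_duplicate_clusters(positions, max_mismatch):
    """Label duplicate clusters, treating duplication as transitive.

    positions: list of (pos1, pos2) tuples, one per pair (hypothetical layout).
    Two pairs are direct duplicates when both coordinates differ by at most
    max_mismatch (an assumed criterion for this sketch).
    Returns one label per pair: the index of the first member of its cluster.
    """
    parent = list(range(len(positions)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        # Merge two clusters, keeping the smaller index as the root.
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Brute-force neighbor search; a spatial index would replace this loop.
    for i, j in combinations(range(len(positions)), 2):
        (a1, a2), (b1, b2) = positions[i], positions[j]
        if abs(a1 - b1) <= max_mismatch and abs(a2 - b2) <= max_mismatch:
            union(i, j)

    return [find(i) for i in range(len(positions))]


# Pair 0 is close to pair 1, and pair 1 is close to pair 2,
# but pair 0 and pair 2 are more than max_mismatch apart on the first side.
pairs = [(100, 500), (102, 501), (104, 502), (300, 900)]
labels = transitive_duplicate_clusters(pairs, max_mismatch=3)
print(labels)
```

In this example pairs 0 and 2 are not direct duplicates, yet they receive the same label because both are duplicates of pair 1. A non-transitive scheme that only compares each pair against a single retained read could leave pair 2 outside the cluster.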
Phlya authored Apr 4, 2022
1 parent 333a2bb commit 7e4712f
Showing 13 changed files with 1,333 additions and 568 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-package.yml
@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
- python-version: [3.6, 3.7, 3.8]
+ python-version: [3.7, 3.8, 3.9]

steps:
- uses: actions/checkout@v2
@@ -52,7 +52,7 @@ jobs:
conda info -a
# Create test environment and install deps
- conda create -q -n test-environment python=${{ matrix.python-version }} setuptools pip cython numpy pandas nose samtools pysam
+ conda create -q -n test-environment python=${{ matrix.python-version }} setuptools pip cython numpy pandas nose samtools pysam scipy
source activate test-environment
pip install click
python setup.py build_ext -i
4 changes: 4 additions & 0 deletions .gitignore
@@ -93,3 +93,7 @@ ENV/

# cython compiled C extension
_*.c
*.DS_Store

# VS code settings
.vscode/*
3 changes: 2 additions & 1 deletion .travis.yml
@@ -2,9 +2,10 @@ language: python
python:
# We don't actually use the Travis Python, but this keeps it organized.
# - "3.5"
- - "3.6"
+ # - "3.6"
- "3.7"
- "3.8"
- "3.9"
install:
# We do this conditionally because it saves us some downloading if the version is the same.
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
10 changes: 10 additions & 0 deletions CHANGES.md
@@ -1,3 +1,13 @@
### Next release
* dedup: implemented in chunks with two new backends ("scipy", "sklearn"). Now allows to
record the readID of the retained "parent" read from a duplicate cluster in an extra
field in the file with duplicates. New backends rely on the header to define column
order in the file; specification through CLI arguments works for the "cython" backend,
but it will be removed in a future version.
Note that with non-zero max-mismatch the behaviour of the new backends can be
different from the old "cython": now duplication is transitive (i.e. if read A is a
duplicate of read B, and read B - of read C, reads A and C are now considered
duplicates).
### 0.3.1 (2021-02-XX) ###

* sample: a new tool to select a random subset of pairs