Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pairtools 1.0.0 updates #117

Merged
merged 52 commits into from
Jun 1, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
5a1cc76
restrict fixes
agalitsyna Apr 7, 2022
7480fb5
pairtools flip fix for unannotated chromosomes, resolving https://git…
agalitsyna Apr 7, 2022
31d22f7
merge improvements, header merge fixed
agalitsyna Apr 8, 2022
4feda3a
merge improvements
agalitsyna Apr 8, 2022
b002dbe
I/O improvement
agalitsyna Apr 8, 2022
db465f6
Parse2 update (#99) (#109)
agalitsyna Apr 11, 2022
fed4699
black
agalitsyna Apr 11, 2022
5d5a5bc
separate cli and lib
agalitsyna Apr 12, 2022
12a4bf6
Merge pairlib into pairtools.lib.
agalitsyna Apr 12, 2022
1a8469b
scalings bugfix
agalitsyna Apr 12, 2022
215c370
CLI for scalings added.
agalitsyna Apr 14, 2022
bc53d1b
scaling tests
agalitsyna Apr 14, 2022
4a11db6
pairtools stats chromosome sizes
agalitsyna Apr 14, 2022
e946ef5
stats output in yaml format
agalitsyna Apr 14, 2022
0a809da
yaml dependency added to the workflows
agalitsyna Apr 14, 2022
4711f53
requirement update
agalitsyna Apr 14, 2022
c97a358
Python 3.10 added to workflow
agalitsyna Apr 14, 2022
e53b0ee
Python 3.10 fix
agalitsyna Apr 14, 2022
cdee78d
Header CLI (#121)
agalitsyna Apr 14, 2022
8a8a9fb
pairtools phase critical update (#114)
agalitsyna Apr 14, 2022
1bd71e6
Removed python 3.10 testing becuase of glibc problem in conda
agalitsyna Apr 14, 2022
edfa4f3
phase unmapped chrom fix
agalitsyna Apr 14, 2022
cfd2c06
add_cols fix for lib.parse
agalitsyna Apr 16, 2022
a1059e0
additional columns fix
agalitsyna Apr 19, 2022
072bacc
logger added
agalitsyna Apr 19, 2022
7f7fbaa
junction index update and parsing docs fix
agalitsyna Apr 19, 2022
578bbd4
Example walkthrough added.
agalitsyna Apr 19, 2022
ec05171
example cleanup
agalitsyna Apr 19, 2022
08fd8e6
minor change
agalitsyna Apr 20, 2022
9c26b51
imporant fixes: - cython dedup with no-parent id forgotten counter re…
agalitsyna Apr 25, 2022
4e9df8a
phasing fix of the phase2 bug
agalitsyna Apr 26, 2022
89ced33
walkthrough on phasing. improved walkthrough on parse
agalitsyna Apr 26, 2022
b8010e7
headerops get_colnames and viewframe input validation for scalings ad…
agalitsyna Apr 26, 2022
d07402e
scalings docstrings
agalitsyna Apr 26, 2022
0b4a6a8
bioframe dependency added
agalitsyna Apr 26, 2022
2bdac9a
Add summaries (#105)
Phlya Apr 27, 2022
32f3391
PairCounter merge fix
agalitsyna Apr 27, 2022
78fc54a
stats forgotten _fileio fix
agalitsyna Apr 27, 2022
40dd81c
R1/2 substituted with R1-2; docs improved
agalitsyna Apr 27, 2022
f4ac2d8
Benchmarks added
agalitsyna May 3, 2022
da087bb
Extract tile info much faster (pd.str.split sucks)
Phlya May 4, 2022
24f7a19
snakefile and restriction walkthrough update.
agalitsyna May 4, 2022
faf5878
Benchmarks finalization
agalitsyna May 11, 2022
3306195
benchmark fixes
agalitsyna May 11, 2022
e20267f
Fir dtype so it gets saved
Phlya May 19, 2022
9a21e15
[WIP] Stats split by filters (#132)
Phlya May 20, 2022
dabf4b7
Fix bug in complexity estimation
Phlya May 20, 2022
b08e8cd
Add yaml2pandas function
Phlya May 20, 2022
68ec83a
Make by tile dups int
Phlya May 20, 2022
7647057
Markasdup lib removed; markasdup CLI explanation improved
agalitsyna May 31, 2022
4ee6d67
dedup filter stats added and tested
agalitsyna Jun 1, 2022
ce150c3
Black
agalitsyna Jun 1, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 4 additions & 8 deletions .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,19 +3,15 @@

name: Python package

on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
on: push

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.7, 3.8, 3.9]
python-version: ["3.7", "3.8", "3.9"]

steps:
- uses: actions/checkout@v2
Expand All @@ -42,7 +38,7 @@ jobs:
export PATH="$HOME/miniconda/bin:$PATH"
hash -r
conda config --set always_yes yes --set changeps1 no
conda config --add channels conda-forge
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels bioconda
conda config --get
Expand All @@ -52,7 +48,7 @@ jobs:
conda info -a

# Create test environment and install deps
conda create -q -n test-environment python=${{ matrix.python-version }} setuptools pip cython numpy pandas nose samtools pysam scipy
conda create -q -n test-environment python=${{ matrix.python-version }} setuptools pip cython numpy pandas nose samtools pysam scipy pyyaml bioframe
source activate test-environment
pip install click
python setup.py build_ext -i
Expand Down
6 changes: 5 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ __pycache__/

# C extensions
*.so
*.c
*.cpp

# Distribution / packaging
.Python
Expand All @@ -18,7 +20,6 @@ dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
Expand Down Expand Up @@ -97,3 +98,6 @@ _*.c

# VS code settings
.vscode/*

# Files generated as the examples
examples/*
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ install:
- conda info -a

# Create test environment and install deps
- conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION setuptools pip cython numpy pandas nose samtools pysam
- conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION setuptools pip cython numpy pandas nose samtools pysam bioframe
- source activate test-environment
- pip install click
- python setup.py build_ext -i
Expand Down
134 changes: 134 additions & 0 deletions doc/_static/report-orientation.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
337 changes: 337 additions & 0 deletions doc/_static/report-positions.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
356 changes: 227 additions & 129 deletions doc/_static/rescue_modes.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
390 changes: 241 additions & 149 deletions doc/_static/rescue_modes_readthrough.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion doc/formats.rst
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ Pairtools' flavor of .pairs

.pairs files produced by `pairtools` extend .pairs format in a few ways.

1. `pairtools` store null/ambiguous/chimeric alignments as chrom='!', pos=0, strand='-'.
1. `pairtools` store null, unmapped, ambiguous (multiply mapped) and chimeric (if not parsed by `parse2` or `--walks-policy all` of `parse`) alignments as chrom='!', pos=0, strand='-'.

#. `pairtools` store the header of the source .sam files in the
'#samheader:' fields of the pairs header. When multiple .pairs files are merged,
Expand Down
126 changes: 79 additions & 47 deletions doc/parsing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -107,64 +107,27 @@ If the read is long enough (e.g. larger than 150 bp), it may contain more than t

Molecules like these typically form via multiple ligation events and we call them
walks [1]_. The mode of walks reporting is controlled by ``--walks-policy`` parameter of ``pairtools parse``.
You can report all the alignments in the reads by using ``pairtools parse2`` (see :ref:`parse2`).

A pair of sequential alignments on a single read is **ligation junction**. Ligation junctions are the Hi-C contacts
that have been directly observed in the experiment. They are reported in lower-case letters if walks policy
is set to ``all`` (default). For details, wee :ref:`section-complex-walks-rescue`

However, traditional Hi-C pairs do not have direct evidence of ligation
that have been directly observed in the experiment. However, traditional Hi-C pairs do not have direct evidence of ligation
because they arise from read pairs that do not necessarily contain ligation junction.
To filter out the molecules with complex walks, ``--walks-policy`` can be set to:

- ``mask`` to tag these molecules as type ``WW`` (single ligations are rescued, see :ref:`section-single-ligation-rescue`) ,
- ``mask`` to tag these molecules as type ``WW`` (single ligations are rescued, see :ref:`Rescuing single ligations`),
- ``5any`` to report the 5'-most alignment on each side,
- ``5unique`` to report the 5'-most unique alignment on each side,
- ``3any`` to report the 3'-most alignment on each side,
- ``3unique`` to report the 3'-most unique alignment on each side.


.. _section-complex-walks-rescue:
- ``3unique`` to report the 3'-most unique alignment on each side,
- ``all`` to report all sequential alignments (complex ligations are rescued, see :ref:`Rescuing complex walks`).

Rescuing complex ligations
-------------------------

The complex walks are DNA molecules containing more than one ligation junction that may end up in more than one alignment
on forward, reverse, or both reads:
Parse modes for walks:

.. figure:: _static/rescue_modes.svg
:width: 60 %
:alt: Different modes of reporting complex walks
:align: center

Different modes of reporting complex walks

``pairtools parse`` detects such molecules and **rescues** them with ``--walks-policy all``.

Briefly, the algorithm of complex ligation walks rescue detects all the unique ligation junctions, and do not report
the same junction as a pair multiple times. Importantly, these duplicated pairs might arise when both forward and reverse
reads read through the same ligation junction. However, these cases are successfully merged by ``pairtools parse``:

.. figure:: _static/rescue_modes_readthrough.svg
:width: 60 %
:alt: Reporing complex walks in case of readthrough
:alt: Parse modes for walks
:align: center

Reporing complex walks in case of readthrough

To restore the sequence of ligation events, there is a special field ``junction_index`` that can be reported as
a separate column of .pair file by setting ``--add-junction-index``. This field contains information on:

- the order of the junction in the recovered walk, starting from 5'-end of forward read
- type of the junction:

- "u" - unconfirmed junction, right and left alignments in the pair originate from different reads (forward or reverse). This might be indirect ligation (mediated by other DNA fragments).
- "f" - pair originates from the forward read. This is direct ligation.
- "r" - pair originated from the reverse read. Direct ligation.
- "b" - pair was sequenced at both forward and reverse read. Direct ligation.
With this information, the whole sequence of ligation events can be restored from the .pair file.


.. _section-single-ligation-rescue:

Rescuing single ligations
-------------------------
Expand Down Expand Up @@ -209,7 +172,7 @@ walks with three aligments using three criteria:

Sometimes, the "inner" alignment on the chimeric side can be non-unique or "null"
(i.e. when the unmapped segment is longer than ``--max-inter-align-gap``,
as described in :ref:`section-gaps`). ``pairtools parse`` ignores such alignments
as described in :ref:`Interpreting gaps between alignments`). ``pairtools parse`` ignores such alignments
altogether and thus rescues such *walks* as well.

.. figure:: _static/read_pair_UR_MorN.png
Expand All @@ -220,8 +183,6 @@ altogether and thus rescues such *walks* as well.
A walk with three alignments get rescued, when the middle alignment is multi- or null.


.. _section-gaps:

Interpreting gaps between alignments
------------------------------------

Expand All @@ -247,6 +208,77 @@ longer ones as "null" alignments. The maximal size of ignored *gaps* is set by
the ``--max-inter-align-gap`` flag (by default, 20bp).


Rescuing complex walks
-------------------------

We call the multi-fragment DNA molecule that is formed during Hi-C (or any other chromosome capture with sequencing) a walk.
If the reads are long enough, the right (reverse) read might read through the left (forward) read.
Thus, left read might span multiple ligation junctions of the right read.
The pairs of contacts that overlap between left and right reads are intermolecular duplicates that should be removed.

If the walk has no more than two different fragments at one side of the read, this can be rescued with simple
``pairtools parse --walks-policy mask``. However, in complex walks (two fragments on both reads or more than two fragments on any side)
you need specialized functionality that will report all the deduplicated pairs in the complex walks.
This is especially relevant if you have the reads length > 100 bp, since more than 20% or all restriction fragments in the genome are then shorter than the read length.
We put together some statistics about number of short restriction fragments for DpnII enzyme:

======== ================= ================== ================== ================== ==================
Genome #rfrags <50 bp <100 bp <150 bp <175 bp <200 bp
-------- ----------------- ------------------ ------------------ ------------------ ------------------
hg38 828538 (11.5%) 1452918 (20.2%) 2121479 (29.5%) 2587250 (35.9%) 2992757 (41.6%)
mm10 863614 (12.9%) 1554461 (23.3%) 2236609 (33.5%) 2526150 (37.9%) 2780769 (41.7%)
dm3 65327 (19.6%) 108370 (32.5%) 142662 (42.8%) 156886 (47.1%) 169339 (50.9%)
======== ================= ================== ================== ================== ==================

Consider the read with overlapping left and right sides:

.. figure:: _static/rescue_modes_readthrough.svg
:width: 60 %
:alt: Complex walk with overlap
:align: center

Such molecules are detected and **rescued** them. Briefly, we detects all the unique ligation junctions,
and do not report the same junction as a pair multiple times.

To rescue complex walks, you may use ``pairtools parse --walks-policy all`` and ``parse2``.
They have slightly different functionalities.

``pairtools parse --walks-policy all`` is used with regular paired-end Hi-C, when you want
all pairs in the walk to be reported as if they appeared in the sequencing data independently.

``parse2`` is used with single-end data or when you want to report different mode of orientation or position.
By default, ``parse2`` reports ligation junctions instead of outer ends of the alignmentns.
It may report also the position or orientation of the walk or of individual read.

The complete guide through the reporting options of ``parse2``, orientation reporting:

.. figure:: _static/report-orientation.svg
:width: 60 %
:alt: parse2 --report-orientation
:align: center


position reporting:

.. figure:: _static/report-positions.svg
:width: 60 %
:alt: parse2 --report-position
:align: center


To restore the sequence of ligation events, there are special fields ``walk_pair_index`` and ``walk_pair_type`` that you have as
a separate column of .pair file when setting ``--add-pair-index`` option.

- ``walk_pair_index`` contains information on the order of the pair in the recovered walk, starting from 5'-end of left read
- ``walk_pair_type`` describes the type of the pair relative to R1 and R2 reads of paired-end sequencing:

- "R1-2" - unconfirmed pair, right and left alignments in the pair originate from different reads (left or right). This might be indirect ligation (mediated by other DNA fragments).
- "R1" - pair originates from the left read. This is direct ligation.
- "R2" - pair originated from the right read. Direct ligation.
- "R1&2" - pair was sequenced at both left and right read. Direct ligation.
With this information, the whole sequence of ligation events can be restored from the .pair file.



.. [1] Following the lead of `C-walks <https://www.nature.com/articles/nature20158>`_

Expand Down
Binary file removed examples/Test_Parse_Walks/TestCase1.png
Binary file not shown.
Binary file removed examples/Test_Parse_Walks/TestCase2.png
Binary file not shown.
Binary file removed examples/Test_Parse_Walks/TestCase2a.png
Binary file not shown.
Binary file removed examples/Test_Parse_Walks/TestCase3.png
Binary file not shown.
Binary file removed examples/Test_Parse_Walks/TestCase4.png
Binary file not shown.
Binary file removed examples/Test_Parse_Walks/TestCase5.png
Binary file not shown.
Loading