Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse2: created #96

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 44 additions & 46 deletions doc/parsing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -107,12 +107,10 @@ If the read is long enough (e.g. larger than 150 bp), it may contain more than t

Molecules like these typically form via multiple ligation events and we call them
walks [1]_. The mode of walks reporting is controlled by ``--walks-policy`` parameter of ``pairtools parse``.
You can report all the alignments in the reads by using ``pairtools parse2`` (see :ref:`parse2`).

A pair of sequential alignments on a single read is **ligation junction**. Ligation junctions are the Hi-C contacts
that have been directly observed in the experiment. They are reported in lower-case letters if walks policy
is set to ``all`` (default). For details, wee :ref:`section-complex-walks-rescue`

However, traditional Hi-C pairs do not have direct evidence of ligation
that have been directly observed in the experiment. However, traditional Hi-C pairs do not have direct evidence of ligation
because they arise from read pairs that do not necessarily contain ligation junction.
To filter out the molecules with complex walks, ``--walks-policy`` can be set to:

Expand All @@ -122,50 +120,8 @@ To filter out the molecules with complex walks, ``--walks-policy`` can be set to
- ``3any`` to report the 3'-most alignment on each side,
- ``3unique`` to report the 3'-most unique alignment on each side.


.. _section-complex-walks-rescue:

Rescuing complex ligations
-------------------------

The complex walks are DNA molecules containing more than one ligation junction that may end up in more than one alignment
on forward, reverse, or both reads:

.. figure:: _static/rescue_modes.svg
:width: 60 %
:alt: Different modes of reporting complex walks
:align: center

Different modes of reporting complex walks

``pairtools parse`` detects such molecules and **rescues** them with ``--walks-policy all``.

Briefly, the algorithm of complex ligation walks rescue detects all the unique ligation junctions, and do not report
the same junction as a pair multiple times. Importantly, these duplicated pairs might arise when both forward and reverse
reads read through the same ligation junction. However, these cases are successfully merged by ``pairtools parse``:

.. figure:: _static/rescue_modes_readthrough.svg
:width: 60 %
:alt: Reporing complex walks in case of readthrough
:align: center

Reporing complex walks in case of readthrough

To restore the sequence of ligation events, there is a special field ``junction_index`` that can be reported as
a separate column of .pair file by setting ``--add-junction-index``. This field contains information on:

- the order of the junction in the recovered walk, starting from 5'-end of forward read
- type of the junction:

- "u" - unconfirmed junction, right and left alignments in the pair originate from different reads (forward or reverse). This might be indirect ligation (mediated by other DNA fragments).
- "f" - pair originates from the forward read. This is direct ligation.
- "r" - pair originated from the reverse read. Direct ligation.
- "b" - pair was sequenced at both forward and reverse read. Direct ligation.
With this information, the whole sequence of ligation events can be restored from the .pair file.


.. _section-single-ligation-rescue:

Rescuing single ligations
-------------------------

Expand Down Expand Up @@ -247,6 +203,48 @@ longer ones as "null" alignments. The maximal size of ignored *gaps* is set by
the ``--max-inter-align-gap`` flag (by default, 20bp).


Parse2
-------------------------

If your Hi-C has long reads, you may want to report all the alignments in the reads with ``pairtools parse2``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe specify read length where it would benefit? e.g.
If your Hi-C has long reads (>50bp).

Would this work on nanopore reads or not? Maybe good to mention somewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

150 bp allows you to save 10% of the simple Hi-C library on DpnII. On Nanopore this won't work for now. Good points, will mention!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by #99


The complex walks are DNA molecules containing more than one ligation junction that may end up in more than one alignment
on forward, reverse, or both reads:

.. figure:: _static/rescue_modes.svg
:width: 60 %
:alt: Different modes of reporting complex walks
:align: center

Different modes of reporting complex walks

``pairtools parse2`` detects such molecules and **rescues** them.

Briefly, the algorithm of complex ligation walks rescue detects all the unique ligation junctions, and do not report
the same junction as a pair multiple times. Importantly, these duplicated pairs might arise when both forward and reverse
reads read through the same ligation junction. However, these cases are successfully merged by ``pairtools parse2``:

.. figure:: _static/rescue_modes_readthrough.svg
:width: 60 %
:alt: Reporting complex walks in case of readthrough
:align: center

Reporing complex walks in case of readthrough

To restore the sequence of ligation events, there is a special field ``junction_index`` that can be reported as
a separate column of .pair file by setting ``--add-junction-index``. This field contains information on:

- the order of the junction in the recovered walk, starting from 5'-end of forward read
- type of the junction:

- "u" - unconfirmed junction, right and left alignments in the pair originate from different reads (forward or reverse). This might be indirect ligation (mediated by other DNA fragments).
- "f" - pair originates from the forward read. This is direct ligation.
- "r" - pair originated from the reverse read. Direct ligation.
- "b" - pair was sequenced at both forward and reverse read. Direct ligation.
With this information, the whole sequence of ligation events can be restored from the .pair file.


.. _section-single-ligation-rescue:

.. [1] Following the lead of `C-walks <https://www.nature.com/articles/nature20158>`_

Expand Down
1 change: 1 addition & 0 deletions pairtools/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,7 @@ def wrapper(*args, **kwargs):
from .pairtools_restrict import restrict
from .pairtools_phase import phase
from .pairtools_parse import parse
from .pairtools_parse2 import parse2
from ._parse import parse_cigar, parse_algn # TODO: is this import needed?
from .pairtools_stats import stats
from .pairtools_sample import sample
Expand Down
Loading