Skip to content

fasta_diff

dytk2134 edited this page Sep 12, 2018 · 1 revision

Introduction

Compares two very similar FASTA files and outputs coordinate mappings using a multi stage algorithm.

Algorithm:

  • Stage 1: Find 100% matches

  • Stage 2: Find 100% substrings, where the full length of a new sequence can be found as a substring of a old sequence

  • Stage 3: Find cases where part of the sequence was converted into Ns
    1. split new seq by Ns into multiple substrings
    2. find an old seq with all substrings and also in order
    3. if found, write an alignment line for every substring
    4. check if the alignment line of every substring can be joining together
    5. check if the next base of the old sequence is same as the next base of the new sequence. If so extend both the end coordinate of old and new seuqnece in the 'alignment line'.

  • Stage 4: Find cases where a old sequence is split into two or more new sequences
    1. check if there is a old sequence mapped to two or more new sequences
    2. check if there is no overlap between the coordinate mappings of new sequences
    3. write alignment lines for every coordinate mappings

There is a overlap between the coordinate mappings of new1 and new2. Therefore only the old and new3 mapping will be remained.

How it works?

Coordinate mappings (0-based):

Old     0       9       New     0       9
Old     11      19      New     11      19
  • Outputs the 6 columns as tab-separated values: old_id, old_start, old_end, new_id, new_start, new_end
  • There are two models in this example: blue (0-based: 3 - 9; 1-based: 4 - 9) and red (0-based: 10 - 14; 1-based: 11 - 14)
  • After using the conversion program, the model with coordinates that are not contained within match_old_start to match_old_end will be removed. Therefore, the red one will be removed and the blue one will be remained.

Usage:

fasta_diff example_file/old.fa example_file/new.fa -o match.tsv -r report.txt

Note that fasta_diff can be piped with another coordinates_conversion program so that the intermediate match.tsv file isn’t saved to disk

Example progress output:

command

fasta_diff /app/data/other_species/diacit/diaci1.1/diaci1.1.fa /app/data/other_species/diacit/NCBI_RefSeq_diaci1.1/scaffold/121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.fa > match.tsv

output info

 INFO     Reading alignment data from: <stdin>...
 INFO     Reading old FASTA file (/app/data/other_species/diacit/diaci1.1/diaci1.1.fa)...
 INFO       Unique sequences: 163023
 INFO     Reading new FASTA file  (/app/data/other_species/diacit/NCBI_RefSeq_diaci1.1/scaffold/121845_ref_Diaci_psyllid_genome_assembly_version_1.1_chrUn.fa)...
 INFO       Unique sequences: 161988
 INFO     Stage 1 - match_identical_sequence:
 INFO       Matched sequences: 160938 (New : 160938)
 INFO       Unmatched sequences in old FASTA: 2085
 INFO       Unmatched sequences in new FASTA: 1050
 INFO     Stage 2 - match_truncated_sequence:
 INFO       Matched sequences: 161233 (New : 295)
 INFO       Unmatched sequences in old FASTA: 1790
 INFO       Unmatched sequences in new FASTA: 755
 INFO     Stage 3 - match_split_subsequence:
 INFO       Matched sequences: 161988 (New : 755)
 INFO       Unmatched sequences in old FASTA: 1035
 INFO       Unmatched sequences in new FASTA: 0
 INFO     Stage 4 - one_to_multiple_match (New : 0)
 INFO       Matched sequences: 161988 (New : 0)
 INFO       Unmatched sequences in old FASTA: 1035
 INFO       Unmatched sequences in new FASTA: 0

Example output file(match.tsv)

old_id old_start old_end new_id new_start new_end
Scaffold1 0 3368518 KK245166.1 0 3368518
Scaffold4139 2368 8532 JHOM01041610.1 0 6164
Scaffold1688 0 390 KK246853.1 0 390
Scaffold1688 2775 4110 KK246853.1 2775 4110
Scaffold1688 4670 5814 KK246853.1 4670 5814

Example report file

If -r [report] or --report [report] is given, fasta_diff.py will generate a report for the unmatched sequences in new FASTA file.

#Sequence_ID    Sequence_length Stage
#Stage 1 - match_identical_sequence:
JJRW01S0013007.1        93719   Stage 1
JJRW01S0001013.1        399926  Stage 1
JJRW01S0000843.1        319524  Stage 1
#Stage 2 - match_truncated_sequence:
JJRW01S0013007.1        93719   Stage 2
JJRW01S0001013.1        399926  Stage 2
#Stage 3 - match_split_subsequence:
JJRW01S0001013.1        399926  Stage 3
#Stage 4 - one_to_multiple_match:
JJRW01S0001013.1        399926  Stage 4

Running the program with –h prints the following help:

fasta_diff -h

usage: fasta_diff [-h] [-o [OUT]] [-r REPORT] [-d] [-hc] [-v]
                  old_fasta new_fasta

Compares two very similar FASTA files and outputs coordinate mappings using a multi stage algorithm:
Stage 1: Find 100% matches
Stage 2: Find 100% substrings, where the full length of a new sequence can be found as a substring of a old sequence
Stage 3: Find cases where part of the sequence was converted into Ns
Outputs the 6 columns as tab-separated values: old_id, old_start, old_end, new_id, new_start, new_end
Originally created to compare the NCBI version to the original reference of Diaphorina citri

Example:
    fasta_diff example_file/old.fa example_file/new.fa -o match.tsv -r report.txt

positional arguments:
  old_fasta             The original FASTA file
  new_fasta             The new FASTA file

optional arguments:
  -h, --help            show this help message and exit
  -o [OUT], --out [OUT]
                        The output alignment file (default: STDOUT)
  -r REPORT, --report REPORT
                        Generate a report for the unmatched sequences in new
                        FASTA file.
  -d, --debug           If set, partial results are saved in a
                        *_stage_i_pickle file after each stage, unmatched
                        sequences after each stage are saved in a
                        *_stage_i_unmatched file for both FASTA files.
  -hc, --header_check   If set, confirm the detected mapping by checking the
                        header of the new sequence for the id of the mapped
                        old sequence.
  -v, --version         show program's version number and exit