GitHub - ehan1990/dna-sequence-grouping: DNA sequence grouping

Results are in dna_files/results
Each result file follows this naming convention:
- {input_name}.d{dist}_c{count_of_top_unique_seq}_g{num_groups}_t{total_input_seq}.fasta
- e.g. sample.d1_c740_g40_t805.fasta (dist 1, top unique sequence seen 740 times, 40 groups, 805 input sequences in total)
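
The naming convention above can be sketched as a pair of helpers that build and parse a result file name. This is only an illustration: the function names `result_filename` and `parse_result_filename` are hypothetical and not taken from the repository, and the sketch assumes every numeric field (including `dist`) is a non-negative integer, as in the example.

```python
import re

# Hypothetical helpers; the names and the integer-only assumption are not from the repo.
_RESULT_NAME = re.compile(
    r"^(?P<input_name>.+)\.d(?P<dist>\d+)_c(?P<top_count>\d+)"
    r"_g(?P<num_groups>\d+)_t(?P<total>\d+)\.fasta$"
)


def result_filename(input_name, dist, top_count, num_groups, total):
    """Build a result file name following the convention above."""
    return f"{input_name}.d{dist}_c{top_count}_g{num_groups}_t{total}.fasta"


def parse_result_filename(filename):
    """Split a result file name back into its fields."""
    match = _RESULT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a result file name: {filename!r}")
    fields = match.groupdict()
    # Everything except the input name is numeric.
    return {k: (v if k == "input_name" else int(v)) for k, v in fields.items()}


print(result_filename("sample", 1, 740, 40, 805))
print(parse_result_filename("sample.d1_c740_g40_t805.fasta"))
```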

Each string is built from a 4-letter alphabet: A, T, C, and G.
Each string is ~200 letters long.
Each file contains ~10k strings.
Similarity depends on the position of each letter, not just on the number of letters two strings have in common.
If two 200-character strings match at 190 of their 200 positions, we consider them the same.
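
A minimal sketch of this position-based comparison, assuming a threshold of 190/200 = 0.95. The names `similarity` and `is_same` are hypothetical, and dividing by the longer length when the two strings differ in length is an assumption of this sketch, not something the description specifies.

```python
def similarity(a, b):
    """Fraction of positions at which a and b hold the same letter.

    Assumption: when lengths differ, unmatched trailing positions count
    as mismatches (the match count is divided by the longer length).
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def is_same(a, b, threshold=0.95):
    """Treat two strings as the same when enough positions match."""
    return similarity(a, b) >= threshold


base = "A" * 200
print(is_same(base, "A" * 190 + "T" * 10))  # 190 of 200 positions match
print(is_same(base, "A" * 189 + "T" * 11))  # 189 of 200 positions match
# Same letter counts, but no position agrees:
print(similarity("AATT", "TTAA"))
```

The last call shows why counting shared letters is not enough: AATT and TTAA contain identical letters yet match at no position.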

Read all strings from a FASTA file and find the unique strings (there is no need to compare strings across different files).
Find similar strings and group them together (e.g. AATT and AAAT).
Express similarity as a percentage.
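
The steps above could be sketched as follows. This is one possible approach, not necessarily what app.py does: it uses a greedy pass in which the most frequent unique string becomes a group representative and every later string joins the first representative it is similar enough to. The names `read_sequences`, `similarity`, and `group_sequences` are hypothetical, and the threshold is given as a fraction (0.95 for 95%).

```python
from collections import Counter


def read_sequences(lines):
    """Yield sequences from FASTA lines (header lines start with '>')."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
                chunks = []
        else:
            chunks.append(line)  # a record may span several lines
    if chunks:
        yield "".join(chunks)


def similarity(a, b):
    """Fraction of matching positions, relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def group_sequences(sequences, threshold=0.95):
    """Greedy grouping (an assumption of this sketch, not the repo's method)."""
    counts = Counter(sequences)  # unique strings with occurrence counts
    groups = []
    for seq, n in counts.most_common():  # most frequent first
        for group in groups:
            if similarity(seq, group["rep"]) >= threshold:
                group["count"] += n
                group["members"].append(seq)
                break
        else:
            groups.append({"rep": seq, "count": n, "members": [seq]})
    return groups


lines = [">a", "AATT", ">b", "AATT", ">c", "AAAT", ">d", "CCGG"]
# A 75% threshold is used here only because the toy strings are 4 letters long.
for group in group_sequences(read_sequences(lines), threshold=0.75):
    print(group["rep"], group["count"], group["members"])
```

Comparing each string only against group representatives keeps the number of comparisons well below all-pairs, at the cost of results that depend on the processing order.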

> sample-23 seq appeared 740 times
aaccgg
> sample-25 seq appeared 30 times
aacggg
> sample-80 seq appeared 3 times
ccaagg
...
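
A small sketch of how records in the format shown above might be written. The helper name `format_record` is hypothetical, and what the number after the input name (23, 25, 80) stands for is not stated in the source, so it is simply passed in as `seq_id`.

```python
def format_record(input_name, seq_id, count, sequence):
    """Format one result record: a header line followed by the sequence.

    seq_id is whatever identifier follows the input name in the header;
    its exact meaning is an open assumption of this sketch.
    """
    return f"> {input_name}-{seq_id} seq appeared {count} times\n{sequence}"


records = [("sample", 23, 740, "aaccgg"), ("sample", 25, 30, "aacggg")]
print("\n".join(format_record(*record) for record in records))
```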
