rfasta is designed for bioinformaticians and protein scientists who need a fast, reliable tool for parsing, cleaning, and manipulating protein sequence FASTA files.
rfasta is a direct port of the Python package protfasta to Rust. The port greatly improves performance when parsing, cleaning, and manipulating large protein sequence FASTA files, such as those from UniRef.
Sanitize a FASTA file - removes duplicate records and invalid sequences (in this case, all sequences containing noncanonical amino acids)
rfasta clean --non-unique-header --duplicate-record remove --invalid-sequence remove test.fasta -o output.fasta
[INFO]: Read in file with 100005 lines
[INFO]: Parsed file to recover 11085 sequences
[INFO]: Removed 68 of 11085 sequences due to invalid characters
[INFO]: Removed 1 of 11017 sequences due to duplicate records
[INFO]: Wrote 11016 sequences to output.fasta
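The cleaning pass logged above can be sketched in plain Python. This is only an illustration of the idea, not rfasta's implementation or API: it assumes that "invalid" means any residue outside the 20 canonical amino acids and that a "duplicate record" repeats both header and sequence. The function name `clean_records` and the example headers are hypothetical.

```python
# Illustrative sketch only; this is not rfasta's implementation or API.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def clean_records(records):
    """Drop invalid sequences, then drop duplicate (header, sequence) records.

    `records` is an iterable of (header, sequence) tuples.
    Returns (kept, n_invalid, n_duplicate).
    """
    kept = []
    seen = set()
    n_invalid = 0
    n_duplicate = 0
    for header, seq in records:
        # Assumption: a sequence is invalid if any residue is noncanonical.
        if not set(seq.upper()) <= CANONICAL:
            n_invalid += 1
            continue
        # Assumption: a duplicate record repeats both header and sequence.
        key = (header, seq)
        if key in seen:
            n_duplicate += 1
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept, n_invalid, n_duplicate


records = [
    ("sp|P1|A", "MKTAYIAK"),
    ("sp|P2|B", "MKXAYIAK"),  # X is not a canonical residue
    ("sp|P1|A", "MKTAYIAK"),  # exact duplicate of the first record
    ("sp|P3|C", "GSHMLEDP"),
]
kept, n_invalid, n_duplicate = clean_records(records)
print(f"Removed {n_invalid} invalid, {n_duplicate} duplicate; kept {len(kept)}")
```

Filtering invalid sequences before deduplicating mirrors the order of the log messages above, so the "of N sequences" counts shrink the same way.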
Split a FASTA file - shards the cleaned output into 3 chunks
rfasta split --output-dir . --chunks 3 output.fasta
[INFO]: Read in file with 110670 lines
[INFO]: Parsed file to recover 11016 sequences
[INFO]: Wrote 3672 sequences to ./output_000001.fasta
[INFO]: Wrote 3672 sequences to ./output_000002.fasta
[INFO]: Wrote 3672 sequences to ./output_000003.fasta
[INFO]: Split FASTA into 3 chunks
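The chunk sizes in the log above (3 chunks of 3672 from 11016 sequences) are consistent with an even, contiguous split. A minimal sketch of that arithmetic follows. How rfasta distributes a remainder when the count does not divide evenly is not stated here, so the "first chunks take one extra record" rule is an assumption, and both helper names are hypothetical.

```python
# Illustrative sketch only; this is not rfasta's implementation or API.
def split_into_chunks(records, n_chunks):
    """Split `records` into `n_chunks` contiguous, near-equal lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(records), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # Assumption: the first `extra` chunks take one additional record.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def chunk_filename(stem, index):
    """Name chunks like output_000001.fasta (1-based, six digits)."""
    return f"{stem}_{index:06d}.fasta"


sizes = [len(c) for c in split_into_chunks(list(range(11016)), 3)]
print(sizes)                        # [3672, 3672, 3672]
print(chunk_filename("output", 1))  # output_000001.fasta
```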
- Core functionality for:
  - Parsing: Read and interpret protein FASTA files efficiently.
  - Cleaning: Remove invalid entries and ensure sequences conform to biological standards.
  - Manipulation: Shard large protein sequence FASTA files efficiently.
- Rust CLI integration for command-line use cases.
- Python bindings via PyO3 for seamless Python library integration.
- High performance with optimized parsing for large-scale FASTA files (e.g., UniRef datasets).
- Early-stage development: additional features, documentation, and PyPI deployment will follow in subsequent releases.
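
As a rough illustration of the parsing feature listed above, a FASTA file can be read as a stream of (header, sequence) pairs, holding only one record in memory at a time. The sketch below is plain Python and does not use or reflect rfasta's Python bindings. The function name `iter_fasta` and the example records are hypothetical.

```python
# Illustrative sketch only; this is not rfasta's implementation or API.
import io


def iter_fasta(handle):
    """Yield (header, sequence) tuples from an open text handle."""
    header = None
    parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(parts)
            header = line[1:]
            parts = []
        elif header is not None:
            # Sequence lines may wrap; collect them until the next header.
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


example = io.StringIO(">seq1 first\nMKTA\nYIAK\n\n>seq2 second\nGSHM\n")
parsed = list(iter_fasta(example))
print(parsed)
```

Because the reader is a generator, the same loop works on an in-memory string or on a multi-gigabyte file opened with `open(path)`.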