
File Handling

Rajan edited this page Nov 29, 2022 · 4 revisions

File-handling Python scripts

extract_accession_no.py

This script takes a multi-FASTA file as input and writes all the accession numbers to a new file (accession_no.txt)

$ python extract_accession_no.py <Multi_FASTA_File>
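
As a rough illustration of the idea, the sketch below pulls an identifier out of each header line. It assumes the accession number is the first whitespace-delimited token after the ">" marker, which may differ from what extract_accession_no.py actually does; the function name and the sample records are hypothetical.

```python
def extract_accessions(lines):
    """Return the first whitespace-delimited token of each FASTA header."""
    accessions = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip headers that are empty after ">"
                accessions.append(fields[0])
    return accessions


# Made-up records, for illustration only.
example = [
    ">ID_0001.1 example organism A\n",
    "ACGTACGTAC\n",
    ">ID_0002.1 example organism B\n",
    "TTGACCGTAA\n",
]
accessions = extract_accessions(example)
print("\n".join(accessions))
```

Writing the returned list to accession_no.txt, one identifier per line, would reproduce the output file described above.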

extract_fasta_headers.py

This script takes a (multi-)FASTA file as input and writes the sequence headers to a new file (fasta_headers.txt)

$ python extract_fasta_headers.py <Multi_FASTA_File>

extract_fasta_records.py

This script extracts the FASTA records from a multi-FASTA file whose accession numbers are listed in the accession IDs file

$ python extract_fasta_records.py <Multi_FASTA_File> <Accession_IDs_File>
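
One way to picture the filtering step is sketched below. It is not the script's own code: it assumes each record's ID is the first token of its header and that the IDs file holds one ID per line. The names read_fasta and select_records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the last record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)


def select_records(fasta_lines, wanted_ids):
    """Keep records whose first header token is in wanted_ids."""
    wanted = {item.strip() for item in wanted_ids if item.strip()}
    selected = []
    for header, sequence in read_fasta(fasta_lines):
        fields = header.split()
        if fields and fields[0] in wanted:
            selected.append((header, sequence))
    return selected


# Made-up records, for illustration only.
fasta = [">A1 first", "ACGT", "TT", ">B2", "GGGG", ">C3 third", "CC"]
picked = select_records(fasta, ["A1\n", "C3\n"])
for header, sequence in picked:
    print(">" + header)
    print(sequence)
```

Holding the wanted IDs in a set keeps each lookup constant-time, which matters when the IDs file is long.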

fasta_record_finder.py

This script extracts the FASTA record whose accession number is entered by the user from a multi-FASTA file and writes the record to a new file (NC_XXXXXX.fasta)

$ python fasta_record_finder.py <Multi_FASTA_File>

fasta_concatenator.py

This script merges all files with the .fasta extension and creates a new file (multi_fasta)

$ python fasta_concatenator.py

multi_fasta_deconcatenator.py

This script splits a multi-FASTA file into individual FASTA files

$ python multi_fasta_deconcatenator.py <Multi_FASTA_File>
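
The splitting step might look roughly like the sketch below. The output naming scheme (first header token plus ".fasta") is an assumption, not something this page specifies, and the function name split_multi_fasta is hypothetical. The sketch also assumes the header token is safe to use as a file name.

```python
import tempfile
from pathlib import Path


def split_multi_fasta(lines, out_dir):
    """Write each record to its own file in out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        for line in lines:
            if line.startswith(">"):
                # A new header closes the previous record's file.
                if handle is not None:
                    handle.close()
                fields = line[1:].split()
                name = fields[0] if fields else f"record_{len(written) + 1}"
                path = out_dir / f"{name}.fasta"
                handle = path.open("w")
                written.append(path)
            if handle is not None:
                handle.write(line.rstrip("\n") + "\n")
    finally:
        if handle is not None:
            handle.close()
    return written


# Demonstration on made-up records in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = split_multi_fasta([">ECO_1\n", "AAAA\n", ">SAL_1\n", "GGGG\n"], tmp)
    demo_names = [p.name for p in paths]
    demo_second = paths[1].read_text()
print(demo_names)
```

Lines that appear before the first header are dropped, since they do not belong to any record.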

file_comparison.py

This script compares two files and returns the elements present in one file but not in the other

$ python file_comparison.py -f1 <File_1> -f2 <File_2>
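
A minimal sketch of such a comparison follows, assuming each line of a file is one element and that surrounding whitespace is not significant. Whether file_comparison.py reports one direction or both is not stated here, so the hypothetical helper only_in_first covers a single direction and can be called twice.

```python
def only_in_first(lines_a, lines_b):
    """Return elements of lines_a missing from lines_b, in first-seen order."""
    seen_b = {line.strip() for line in lines_b}
    result = []
    emitted = set()
    for line in lines_a:
        item = line.strip()
        # Skip blanks, shared elements, and duplicates already reported.
        if item and item not in seen_b and item not in emitted:
            result.append(item)
            emitted.add(item)
    return result


# Made-up file contents, for illustration only.
file_1 = ["geneA\n", "geneB\n", "geneC\n", "geneA\n"]
file_2 = ["geneB\n", "geneD\n"]
missing_from_2 = only_in_first(file_1, file_2)
missing_from_1 = only_in_first(file_2, file_1)
print(missing_from_2, missing_from_1)
```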

ftp_download.py

This script downloads all the files whose FTP addresses are listed in the ftpfilepaths file

$ python ftp_download.py <ftpfilepaths>

seq_concatenator.py

This script takes a multi-FASTA file of gene sequences and concatenates them according to accession ID (as shown below)

Multi-Fasta file [ INPUT ]

>ECO_1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>ECO_2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>ECO_3
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>SAL_1
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>SAL_2
GCGCGCGGGCGCGCGCGCGCGCGCGCGCGCGC
>SAL_3
TATATTATATATTATATATTTATATAATAATA

concatenated_seq.fasta file [ OUTPUT ]

>ECO
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>SAL
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GCGCGCGGGCGCGCGCGCGCGCGCGCGCGCGC
TATATTATATATTATATATTTATATAATAATA

$ python seq_concatenator.py <Multi-Fasta>
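
The grouping in the example above can be sketched as follows. This assumes the accession ID is the part of the header before the last underscore (ECO_1 becomes ECO), which matches the sample input but may not be how seq_concatenator.py derives it. The function names are hypothetical.

```python
def concatenate_by_prefix(lines):
    """Group sequence lines under the header prefix before the last "_"."""
    groups = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                current = None
                continue
            current = fields[0].rsplit("_", 1)[0]
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(line)
    return groups


def format_fasta(groups):
    """Render one header per group followed by its sequence lines."""
    out = []
    for name, sequences in groups.items():
        out.append(">" + name)
        out.extend(sequences)
    return "\n".join(out) + "\n"


# Shortened version of the sample input shown above.
sample = [">ECO_1", "AAAA", ">ECO_2", "TTTT", ">SAL_1", "GGGG", ">SAL_2", "GCGC"]
grouped = concatenate_by_prefix(sample)
print(format_fasta(grouped), end="")
```

A plain dict keeps groups in the order they first appear, so the output preserves the input ordering of ECO before SAL.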

extract_seq.py

Extract the nucleotide or protein sequence in a particular index range (e.g. 200...300) from a FASTA file

$ python extract_seq.py <file.fasta>
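
A sketch of the slicing involved is given below. It assumes the range is 1-based and inclusive at both ends, so 200...300 would return 101 residues. The page does not say which convention extract_seq.py uses, and the function name subsequence is hypothetical.

```python
def subsequence(sequence, start, end):
    """Return residues start..end of sequence, 1-based and inclusive."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(
            f"range {start}...{end} is outside 1...{len(sequence)}"
        )
    # Python slices are 0-based and half-open, hence start - 1.
    return sequence[start - 1:end]


print(subsequence("ACGTACGT", 2, 4))
```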

compare_bed.py

Compare two BED files for sequence overlaps

$ python compare_bed.py file1.bed file2.bed
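
As an illustration of an overlap test, the sketch below treats BED intervals as 0-based and half-open, so two intervals on the same chromosome overlap when each starts before the other ends. It compares every pair on a chromosome, which is simple but quadratic. It is not taken from compare_bed.py, and the function names are hypothetical.

```python
def parse_bed(lines):
    """Parse the first three BED columns into (chrom, start, end) tuples."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def find_overlaps(bed_a, bed_b):
    """Return the overlapping region for every overlapping pair."""
    by_chrom = {}
    for chrom, start, end in bed_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    overlaps = []
    for chrom, start, end in bed_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            if start < b_end and b_start < end:
                overlaps.append((chrom, max(start, b_start), min(end, b_end)))
    return overlaps


# Made-up intervals, for illustration only.
bed_1 = parse_bed(["chr1\t100\t200", "chr2\t50\t60"])
bed_2 = parse_bed(["chr1\t150\t250", "chr1\t200\t300", "chr2\t0\t50"])
shared = find_overlaps(bed_1, bed_2)
print(shared)
```

Intervals that merely touch, such as chr1 100-200 and chr1 200-300, are not reported because the end coordinate is exclusive.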

gdc_download.py

Download all the files whose IDs are listed in a gdc_manifest file (downloaded from the TCGA GDC portal)

$ python gdc_download.py <gdc_manifest.txt>

clustal_to_fasta.py

Convert a multiple sequence alignment (MSA) file in Clustal Omega format to FASTA format

$ python clustal_to_fasta.py <file.clustal_num> <file.fasta>
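
The conversion can be sketched as below, under a few assumptions: the file starts with a "CLUSTAL" header line, each alignment row is a name followed by a sequence chunk and an optional residue count, and conservation lines begin with whitespace. This is a simplified reading of the format rather than the script's own parser, and the function name is hypothetical.

```python
def clustal_to_fasta(lines):
    """Join the per-block chunks of each sequence and render them as FASTA."""
    sequences = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blanks, the CLUSTAL header, and conservation lines,
        # which start with whitespace.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        sequences.setdefault(name, []).append(chunk)
    out = []
    for name, chunks in sequences.items():
        out.append(">" + name)
        out.append("".join(chunks))
    return "\n".join(out) + "\n"


# Made-up alignment, for illustration only.
alignment = [
    "CLUSTAL O(1.2.4) multiple sequence alignment",
    "",
    "seqA      ACGT-ACG 7",
    "seqB      ACGTTACG 8",
    "          **** ***",
    "",
    "seqA      TT 9",
    "seqB      T- 9",
    "          * ",
]
fasta_text = clustal_to_fasta(alignment)
print(fasta_text, end="")
```

Gap characters are kept as they are, so the output is an aligned FASTA file rather than the ungapped sequences.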

clustal_to_tsv.py

Convert a multiple sequence alignment (MSA) file in Clustal Omega format to .tsv format

$ python clustal_to_tsv.py <file.clustal_num>

fasta2db_feed.py

Feed sequence data as a hash into a MySQL database using the Python connector

$ python fasta2db_feed.py <sequence.fasta>

mysqldb_find.py

Feed a hash into a MySQL database using the Python connector

$ python mysqldb_find.py

fastq2fasta.py

Convert sequences in FASTQ format to FASTA format

$ python fastq2fasta.py <seq.fastq>
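
A minimal sketch of the conversion follows, assuming the common four-line FASTQ layout (header, sequence, "+" separator, quality) with no line wrapping and no blank lines inside records. fastq2fasta.py may handle more cases than this, and the function name is hypothetical.

```python
def fastq_to_fasta(lines):
    """Convert four-line FASTQ records to FASTA header and sequence lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-blank lines")
    out = []
    for i in range(0, len(stripped), 4):
        header, sequence, separator, _quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Drop the quality line and swap the "@" marker for ">".
        out.append(">" + header[1:])
        out.append(sequence)
    return out


# Made-up reads; the second quality string deliberately starts with "@".
reads = ["@read1 desc", "ACGT", "+", "IIII", "@read2", "GG", "+", "@I"]
fasta_lines = fastq_to_fasta(reads)
print("\n".join(fasta_lines))
```

Reading records by position rather than scanning for "@" avoids misreading quality strings that happen to begin with "@".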

fasta2fastq.py

Convert sequences in FASTA format to FASTQ format, using the given read length and coverage

$ python fasta2fastq.py -f <sequence.fasta> -l <read_length> -x <coverage> -o <sequence.fastq>