Transcription factors play an important role in the gene regulation. To search for the regulatory regions in the genome the Transcription Factor Binding Sites Scan (TFBSscan) was developed.
Please mention that CONDA is needed for the installation of the TFBSscan.
- Clone the directory
git clone https://github.com/petrokvitka/TFBScan
- Switch to the directory
cd TFBSscan
- Create the needed environment from the file tfbs.yaml
conda env create --file tfbs.yaml
- Activate the environment
source activate tfbs
- You are ready to use the TFBSscan! To learn how to use TFBSscan, just type
python tfbsscan.py
The TFBSscan requires two input files:
- file with transcription factor motifs, which can be downloaded from JASPAR database in .MEME or .PFM format;
- a file with genome in FASTA format.
To optimize the scanning, a file in the .BED format with peaks can also be provided. The peak is a so called open region or in other words the region of interest in the ATAC-seq. If the .BED file with peaks was provided, the TFBSscan run will be way faster, as we will look for the needed information only within the regions we are interested in.
As an output the TFBSscan produces the file in .BED format with all found binding sites for each motif provided as input. If only one motif was provided as input, only one output file will be created. For three input motifs three output files in .BED format will be outputed. If the output file is empty, no binding sites for this motif were found.
There are currently two different tools a user can choose in between when starting the TFBSscan:
- FIMO (Find Individual Motif Occurences) as a part of the MEME Suite. There is a possibility to pass additional parameters to FIMO. TFBSscan uses by default the next parameters for FIMO call:
--text
to print the FIMO output in a simple text file and--no-qvalue
because the calculation of q-values is not possible while using the parameter--text
. - MOODS (Motif Occurrence Detection Suite).
FIMO and MOODS need the input files in different formats (.MEME for FIMO and .PFM for MOODS). The TFBSscan allows an automatic convertion of these both formats regarding on the tool the user wants to use.
There are several options to customize the TFBSscan:
--output_directory
to set the desired output directory, by default the output files will be saved to the path ./output/;--bed_file
to provide a file with peaks of interest to search within in .BED format;--use [fimo/moods]
to choose the tool (by default TFBSscan uses FIMO);--clean [all/nothing/cut_motifs/fimo_output/merge_output/moods_output]
set one or more option from the list to be able to chose the temporare files from the output directory after the run. By default TFBSscan deletes all temporary files and will leave only the output files with binding sites for each motif in .BED format;--cores
to set how many cores the TFBSscan is allowed to use while starting the multiprocessing. By default only 2 cores are allowed;--fimo "option"
pass additional options for the FIMO run within the quotation marks. For example, a background file created with fasta-get-markov can be passed;--p_value
set the p-value for the search. By default the p-value is 1e-4 or 0.0001;--resolve_overlaps
to delete the ovelaps with lower score. By default no overlaps are deleted;--moods_bg 0.25 0.25 0.25 0.25
to set hte background for MOODS. By default the standard MOODS background is used which is 0.25 0.25 0.25 0.25;--hide_info
to print the information about the run only to the log-file. By default all the information is printed to the terminal.
To try how the TFBSscan works, the test data was uploaded to the folder example. The file with 3 random motifs was downloaded from the JASPAR database in .MEME format. The fasta file mm10 with the first chromosome is used to create the file in the example_output folder. The TFBSscan call used for this example is:
python tfbsscan.py -g ../chr1.fa -m example/3_motifs.meme -o example_output
There are three output files in .BED format in the example_output folder and a logfile containing information about the run. Each of the output .BED files contains the binding sites of the corresponding motif for the first chromosome of the mm10 genome.