erikvdplas/binding-site-predictor

Binding site predictor

Data

DNA

DNA is stored in a 4D tensor that holds, for each training range, a one-hot encoding (a 4-element vector) of every nucleotide in that range (ranges are assumed to always be 200 nucleotides long). If a nucleotide is unknown ('N' in FASTA format), its entire one-hot vector is zero. Then, for each nucleotide separately, the ranges are combined into bits in a uint8 array for compactness (see numpy.unpackbits). Finally, the data is pickled into dna.pkl with protocol 2.
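The encoding and bit-packing described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual preprocessing code: the channel order `ACGT`, the helper name `one_hot`, the choice to pack along the range axis, and the toy 5-nucleotide ranges (instead of 200) are all assumptions made for the example.

```python
import pickle

import numpy as np

NUCLEOTIDES = "ACGT"  # assumed channel order; the real order is not documented here


def one_hot(seq):
    """Return a (len(seq), 4) uint8 array; unknown bases such as 'N' stay all-zero."""
    encoded = np.zeros((len(seq), 4), dtype=np.uint8)
    for i, base in enumerate(seq.upper()):
        j = NUCLEOTIDES.find(base)
        if j >= 0:  # find() returns -1 for anything outside ACGT
            encoded[i, j] = 1
    return encoded


# Toy ranges of length 5 (the README assumes 200 in practice).
ranges = ["ACGTN", "NNACG", "TTTTT"]
encoded = np.stack([one_hot(s) for s in ranges])  # shape (3, 5, 4)

# Pack the 0/1 values of 8 consecutive ranges into one byte (assumed axis).
packed = np.packbits(encoded, axis=0)  # shape (1, 5, 4)

# Serialize with pickle protocol 2, as the README states.
payload = pickle.dumps(packed, protocol=2)

# Reading back: unpack the bits and drop the zero padding added by packbits.
restored = np.unpackbits(pickle.loads(payload), axis=0)[: len(ranges)]
```

Because `numpy.packbits` pads the packed axis with zero bits up to a multiple of 8, the original number of ranges has to be known when unpacking, which is why the sketch slices the result back to `len(ranges)`.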

ChIP

Conservative ChIP-seq peak calls are collected per range and per protein-cell combination in a 3D tensor. The data is then pickled into chip.conservative.pkl with protocol 2.

For experiments with only one protein (currently the only kind of experiment supported), one can generate a protein-cell specific dataset by running python3 pick_chip.py --cell X --protein Y, where X and Y are the indices of the cell and the protein in the original tensor.
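Conceptually, picking one protein-cell pair amounts to slicing the 3D tensor at the two given indices. The sketch below only illustrates that idea and is not the implementation of pick_chip.py: the axis layout `(ranges, cells, proteins)`, the helper name `pick_pair`, and the random toy data are assumptions.

```python
import io
import pickle

import numpy as np


def pick_pair(chip, cell, protein):
    """Return the per-range labels for one cell/protein pair.

    Assumes a hypothetical axis layout of (ranges, cells, proteins).
    """
    return chip[:, cell, protein]


# Toy tensor: 6 ranges, 3 cells, 2 proteins, binary peak calls.
rng = np.random.default_rng(0)
chip = rng.integers(0, 2, size=(6, 3, 2), dtype=np.uint8)

labels = pick_pair(chip, cell=1, protein=0)  # shape (6,)

# Round-trip through pickle protocol 2, using an in-memory buffer
# in place of a file such as chip-X-Y.conservative.pkl.
buffer = io.BytesIO()
pickle.dump(labels, buffer, protocol=2)
buffer.seek(0)
restored = pickle.load(buffer)
```

If the real tensor uses a different axis order, only the indexing expression in `pick_pair` would need to change.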

Training

One can start training on a specific protein-cell pair by running python3 train.py --epochs E --chip-path chip-X-Y.conservative.pkl. Check python3 train.py --help for additional training options.

About

Neural model to predict DNA binding sites of proteins
