This repo implements the methods used in the paper "Predicting drug resistance in M. tuberculosis using a Long-term Recurrent Convolutional Networks architecture". In the paper, we propose a model based on the Long-term Recurrent Convolutional Network (LRCN) whose hyper-parameters are tuned with Bayesian optimization. Please make sure to check out our paper here.
This repo uses the datasets generated by our pipeline (publicly available here) as the input for training and tuning the models. We have developed the loading_data package, which loads the created datasets efficiently for model training. Together with the utility functions developed here, this package generates the datasets used in the experiments described in the paper.
This directory contains code for generating the datasets explained in the paper.
This code creates the gene dataset from the SNP-based dataset. As explained in the paper, the gene_dataset is generated by counting the number of SNPs that fall within each gene. The result is a matrix in which the rows correspond to the sample isolates and the columns are the genes. Note that genes that do not contain any SNPs across all sample isolates are dropped from the table, since they carry no information.
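The counting step described above can be sketched as follows. This is a minimal illustration and not the repository's actual code: the SNP matrix, the gene coordinates, and the function name build_gene_dataset are all hypothetical, and it assumes SNP columns are labelled by genome position.

```python
import pandas as pd

# Hypothetical SNP matrix: rows are isolates, columns are SNP positions (1 = SNP present).
snps = pd.DataFrame(
    [[1, 0, 1, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0]],
    index=["isolate_1", "isolate_2", "isolate_3"],
    columns=[100, 250, 900, 1500, 2100],
)

# Hypothetical gene coordinates as inclusive (start, end) positions.
genes = {"geneA": (1, 500), "geneB": (800, 1000), "geneC": (1200, 2500)}


def build_gene_dataset(snps, genes):
    """Count SNPs per gene for every isolate and drop genes with no SNPs at all."""
    counts = {}
    for name, (start, end) in genes.items():
        # Columns of the SNP matrix whose position falls inside this gene.
        in_gene = [pos for pos in snps.columns if start <= pos <= end]
        counts[name] = snps[in_gene].sum(axis=1)
    gene_df = pd.DataFrame(counts)
    # Keep only genes that contain at least one SNP in at least one isolate.
    return gene_df.loc[:, (gene_df != 0).any(axis=0)]


gene_dataset = build_gene_dataset(snps, genes)
print(gene_dataset)
```

In this toy input, geneC has no SNPs in any isolate, so it is removed from the resulting table.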
This code finds the operon indexes in the gene-based dataset.
This code shuffles the features, as explained in the paper.
Here is a brief description of the core functions of the loading_data package.
process():
This function fetches and loads the dataset files based on the flags it is provided. The labels are also loaded and indexed according to the flags. Two pandas dataframes (df_train, labels) are returned.
- num_of_files: number of files to load (SNP-based data only). Note that the files are concatenated column-wise.
- nrow: number of rows loaded from each file. Setting it to a small number is handy when debugging and prototyping. Default is 0.
- gene:
- limited: if set to True, five of the drugs are dropped (ciprofloxacin, capreomycin, amikacin, ethionamide, moxifloxacin). Default is False.
- gene_dataset: if set to True, the gene-based data is loaded. Default is False.
- shuffle_index: if set to True, the shuffled_index data is loaded. Default is False.
- index_file: the index of the shuffled file to use. Default is 0.
- random_data: if set to True, the random data is loaded. Default is False.
- shuffle_operon: if set to True, the globally shuffled operons data is loaded. Default is False.
- shuffle_operon_locally: if set to True, the locally shuffled operons data is loaded. Default is False.
- shuffle_operon_group: if set to True, the grouped shuffled operons data is loaded. Default is False.
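To illustrate how num_of_files and nrow interact, here is a small sketch of loading several files with a row limit and concatenating them column-wise. It is not the package's implementation: the helper load_chunks, the CSV format, and the in-memory sample data are hypothetical, and treating nrow=0 as "load all rows" is an assumption based on the default value.

```python
import io

import pandas as pd

# Two hypothetical CSV chunks: same isolates (rows), different SNP columns.
chunk_texts = [
    "snp_1,snp_2\n0,1\n1,0\n1,1\n",
    "snp_3,snp_4\n1,1\n0,0\n0,1\n",
]


def load_chunks(sources, nrow=0):
    """Read each source and concatenate the resulting frames column-wise.

    nrow limits the rows read from each source; 0 is assumed to mean "all rows".
    """
    limit = nrow if nrow > 0 else None
    frames = [pd.read_csv(src, nrows=limit) for src in sources]
    return pd.concat(frames, axis=1)


# Full load: every row from every chunk.
df_full = load_chunks([io.StringIO(text) for text in chunk_texts])

# Prototyping load: only the first two rows of each chunk.
df_small = load_chunks([io.StringIO(text) for text in chunk_texts], nrow=2)

print(df_full.shape, df_small.shape)
```

A small nrow keeps the full feature width while shrinking the number of isolates, which is why it suits quick debugging runs.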
You will not need to modify or call the other methods from this package, unless you are using a customized dataset.
Once the dataset is loaded, Bayesian optimization is performed, based on the type of data, to find the model with the highest accuracy on the validation set.
This is the main class for running the models on the gene-based dataset. It simply loads the data using "data_preprocess.process()" and then runs the proper model on it.
This is the main class for running the models on the SNP-based dataset. It starts by loading the data using "data_preprocess.process()". The loaded data is then split for k-fold cross validation and, on each fold, Bayesian optimization is performed to tune the hyper-parameters of the model. Each model is evaluated based on its ROC.
You need to run the "run_bayesian()" function with the proper data as input. It runs the proper functions in the other classes and prints the output.
This file contains the main implementation of the model: the LRCN with k-fold cross validation and Bayesian optimization.
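The overall loop (split into k folds, tune hyper-parameters on each fold, score on the held-out fold) can be sketched with NumPy alone. This is only an illustration under loud assumptions: a closed-form ridge classifier stands in for the LRCN, a fixed list of candidate values stands in for Bayesian optimization, the score is the area under the ROC curve, and all names and the synthetic data are hypothetical.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression on labels mapped to -1/+1 (stand-in model)."""
    targets = 2.0 * y - 1.0
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ targets)


def cross_validate(X, y, alphas, k=5, seed=0):
    """Tune alpha inside each fold, then score the tuned model on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_aucs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Reserve the last quarter of the training part as a validation set.
        cut = int(0.75 * len(train_idx))
        fit_idx, val_idx = train_idx[:cut], train_idx[cut:]

        def val_auc(alpha):
            weights = fit_ridge(X[fit_idx], y[fit_idx], alpha)
            return roc_auc(y[val_idx], X[val_idx] @ weights)

        # Candidate search; the repository uses Bayesian optimization here instead.
        best_alpha = max(alphas, key=val_auc)
        weights = fit_ridge(X[train_idx], y[train_idx], best_alpha)
        fold_aucs.append(roc_auc(y[test_idx], X[test_idx] @ weights))
    return fold_aucs


# Synthetic, roughly linearly separable data for demonstration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) + 0.5 * rng.normal(size=200) > 0).astype(int)

fold_aucs = cross_validate(X, y, alphas=[0.01, 1.0, 100.0])
print(np.round(fold_aucs, 3))
```

Tuning only on data inside each training fold keeps the held-out fold untouched, so the per-fold scores are not biased by the hyper-parameter search.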
If you found the content of this repository useful, please cite us:
https://dl.acm.org/doi/abs/10.1145/3459930.3469534