Last updated: Apr. 24, 2024
- GNU Parallel (tested on 20230722)
- Python (tested on 3.12.0)
- pygini (tested on 1.0.1)
- pandas (tested on 2.1.2)
- numpy (tested on 1.26.0)
- PyTorch ((tested on 2.1.2, for deep learning-based model only)
- scipy (tested on 1.11.4, for deep learning-based model only)
- R (tested on 4.3.2)
- CMplot (tested on 4.4.3)
- (other packages required as dependencies)
- Copy number variation data (for logistic regression-based model only)
- Single-cell Hi-C contact matrices
Note: All files are required to be at 1M resolution. Reliability is not guaranteed at other resolutions. See File format requirements for examples. File names are currently hard-coded. Modify the section below if needed:
In ecDNAcaller
:
####################
cnv_name="1000000.CNV.bedGraph"
mat_name="matrix.mtx"
####################
In ecDNAcaller_deep.py
:
####################
mat_name = "matrix.mtx"
####################
We recommend to run ecDNAcaller
and ecDNAcaller_deep.py
in a conda/mamba environment.
conda create -n ecDNAcaller python parallel pandas numpy PyTorch scipy
conda activate ecDNAcaller
pip install pygini
Then clone this repository to a local directory.
-
Make sure the example data (LC675 brain tumor dataset) is downloaded.
-
Run command below (modify
-t
ofecDNAcaller
and the last parameter ofecDNAcaller_deep.py
according to the number of threads available):
sh ecDNAcaller -i example_data -o example_out/lm -p 0.95 -t 32
python ecDNAcaller_deep.py example_data example_out/deep 32
-
Find
example_out/lm/example_data_summary_0.95_prob/example_data_count_freq_ecDNA.txt
. -
Find
example_data_summary_all.txt
,example_data_summary_ecDNA.txt
andexample_data_summary_HSR.txt
in directoryexample_out/deep
. -
Compare files generated by yourself with those in the repository's
example_out
. All files should be identical. -
Modify the following line with the actual path of files above in
CMPlot.R
:
count_freq_file <- "example_out/ecDNA_summary_example_data_0.95/example_data_count_freq.txt"
- Then run
CMPlot.R
in interactive mode and generate plots.
sh ecDNAcaller -i <INPUT_PATH> -o <OUTPUT_PATH> -p <PROB_THRESHOLD> -t <NUM_THREADS>
sh ecDNAcaller -i <INPUT_PATH> -o <OUTPUT_PATH> -p <PROB_THRESHOLD> -t <NUM_THREADS> -s <BOOL_OPTION>
Arguments:
-i
input directory. See Directory hierarchy requirements for details.
-o
output directory. See Directory hierarchy requirements for details.
-p
probability cutoff threshold. 0.95 is recommended by default.
-t
number of threads. Must be an integer.
-s
(optional) summary-only mode. Use -s true
to skip redundant cell processing (for example, if you need to change
the probability cutoff threshold and re-summarize).
Note: current version will ignore all interactions that involve chromosome Y ('chrY'
).
Output example:
> ls -R example_out
example_out:
example_data_cache example_data_summary_0.95_prob
example_out/example_data_cache:
LC675_AAACGAAAGGGTTCTT.txt LC675_GTTACGAGTATGGGTG.txt
...
example_out/example_data_summary_0.95_prob:
example_data_bin_barcode_matrix_cnv.txt example_data_bin_barcode_matrix_gini.txt
example_data_bin_barcode_matrix_pred.txt example_data_bin_barcode_matrix_ratio.txt
example_data_count_freq_ecDNA.txt
Seek example_data_bin_barcode_matrix_pred.txt
for the final prediction result.
Seek example_data_count_freq_ecDNA.txt
for the contact frequency information summary.
We developed a deep learning-based method that utilizes combined ideas from image classification as well as logistic regression model above to achieve a more specific and efficient ecDNA/HSR detection.
With the same directory hierarchy, this deep learning-based model only requires
matrix.mtx
for each cell as input.
python ecDNAcaller_deep.py <INPUT_PATH> <OUTPUT_PATH> <NUM_PROCESSES>
python ecDNAcaller_deep.py <INPUT_PATH> <OUTPUT_PATH> <NUM_PROCESSES>d # DEV_MODE
The script will output three summary files to the designated directory for 1) ecDNA alone, 2) HSR alone and 3) all
together (that can also be
processed by CMPlot.R
to generate a Manhattan plot) and a
bin-barcode binary matrix file in which row indices represent 10Mb bins from chr1 to chrX and column names are cell
barcodes. In the matrix, 0 represents None, 1 represents ecDNA and 2 represents HSR. Due to the algorithm design,
the first 2 bins of chromosome 1 and the last 2 bins of chromosome X will be padded with 0s.
The model currently runs on CPU only to allow for multiprocessing. Processing speed is at about 1.2 seconds per cell per core (Apple M1 Pro), which is approximately 5X faster than the logistic regression model. The deep learning-based model does not generate cache (individual files for each cell), but it can be fast enough to reprocess all cells in a reasonable amount of time after an interruption.
Note: to strictly reproduce the results in the paper, please add a d
at the end of the last parameter (without a
space) to
enable DEV mode, such as 32d
.
This will make the model ignore cells with a valid contact matrix file but without a valid CNV file (which is relatively
rare).
Thus, the validation of this deep learning-based model will be based
on the exactly same set of cells as the logistic regression-based model.
This is because CNV files are mandatory for the logistic regression-based model, but not required for the deep
learning-based model.
If DEV mode is disabled, the model will process all cells that have a readable contact matrix file, which may (in some
cases) result in a slightly different,
but more accurate result.
Note: please run CMPlot.R
in interactive mode for flexibility. modify variable assignment count_freq_file
to read
count frequency file.
Minimal requirements:
> ls -R example_data
example_data:
LC675_AAACGAAAGGGTTCTT LC675_CAGCCTTAGCTGATTC
...
example_data/LC675_AAACGAAAGGGTTCTT:
1000000.CNV.bedGraph matrix.mtx
> head 1000000.CNV.bedGraph
chr1 0 1000000 0.0
chr1 1000000 2000000 1.4897193720663133
chr1 2000000 3000000 1.267909097945521
chr1 3000000 4000000 1.3012399754809931
chr1 4000000 5000000 1.2818908405862612
chr1 5000000 6000000 0.5364818463718226
chr1 6000000 7000000 0.8262895800095347
chr1 7000000 8000000 1.3324624442235051
chr1 8000000 9000000 0.9503856707800673
chr1 9000000 10000000 1.4120115495979793
Note: _process.py
will add column names as:
'chr', 'start', 'end', 'cnv'
Thus, the columns must correspond.
> head matrix.mtx
chrom1 start1 end1 chrom2 start2 end2 count
chr1 1000000 2000000 chr1 1000000 2000000 41
chr1 1000000 2000000 chr1 9000000 10000000 1
chr1 1000000 2000000 chr1 18000000 19000000 1
chr1 1000000 2000000 chr1 19000000 20000000 1
chr1 1000000 2000000 chr6 148000000 149000000 1
chr1 1000000 2000000 chr9 20000000 21000000 1
chr1 1000000 2000000 chr19 24000000 25000000 1
chr1 1000000 2000000 chr20 3000000 4000000 1
chr1 1000000 2000000 chr20 61000000 62000000 1
- O. Tange (2011): GNU Parallel - The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.
- Yin, L. et al. rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated tool for Genome-Wide Association Study, Genomics, Proteomics & Bioinformatics (2021), doi: 10.1016/j.gpb.2020.10.007.
- The pandas development team (2020). pandas-dev/pandas: Pandas (Version latest). doi:10.5281/zenodo.3509134.
- Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2.
- mckib2. (n.d.). Mckib2/Pygini: Compute the gini index. GitHub. https://github.com/mckib2/pygini
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 721, 8026–8037.
- Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E.A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17(3), 261-272.
For any question, contact Ming Hu (hum@ccf.org), Jiachen Sun (jxs2269@case.edu), or submit an issue on GitHub.