-
Notifications
You must be signed in to change notification settings - Fork 0
/
future_developers_TODOs.txt
18 lines (16 loc) · 3.13 KB
/
future_developers_TODOs.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Some TODOs for future developers of TAaCGH
+ Migrate 4_hom_stats_parts.py to R using persistent homology packages from R like GUDHI or TDA. Reasons: jplex is obsolete for which we need to include an executable java file (plex.jar) that is difficult to share. Asking for the user to have Java and the right version of Python requires an unnecessary extra mile for the user. The script also requires to transpose the original data which is usually very big. If the user works with several datasets it will require a lot of storage space.
+ Modify the program to stop using indexes used to relate the phenotype file with the aCGH file. That is recipe for disaster. An option is to use the column name to relate the files.
+ update 1_impute_aCGH.R to work with any file, or remove it and modify the manual
+ 11_class_pat_seg.R: need to create a warning when adding columns to phenotype file to avoid overwriting variables
+ Cleaning up the files from unused functions
+ Restructure/redesign to convert the set of scripts in a final product, preferable using only one script. A good starting point would be to collapse in one script: 4_hom_stats_parts.py, 5_sig_pcalc_parts.R and 6_FDR.R to rename it something like significant_sections.R. Scripts 1 to 3 are pre-processing and can be seen as tools
+ Improve outlier handling in 5_sig_pcalc_parts.R
+ Up to now 6_FDR.R can only by run when there is more than 1 part. It is easy to handled by making simple changes but need to be incoporated to the script, specially when the number of the part is different than one (which it is something we sometimes need to do when running B1 and trying to find amplicons)
+ The following scripts are useful but need an efficient way to pass parameters and adjust titles for generated files an graphs: ind_prof_origpat_local _sect.R, ind_prof_origpat_local.R, vis_avg_betti_curves.R
Some useful comments
+ CGH_start: you will see this variable in different programs and is VERY important. It indicates the program where does the information from the patients starts. Some times is the exact location of the first patient (in SET_dat_full.txt will be 6) but some times is that number minus 1 (5)
+ 4_hom_stats.py: this program generates B0 and B1 curves for specific dimensions. Originally it was designed to generate with some kind of algorithm all the possible dimensions that could be used. Later on it was modified to run specific dimensions
+ Phenotype file: must programs use the phenotype. Originally, previous developers had the phenotype selection embedded in each program, that was problematic because each new data set had a different format. That was modified but remnants of the original coding might be still present, specially some unused functions
Pending Research
+ The area between the curves does not take in consideration yet if test or control are on top. It might be also adding up areas where test is on top with areas where control is on top after each curve intersection. Research is pending on how to efficiently improve the statistic. Georgina Gonzalez did some research about it and share it with Javier Arsuaga. Machine learning was also thought to be aosmething to consider to approach the issue.