FIDDLE: An integrative deep learning framework for functional genomic data inference #110
I saw this come across my Twitter feed yesterday. I haven't read beyond the abstract, but it seemed quite exciting and reminded me a fair bit of this recently published paper:
@cgreene That Nature Communications paper (MOMA) appears to have fairly different goals from FIDDLE. MOMA pertains to different types of entities in molecular networks (transcripts, proteins, metabolites, etc.) and phenotypes. FIDDLE is tailored to sequencing-based genomic data. I opened #112 if we want to discuss MOMA, which uses a recurrent neural network.
Biology
Computational aspects
Other questions and comments
Edited to reflect Umut's response below
Thank you for the comments, Anthony. This is a very nice working example of preprint review. Your comments are very useful for improving the work.
The points we missed in the paper, which will definitely be addressed in the updated version, are:
FIDDLE splits the data randomly at runtime rather than using pre-defined train/test sets. Therefore, every time you start training, it will produce a different test set. I ran the code many times and did not see any dependency on the test data. Another point:
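For illustration, a runtime random split might look like the minimal sketch below. This is not FIDDLE's actual code; the array arguments, the 10% test fraction, and the function name are assumptions for the example.

```python
import numpy as np

def random_split(inputs, targets, test_fraction=0.1, seed=None):
    """Shuffle example indices at runtime and split into train/test sets.

    `inputs` and `targets` are NumPy arrays with one example per row.
    The 0.1 test fraction is an illustrative choice, not FIDDLE's setting.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))      # a fresh split on every run when seed is None
    n_test = int(len(inputs) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (inputs[train_idx], targets[train_idx]), (inputs[test_idx], targets[test_idx])
```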
Accuracy is not the main evaluation metric, as it does not capture our goal of predicting the genomic signal. For example, a Gaussian distribution that always peaks at the major transcription start site (the highest TSS-seq signal) would give 100% accuracy, but would miss the minor TSS positions and would not reproduce the TSS-seq profile. The main evaluation metric is Kullback-Leibler divergence, or in other words relative entropy, which measures the divergence between two probability distributions. This evaluation metric generalizes to other genomic datasets. We reported accuracy in the TSS-seq prediction case because it is more intuitive.
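As a rough sketch of that metric (not FIDDLE's implementation), the KL divergence between an observed TSS-seq profile and a predicted profile can be computed by normalizing both to probability distributions over positions; the pseudocount `eps` and the toy vectors are assumptions added to avoid log(0).

```python
import numpy as np

def kl_divergence(observed, predicted, eps=1e-12):
    """KL(observed || predicted) over genomic positions.

    Both inputs are non-negative signal vectors of equal length; they are
    normalized to sum to 1, and `eps` (an assumed pseudocount) guards
    against division by zero and log(0).
    """
    p = np.asarray(observed, dtype=float) + eps
    q = np.asarray(predicted, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy example: a prediction that captures only the major TSS peak diverges
# more from the observed profile than one that also captures minor peaks.
observed   = [0.0, 5.0, 1.0, 0.0, 2.0, 0.0]
only_major = [0.0, 8.0, 0.0, 0.0, 0.0, 0.0]
with_minor = [0.1, 5.0, 1.2, 0.0, 1.8, 0.1]
print(kl_divergence(observed, only_major) > kl_divergence(observed, with_minor))  # True
```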
The main point of the paper is not achieving the highest performance, although it may have, compared to other classifiers. We aimed to show that multimodal convolutional neural networks can bypass feature specification for integrative genomics. For example, we do not specify nucleosome positions or detect ChIP-seq peaks in order to predict the single-nucleotide-resolution TSS distribution, which is a non-parametric, arbitrary distribution. The question of whether the performance comes from the data structure or the model is partially addressed by the sufficiency and necessity analysis. However, I agree that we should compare with other classifiers to see whether the multimodal ConvNets have a clear performance advantage in predicting peak positions. Overall, I find your comments really valuable.
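To make the multimodal idea concrete, here is a sketch of one way such an architecture could be wired up: a small convolutional branch per input track, concatenated into a shared representation that predicts a per-position TSS distribution. The track names, window length, layer sizes, and loss choice are assumptions for illustration; this is not the FIDDLE architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_track_branch(length, name):
    """One convolutional branch per input genomic track (e.g. ChIP-seq, MNase-seq)."""
    inp = layers.Input(shape=(length, 1), name=name)
    x = layers.Conv1D(32, 10, activation="relu", padding="same")(inp)
    x = layers.MaxPooling1D(4)(x)
    x = layers.Conv1D(64, 5, activation="relu", padding="same")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return inp, x

# Illustrative input tracks and window length; not FIDDLE's actual configuration.
window = 500
branches = [make_track_branch(window, name) for name in ("chipseq", "mnase", "netseq")]
inputs = [inp for inp, _ in branches]
merged = layers.concatenate([feat for _, feat in branches])   # fuse the modalities
hidden = layers.Dense(256, activation="relu")(merged)
# Predict a probability distribution over positions in the window (e.g. TSS-seq).
output = layers.Dense(window, activation="softmax", name="tss_distribution")(hidden)

model = Model(inputs, output)
model.compile(optimizer="adam", loss=tf.keras.losses.KLDivergence())
```

No hand-specified features (nucleosome positions, called peaks, etc.) enter this setup; each branch learns its own representation directly from the raw signal, which is the point the comment above is making.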
@ueser Thanks for your feedback; that clears up my questions. I edited my summary to correct the inaccuracies.
That's good to see. In some datasets we've worked with recently, the performance can differ by 3x depending on the train/test split.
Part of the reason I asked about comparing to other classifiers is the questions we are asking in this review. And if you want to contribute to the review in any way, it isn't too late to join in.
Umut also discusses these issues in his presentation on FIDDLE from the Models, Inference & Algorithms seminar at the Broad Institute.
Paper:
http://biorxiv.org/content/early/2016/10/17/081380
Authors:
Umut Eser, L. Stirling Churchman
Abstract: