
FIDDLE: An integrative deep learning framework for functional genomic data inference #110

laserson opened this issue Oct 18, 2016 · 6 comments

Paper:
http://biorxiv.org/content/early/2016/10/17/081380

Authors:
Umut Eser, L. Stirling Churchman

Abstract:

Numerous advances in sequencing technologies have revolutionized genomics through generating many types of genomic functional data. Statistical tools have been developed to analyze individual data types, but there lack strategies to integrate disparate datasets under a unified framework. Moreover, most analysis techniques heavily rely on feature selection and data preprocessing which increase the difficulty of addressing biological questions through the integration of multiple datasets. Here, we introduce FIDDLE (Flexible Integration of Data with Deep LEarning) an open source data-agnostic flexible integrative framework that learns a unified representation from multiple data types to infer another data type. As a case study, we use multiple Saccharomyces cerevisiae genomic datasets to predict global transcription start sites (TSS) through the simulation of TSS-seq data. We demonstrate that a type of data can be inferred from other sources of data types without manually specifying the relevant features and preprocessing. We show that models built from multiple genome-wide datasets perform profoundly better than models built from individual datasets. Thus FIDDLE learns the complex synergistic relationship within individual datasets and, importantly, across datasets.


cgreene commented Oct 18, 2016

I saw this come across my twitter feed yesterday. I haven't read beyond the abstract, but it seemed quite exciting and reminded me a fair bit of this recently published paper:
http://www.nature.com/articles/ncomms13090


agitter commented Oct 18, 2016

@cgreene That Nature Communications paper (MOMA) appears to have fairly different goals from FIDDLE. MOMA pertains to different types of entities in molecular networks (transcripts, proteins, metabolites, etc.) and phenotypes. FIDDLE is tailored to sequencing-based genomic data.

I opened #112 if we want to discuss MOMA, which uses a recurrent neural network.


agitter commented Oct 18, 2016

Biology

  • The overall framework is to take DNA sequence and k genomic datasets (i.e. genomic tracks) as input and predict a different genomic dataset
  • Goals include minimal preprocessing of each individual dataset to make the approach more generalizable and an integrative approach to combine information from multiple input tracks
  • The specific application here predicts Transcription Start Site-seq (TSS-seq) from NET-seq, MNase-seq, ChIP-seq, RNA-seq, and DNA sequence

Computational aspects

  • The high-level architecture builds a convolutional module for each type of input data and then joins these with a scaffold layer
  • Each convolutional module has 2 convolutional layers and a fully connected layer
  • The scaffold layer takes the individual convolutional modules' outputs as input and adds another convolutional layer and fully connected layer
  • The instances are 129k 500 base pair windows; if I understand correctly, these come from the 2 kb window around each gene's start site (generating 4 instances per yeast gene?) (edit: see below)
  • Split the instances into 128k training and 1k testing
  • No hyperparameter optimization, but the performance was not sensitive to hyperparameters (results not shown)
  • Evaluation splits the 500 bp window into 10 subwindows; true positive is when "both the model prediction and the TSS-seq data have their maxima within the same bin"
  • Compare models trained on a single type of input (50.2 - 61.2% accuracy) with the combined model (72.6% accuracy) and random input (10.5% accuracy)
  • 81.2% accuracy when one biological replicate (or replicates?) predicts another is seen as the upper bound on performance (edit: 3 biological replicates)
  • Assess whether the different input types are necessary and sufficient by replacing one or more types of input data with the mean value
  • TensorFlow and Torch implementations are available https://github.com/ueser/FIDDLE
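The binned-accuracy evaluation described above can be sketched in a few lines of numpy. This is a hypothetical reconstruction from the paper's description, not FIDDLE's actual code; the function name and array shapes are my own:

```python
import numpy as np

def same_bin_accuracy(pred, obs, n_bins=10):
    """Fraction of windows where the model prediction and the TSS-seq
    signal have their maxima (argmax) in the same subwindow (bin).
    pred and obs are (n_windows, 500) signal arrays."""
    width = pred.shape[1] // n_bins          # 500 bp / 10 bins = 50 bp per bin
    pred_bin = np.argmax(pred, axis=1) // width
    obs_bin = np.argmax(obs, axis=1) // width
    return float(np.mean(pred_bin == obs_bin))
```

Under this metric a random prediction lands in the correct 50 bp bin about 1 time in 10, consistent with the ~10.5% random-input baseline reported above.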

Other questions and comments

  • Results could be sensitive to the train/test split (edit: see below)
  • I may be misinterpreting how the instances are generated and/or the evaluation; if there are actually 4 windows per gene start site, then wouldn't only one of them contain the true positive start site per the TSS-seq data?
  • Would this evaluation metric generalize to other genomic datasets? Picking the correct 50 bp window of the 10 options may be specific to TSS-seq. (edit: KL divergence is the primary metric)
  • They do not compare to any other classifiers so it is hard to say how much of the performance comes from the neural network architecture and how much comes from the structure in the data

Edited to reflect Umut's response below


ueser commented Oct 18, 2016

Thank you for the comments, Anthony. This is a very nice working example of preprint review. Your comments are very useful for improving the work.

I may be misinterpreting how the instances are generated and/or the evaluation; if there are actually 4 windows per gene start site, then wouldn't only one of them contain the true positive start site per the TSS-seq data?

81.2% accuracy when one biological replicate (or replicates?) predicts another is seen as the upper bound on performance

The points I noticed that we missed in the paper, which will definitely be added in the updated version, are:

  • The 129K 500bp windows are taken from the 1kb region around the genes' start sites with a stride of 20bp, i.e. sliding the 500bp window by 20bp. Therefore, each gene generates (1kb-500bp)/20bp = 25 samples.
  • There were 3 biological replicates for TSS-seq, which gives 3 pairs for the accuracy measure and 6 ordered pairs for the KL-divergence measure (as KL divergence is asymmetric). We then averaged the accuracies and losses across all replicate pairs.
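The windowing scheme described in the first bullet can be sketched as follows. This is an illustrative numpy reconstruction under the stated assumptions (1 kb region, 500 bp window, 20 bp stride), not code from the FIDDLE repository:

```python
import numpy as np

def sliding_windows(region, window=500, stride=20):
    """Cut a 1D genomic signal array into overlapping fixed-size
    windows; each window becomes one training instance."""
    starts = range(0, len(region) - window, stride)
    return np.stack([region[s:s + window] for s in starts])

# A 1 kb region around a gene's start site yields
# (1000 - 500) / 20 = 25 windows, matching the count above.
```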

Results could be sensitive to the train/test split

FIDDLE splits the data randomly at runtime rather than using pre-defined train/test sets, so every training run draws a different test set. I ran the code many times and did not see any dependency on the particular test split.
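A runtime split of this kind can be sketched in numpy; this is only an illustration of the idea (the function name and the 128K/1K sizes taken from the summary above are assumptions, not FIDDLE's actual implementation):

```python
import numpy as np

def random_split(n_instances, n_test=1000, seed=None):
    """Draw a fresh random train/test partition each run instead of
    using a fixed, pre-defined split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_instances)
    return order[n_test:], order[:n_test]   # train indices, test indices
```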

Another point:
We did not split the data into train, test, and validation sets because my purpose was not to optimize the hyperparameters. Many people think that a neural network won't work if its hyperparameters are not optimized. However, if you stay within a reasonable parameter range, as described in Angermueller et al. 2016, it generally works. That said, I think it is important not to scare wet-bench biologists with such details.

Would this evaluation metric generalize to other genomic datasets? Picking the correct 50 bp window of the 10 options may be specific to TSS-seq.

The accuracy metric is not the main evaluation metric because it does not satisfy our goal of predicting the genomic dataset. For example, a Gaussian distribution that always peaks at the major transcription start site (the highest TSS-seq signal) would give 100% accuracy but would miss the minor TSS positions and would not simulate the TSS-seq data. The main evaluation metric is the Kullback-Leibler divergence, or in other words relative entropy, which measures the divergence between two probability distributions. This evaluation metric can be generalized to other genomic datasets. We used the accuracy metric in the TSS-seq prediction case because it is more intuitive.
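The KL-divergence metric can be sketched in numpy as below; this is a generic textbook formulation with an added smoothing constant (`eps`) of my own choosing, not the exact computation in FIDDLE:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two per-window signal profiles (e.g. observed
    vs. predicted TSS-seq), each normalized to a probability distribution.
    Note it is asymmetric: KL(p, q) != KL(q, p) in general, which is why
    3 replicates yield 6 ordered replicate pairs."""
    p = (p + eps) / np.sum(p + eps)   # smooth and normalize to sum to 1
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))
```

The divergence is zero when the two profiles are identical and grows as the predicted distribution diverges from the observed one.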

They do not compare to any other classifiers so it is hard to say how much of the performance comes from the neural network architecture and how much comes from the structure in the data

The main point of the paper is not achieving the highest performance, although it may well compare favorably with other classifiers. We aimed to show that multimodal convolutional neural networks can bypass feature specification for integrative genomics: for example, we do not specify nucleosome positions or detect ChIP-seq peaks in order to predict the single-nucleotide-resolution TSS distribution, which is a non-parametric, arbitrary distribution. The question of whether the performance comes from the structure of the data or from the model is partially addressed by the sufficiency and necessity analysis. However, I agree that we should compare with other classifiers to see whether multimodal ConvNets have a clear performance advantage in predicting peak positions.

Overall, I find your comments really valuable.


agitter commented Oct 19, 2016

@ueser Thanks for your feedback, that clears up my questions. I edited my summary to correct the inaccuracies.

I ran the code many times and did not see any dependency on the particular test split.

That's good to see. In some datasets we've worked with recently the performance can be 3x different depending on the train/test split.

However, I agree that we should compare with other classifiers to see if the multimodal ConvNets have a clear performance advantage in terms of predicting the peak positions.

Part of the reason I asked about comparing to other classifiers relates to the questions we are asking in this review. And if you want to contribute to the review in any way, it isn't too late to join in.


jbloom22 commented Dec 13, 2016

Umut also discusses these issues in his presentation on FIDDLE from the Models, Inference & Algorithms seminar at the Broad Institute.

dhimmel added a commit to dhimmel/deep-review that referenced this issue Feb 16, 2018