
Sensitive detection of rare disease-associated cell subsets via representation learning #79

Closed
agitter opened this issue Aug 11, 2016 · 8 comments


agitter (Collaborator) commented Aug 11, 2016

http://doi.org/10.1101/046508

Rare cell populations play a pivotal role in the initiation and progression of diseases like cancer. However, the identification of such subpopulations remains a difficult task. This work describes CellCnn, a representation learning approach to detect rare cell subsets associated with disease using high dimensional single cell measurements. Using CellCnn, we identify paracrine signaling and AIDS onset associated cell subsets in peripheral blood, and minimal residual disease associated populations in leukemia with frequencies as low as 0.005%.

agitter (Collaborator, Author) commented Aug 11, 2016

They appear to be using convolutional neural networks for unordered data. This seems strange to me. Has anyone seen that before in other domains?

cgreene (Member) commented Aug 11, 2016

@agitter : Haven't seen it before, other than the graph stuff. How does convolution work without any sort of neighbor relationships? Isn't that the primary reason that one uses CNNs - to take advantage of that structure?

agitter (Collaborator, Author) commented Aug 11, 2016

@cgreene Exactly, I always thought that the point of convolution was to take advantage of neighbor relationships in one or more dimensions. From my ~5 minute read of this paper, they seem to be creating artificial random neighbor relationships from the unordered data for each cell in a biological sample and repeating that process many times. This may be their workaround for the problem that cell i in sample x does not correspond to cell i in sample y.

cgreene (Member) commented Aug 11, 2016

@agitter I agree that it is hard to figure that paper out. Back when we were first starting with these algorithms, we briefly considered clustering and using a dendrogram or similar to define neighbors for convolution. We decided that imposed too much structure and ended up moving forward with non-convolutional methods.

From my ~10 minute read they are not imposing order:

"In all our experiments, random cell subsets, drawn with replacement from the original cytometry samples, were used as multi-cell input training examples of CellCnn."

They do something extra for rare populations here:

"If we are interested in extremely rare populations (abundance < 1%) then we use a modified procedure for creating multi-cell inputs. 50% of a multi-cell input is sampled uniformly at random from the whole cell population whereas the other 50% is sampled from cells with high outlierness score."

but I don't think that would impose the type of structure usually used for convolution.

I have e-mailed the authors a link to this to see if they can provide some clarity. I would love to know if they compared against approaches that don't impose the structure of a CNN.
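For concreteness, the modified sampling procedure quoted above can be sketched in a few lines of NumPy. The dimensions and the outlierness score used here (distance from the marker-wise mean) are stand-ins for illustration, not the paper's actual choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of the modified multi-cell input construction: half
# of the subset is drawn uniformly with replacement, half from the cells
# with the highest "outlierness" score. The score used here (distance
# from the marker-wise mean) is a stand-in, not the paper's actual score.
cells = rng.normal(size=(5000, 10))   # 5000 cells x 10 markers (made up)
subset_size = 200

outlierness = np.linalg.norm(cells - cells.mean(axis=0), axis=1)
outlier_pool = np.argsort(outlierness)[-500:]   # top 10% most "outlying"

uniform_half = rng.choice(len(cells), size=subset_size // 2, replace=True)
outlier_half = rng.choice(outlier_pool, size=subset_size // 2, replace=True)
multi_cell_input = cells[np.concatenate([uniform_half, outlier_half])]
```

Note that because both halves are drawn independently with replacement, no ordering or neighbor structure is imposed on the resulting multi-cell input.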

hussius commented Aug 11, 2016

@cgreene Great that you emailed them! I was similarly intrigued/mystified when I looked at this a few months back. I also ran into an issue, which I can't recall now, when trying to run the code. It would be interesting to get some more insight from the authors.

agitter (Collaborator, Author) commented Aug 11, 2016

@cgreene I stared at Figure 1a for a while, and it makes more sense now. I erroneously read 2 or 3 convolutional filters as 2 or 3 convolutional layers. The network is actually quite small. My current understanding (subject to change) is that there is only 1 convolutional layer that contains 2 or 3 filters. Each filter is supposed to recognize a cell type signature; it transforms the mass cytometry marker values for a single cell into a scalar score. Then the pooling is over all of the cells, which either detects whether the signature was seen in any of the input cells (max) or the frequency with which it was seen (mean). The output layer makes a prediction using only these 2 or 3 inputs.

If I understand correctly, it is a very special case of a convolutional network where the convolutional layer receptive field is 1 and the pooling layer receptive field is k, where k is the number of cells in the multi-cell input (e.g., 1000). So you are right that they are not artificially imposing an order among the cells.

Now I'm curious what happens when the number of filters increases.
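The forward pass described above can be sketched in a few lines of NumPy. All dimensions, the ReLU nonlinearity, and the weight values are my assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the forward pass as described above; dimensions and
# the ReLU nonlinearity are assumptions, not the paper's.
n_cells, n_markers, n_filters = 1000, 35, 3

X = rng.normal(size=(n_cells, n_markers))    # one multi-cell input
W = rng.normal(size=(n_filters, n_markers))  # filters; receptive field = 1 cell

# The "convolution" degenerates to a per-cell dot product, so cell order
# is irrelevant.
activations = np.maximum(X @ W.T, 0.0)       # (n_cells, n_filters)

# Pooling over ALL cells: max asks "was this signature seen at all?",
# mean asks "how frequently was it seen?".
max_pooled = activations.max(axis=0)
mean_pooled = activations.mean(axis=0)

# The output layer sees only these 2 or 3 pooled values.
w_out = rng.normal(size=n_filters)
p_disease = 1.0 / (1.0 + np.exp(-(mean_pooled @ w_out)))
```

Because every per-cell score is computed independently and the pooling is a symmetric reduction, permuting the rows of `X` leaves the pooled outputs unchanged, which is exactly why no artificial cell order is needed.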

gokceneraslan commented Aug 15, 2016

@agitter Exactly, a somewhat unusual type of convnet where the convolution simply corresponds to a dot product of a filter vector and the markers of a single cell, since the number of "channels" is equal to the number of markers, if I understood correctly. A couple of things crossed my mind:

  • The problem is formulated as a supervised learning problem, so they could also try a single-layer MLP with three neurons (corresponding to the 3 filters in this approach) without any multiple-instance learning: every cell becomes an instance with a binary label, i.e. disease status. That would show what they gain by introducing MIL, so an MLP without MIL (or even a lasso model) would be a nice baseline, in my opinion.
  • They mention some difficulties of existing approaches, e.g. many irrelevant features and overfitting, but these can be avoided by applying regularization to any known supervised learner (like the lasso approach in Citrus). The runtime comparison with Citrus seems quite impressive, though.
  • The output layer is a fully connected layer, meaning there is a scalar weight attached to each filter. So interpreting just the filter activations of cells may not fully reflect what the network actually learned, because those activations are also weighted in the output layer. Hypothetically, in a case where a cell activates filters 1 and 2, the activation of filter 1 may not propagate into the final prediction if the weight of filter 1 in the output layer is very low.
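The last point can be made concrete with a toy calculation. All the numbers here are made up purely to illustrate the re-weighting effect:

```python
import numpy as np

# Toy illustration: pooled filter activations are re-weighted by the
# output layer, so a strongly activated filter can still contribute
# almost nothing to the prediction. All numbers are made up.
pooled = np.array([2.0, 0.1, 0.5])    # filter 1 fires strongly...
w_out = np.array([0.01, 3.0, 1.0])    # ...but carries a tiny output weight

contributions = pooled * w_out        # per-filter contribution to the logit
logit = contributions.sum()
```

Here filter 1 has by far the largest activation, yet its contribution to the logit is the smallest of the three, so inspecting activations alone would overstate its importance.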

agitter (Collaborator, Author) commented Aug 16, 2016

@gokceneraslan Comparing an MLP with and without MIL would be a good idea. They could also create a simple baseline by having each cytometry sample be an instance and using the marker means as input to the 3 hidden units. That could be contrasted with the MIL approach with max pooling and mean pooling.
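The marker-mean baseline suggested above could look something like the following. Dimensions and weights are illustrative assumptions, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the marker-mean baseline: each cytometry sample is summarized
# by its per-marker means and fed to a tiny MLP with 3 hidden units
# (mirroring the 3 filters). All dimensions and weights are illustrative.
n_samples, n_cells, n_markers, n_hidden = 8, 1000, 35, 3

samples = rng.normal(size=(n_samples, n_cells, n_markers))
features = samples.mean(axis=1)            # (n_samples, n_markers)

W1 = rng.normal(size=(n_markers, n_hidden))
w2 = rng.normal(size=n_hidden)
hidden = np.maximum(features @ W1, 0.0)    # ReLU, 3 hidden units
probs = 1.0 / (1.0 + np.exp(-(hidden @ w2)))
```

The key difference from the MIL model is that the mean is taken over raw marker values before any learned transformation, so rare subpopulations are averaged away before the network ever sees them.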

dhimmel pushed a commit to dhimmel/deep-review that referenced this issue Nov 3, 2017
Now width and height are both specified at 13 pixels, to constrain the aspect ratio of these SVGs as square. Previously, the icons appeared squished in DOCX exports. See manubot/rootstock#40
@cgreene cgreene closed this as completed Mar 12, 2018