
Sensitive detection of rare disease-associated cell subsets via representation learning #79

Closed
agitter opened this issue Aug 11, 2016 · 8 comments


agitter (Collaborator) commented Aug 11, 2016

http://doi.org/10.1101/046508

Rare cell populations play a pivotal role in the initiation and progression of diseases like cancer. However, the identification of such subpopulations remains a difficult task. This work describes CellCnn, a representation learning approach to detect rare cell subsets associated with disease using high dimensional single cell measurements. Using CellCnn, we identify paracrine signaling and AIDS onset associated cell subsets in peripheral blood, and minimal residual disease associated populations in leukemia with frequencies as low as 0.005%.

agitter (Collaborator, Author) commented Aug 11, 2016

They appear to be using convolutional neural networks for unordered data. This seems strange to me. Has anyone seen that before in other domains?

cgreene (Member) commented Aug 11, 2016

@agitter : Haven't seen it before, other than the graph stuff. How does convolution work without any sort of neighbor relationships? Isn't that the primary reason that one uses CNNs - to take advantage of that structure?

agitter (Collaborator, Author) commented Aug 11, 2016

@cgreene Exactly, I always thought that the point of convolution was to take advantage of neighbor relationships in one or more dimensions. From my ~5 minute read of this paper, they seem to be creating artificial random neighbor relationships from the unordered data for each cell in a biological sample and repeating that process many times. This may be their workaround for the problem that cell i in sample x does not correspond to cell i in sample y.

cgreene (Member) commented Aug 11, 2016

@agitter I agree that it is hard to figure that paper out. Back when we were first starting with these algorithms, we briefly considered clustering and using a dendrogram or similar to define neighbors for convolution. We decided that imposed too much structure and ended up moving forward with non-convolutional methods.

From my ~10 minute read they are not imposing order:

"In all our experiments, random cell subsets, drawn with replacement from the original cytometry samples, were used as multi-cell input training examples of CellCnn."

They do something extra for rare populations here:

"If we are interested in extremely rare populations (abundance < 1%) then we use a modified procedure for creating multi-cell inputs. 50% of a multi-cell input is sampled uniformly at random from the whole cell population whereas the other 50% is sampled from cells with high outlierness score."

but I don't think that would impose the type of structure usually used for convolution.

I have e-mailed the authors a link to this to see if they can provide some clarity. I would love to know if they compared against approaches that don't impose the structure of a CNN.
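For concreteness, the modified sampling procedure quoted above can be sketched in a few lines of NumPy. The dimensions and the outlierness score used here (distance from the marker-wise mean) are stand-ins for illustration, not the paper's actual choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of the modified multi-cell input construction: half
# of the subset is drawn uniformly with replacement, half from the cells
# with the highest "outlierness" score. The score used here (distance
# from the marker-wise mean) is a stand-in, not the paper's actual score.
cells = rng.normal(size=(5000, 10))   # 5000 cells x 10 markers (made up)
subset_size = 200

outlierness = np.linalg.norm(cells - cells.mean(axis=0), axis=1)
outlier_pool = np.argsort(outlierness)[-500:]   # top 10% most "outlying"

uniform_half = rng.choice(len(cells), size=subset_size // 2, replace=True)
outlier_half = rng.choice(outlier_pool, size=subset_size // 2, replace=True)
multi_cell_input = cells[np.concatenate([uniform_half, outlier_half])]
```

Note that because both halves are drawn independently with replacement, no ordering or neighbor structure is imposed on the resulting multi-cell input.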

hussius commented Aug 11, 2016

@cgreene Great that you emailed them! I was similarly intrigued/mystified when I looked at this a few months back. I also ran into an issue, which I can't recall now, when trying to run the code. It would be interesting to get some more insight from the authors.

agitter (Collaborator, Author) commented Aug 11, 2016

@cgreene I stared at Figure 1a for a while, and it makes more sense now. I erroneously read 2 or 3 convolutional filters as 2 or 3 convolutional layers. The network is actually quite small. My current understanding (subject to change) is that there is only 1 convolutional layer that contains 2 or 3 filters. Each filter is supposed to recognize a cell type signature; it transforms the mass cytometry marker values for a single cell into a scalar score. Then the pooling is over all of the cells, which either detects whether the signature was seen in any of the input cells (max) or the frequency with which it was seen (mean). The output layer makes a prediction using only these 2 or 3 inputs.

If I understand correctly, it is a very special case of a convolutional network where the convolutional layer receptive field is 1 and the pooling layer receptive field is k, where k is the number of cells in the multi-cell input (e.g., 1000). So you are right that they are not artificially imposing an order among the cells.

Now I'm curious what happens when the number of filters increases.
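The forward pass described above can be sketched in a few lines of NumPy. All dimensions, the ReLU nonlinearity, and the weight values are my assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the forward pass as described above; dimensions and
# the ReLU nonlinearity are assumptions, not the paper's.
n_cells, n_markers, n_filters = 1000, 35, 3

X = rng.normal(size=(n_cells, n_markers))    # one multi-cell input
W = rng.normal(size=(n_filters, n_markers))  # filters; receptive field = 1 cell

# The "convolution" degenerates to a per-cell dot product, so cell order
# is irrelevant.
activations = np.maximum(X @ W.T, 0.0)       # (n_cells, n_filters)

# Pooling over ALL cells: max asks "was this signature seen at all?",
# mean asks "how frequently was it seen?".
max_pooled = activations.max(axis=0)
mean_pooled = activations.mean(axis=0)

# The output layer sees only these 2 or 3 pooled values.
w_out = rng.normal(size=n_filters)
p_disease = 1.0 / (1.0 + np.exp(-(mean_pooled @ w_out)))
```

Because every per-cell score is computed independently and the pooling is a symmetric reduction, permuting the rows of `X` leaves the pooled outputs unchanged, which is exactly why no artificial cell order is needed.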

gokceneraslan commented Aug 15, 2016

@agitter Exactly, a somewhat unusual type of convnet where the convolution simply corresponds to a dot product of a filter vector and the markers of a single cell, since the number of "channels" is equal to the number of markers, if I understood correctly. A couple of things crossed my mind:

  • The problem is formulated as a supervised learning problem, so they could also try a single-layer MLP with three neurons (corresponding to the 3 filters in this approach) without any multiple-instance learning: every cell becomes an instance with a binary label, i.e. disease status. That would show what they gain by introducing MIL, so an MLP without MIL (or even a lasso model) would be a nice baseline, in my opinion.
  • They mention some difficulties of existing approaches, e.g. many irrelevant features and overfitting, but these can be avoided by applying regularization to any known supervised learner (like the lasso approach in Citrus). The runtime comparison with Citrus seems quite impressive, though.
  • The output layer is a fully connected layer, meaning there is a scalar weight attached to each filter. So interpreting just the filter activations of cells may not fully reflect what the network actually learned, because those activations are also weighted in the output layer. Hypothetically, in a case where a cell activates filters 1 and 2, the activation of filter 1 may not propagate into the final prediction if the weight of filter 1 in the output layer is very low.
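The last point can be made concrete with a toy calculation. All the numbers here are made up purely to illustrate the re-weighting effect:

```python
import numpy as np

# Toy illustration: pooled filter activations are re-weighted by the
# output layer, so a strongly activated filter can still contribute
# almost nothing to the prediction. All numbers are made up.
pooled = np.array([2.0, 0.1, 0.5])    # filter 1 fires strongly...
w_out = np.array([0.01, 3.0, 1.0])    # ...but carries a tiny output weight

contributions = pooled * w_out        # per-filter contribution to the logit
logit = contributions.sum()
```

Here filter 1 has by far the largest activation, yet its contribution to the logit is the smallest of the three, so inspecting activations alone would overstate its importance.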

agitter (Collaborator, Author) commented Aug 16, 2016

@gokceneraslan Comparing an MLP with and without MIL would be a good idea. They could also create a simple baseline by having each cytometry sample be an instance and using the marker means as input to the 3 hidden units. That could be contrasted with the MIL approach with max pooling and mean pooling.
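The marker-mean baseline suggested above could look something like the following. Dimensions and weights are illustrative assumptions, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the marker-mean baseline: each cytometry sample is summarized
# by its per-marker means and fed to a tiny MLP with 3 hidden units
# (mirroring the 3 filters). All dimensions and weights are illustrative.
n_samples, n_cells, n_markers, n_hidden = 8, 1000, 35, 3

samples = rng.normal(size=(n_samples, n_cells, n_markers))
features = samples.mean(axis=1)            # (n_samples, n_markers)

W1 = rng.normal(size=(n_markers, n_hidden))
w2 = rng.normal(size=n_hidden)
hidden = np.maximum(features @ W1, 0.0)    # ReLU, 3 hidden units
probs = 1.0 / (1.0 + np.exp(-(hidden @ w2)))
```

The key difference from the MIL model is that the mean is taken over raw marker values before any learned transformation, so rare subpopulations are averaged away before the network ever sees them.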

dhimmel pushed a commit to dhimmel/deep-review that referenced this issue Nov 3, 2017
Now width and height are both specified at 13 pixels, to constrain the aspect ratio of these SVGs as square. Previously, the icons appeared squished in DOCX exports. See manubot/rootstock#40
@cgreene cgreene closed this as completed Mar 12, 2018