
Across all models, are certain cell painting features more explanatory than others? #64

Closed
gwaybio opened this issue Sep 13, 2019 · 5 comments
Labels: Experiments (tracking experimental questions, results, or analysis)

Comments


gwaybio commented Sep 13, 2019

Exploring model coefficients across all models, what does this distribution look like?

@gwaybio gwaybio added the Experiments Tracking experimental questions, results, or analysis label Sep 13, 2019

gwaybio commented Sep 18, 2019

This paper might be an important resource: Zahedi et al. 2018.

I haven't done the analysis listed above, but anecdotally, I have seen a bunch of mito features pop up with high weights.


gwaybio commented Sep 20, 2019

[Figure: model_coefficient_summary]

Comparing coefficients across all 70 cell health models using real and shuffled data. The model coefficient sum is much higher in the real data models. And, on average, it looks like the Mito channel is the highest compared to all other labeled channels.
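The channel comparison above could be computed along these lines. This is a minimal sketch that assumes CellProfiler-style feature names where the channel is the final underscore-delimited token; the data frame below is toy data standing in for the actual fitted model coefficients:

```python
import pandas as pd

# Hypothetical coefficient table: one row per (model, feature) pair.
# The real analysis would load coefficients from the fitted models instead.
coef_df = pd.DataFrame({
    "feature": [
        "Cells_Intensity_MeanIntensity_Mito",
        "Cells_Intensity_MeanIntensity_DNA",
        "Nuclei_Texture_Contrast_Mito",
        "Cells_Intensity_MeanIntensity_RNA",
    ],
    "coefficient": [0.8, 0.1, 0.5, -0.2],
})

# CellProfiler feature names end in the channel; split it out and sum
# absolute coefficient weight per channel across all models.
coef_df["channel"] = coef_df["feature"].str.split("_").str[-1]
channel_sum = coef_df["coefficient"].abs().groupby(coef_df["channel"]).sum()
print(channel_sum.sort_values(ascending=False))
```

Summing absolute values (rather than raw coefficients) keeps positive and negative weights from canceling when comparing channels.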

Remaining Todo

  • I imagine that this result will differ depending on the actual cell health variable. Stratify by the type of variable and re-plot.


gwaybio commented Sep 20, 2019

cc @AnneCarpenter @shntnu


gwaybio commented Sep 21, 2019

In 9a33ac0, I compare feature performances across cell lines.

[Figure: cell_line_mse_differences]

Interpretation

Not surprisingly, training with real data shows lower cell-line-specific MSE across features. These values are also relatively consistent across cell lines, although HCC44 appears to have the lowest overall MSE.

The F statistic tracks the ratio of between-group variance to within-group variance. So high values map to features with large performance differences across cell lines, while low values indicate features that are predicted consistently across cell lines. Some features are predicted well across all cell lines, and some are predicted with higher variance. If a feature is predicted poorly in HCC44, it tends to have a high F statistic. Not surprisingly, the F statistics are higher in shuffled data.
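The per-feature F statistic described above amounts to a one-way ANOVA over the per-cell-line errors. A minimal sketch with simulated MSE values (the group means, sizes, and the poor-performance assignment to HCC44 are illustrative, not the real results):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical per-sample squared errors for one feature, grouped by cell
# line. A feature predicted consistently across lines has low between-group
# variance relative to within-group variance, hence a low F statistic.
mse_a549 = rng.normal(loc=0.5, scale=0.05, size=30)
mse_es2 = rng.normal(loc=0.5, scale=0.05, size=30)
mse_hcc44 = rng.normal(loc=0.9, scale=0.05, size=30)  # predicted poorly here

f_stat, p_value = f_oneway(mse_a549, mse_es2, mse_hcc44)
print(f_stat)  # large F: performance differs across cell lines
```

Running this per feature and ranking by `f_stat` surfaces the features whose predictability is most cell-line dependent.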


gwaybio commented Nov 20, 2019

In #81, I add two additional visualizations:

Note that the axes represent the total sum of each individual coefficient across all models.

Top 50 Features

[Figure: coefficient_sum_subset]

All Features

[Figure: coefficient_sum_full]

Summary

  • Feature weights are much higher in real vs. shuffled data
  • Top features don't actually participate in many models
    • This is somewhat expected: feature selection is not applied a priori (so there are lots of redundant features here) and the regression models are elastic net
  • Not sure how to interpret coefficients with high weights!
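The elastic-net point above can be illustrated on toy data: with two nearly identical features, the L1 component lets the model satisfy the fit through either copy, so a feature that genuinely matters may still carry weight in only some models. A sketch with made-up data (feature construction and penalty settings are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)

# Two highly redundant (nearly identical) features plus one independent
# feature, mimicking unselected CellProfiler features.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + x3 + rng.normal(scale=0.1, size=n)

# The L1 penalty spreads or drops weight among redundant features, while
# the L2 component keeps the solution stable under collinearity.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```

Here the signal attributed to `x1` in the generating process ends up shared between the two redundant columns, which is one reason a "top" feature's weight can be diluted across near-duplicates.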

@gwaybio gwaybio closed this as completed Nov 30, 2020