This repository has been archived by the owner on Mar 29, 2022. It is now read-only.

Feature/calculateauc #35

Merged
15 commits merged into development from feature/calculateauc on Dec 31, 2021

Conversation

@antschum (Collaborator) commented Dec 14, 2021

Description of changes
Implementation to calculate the AUC with respect to CellType along with test files.

Still missing right now:

  • A nicer results output (the current version returns a huge dictionary with everything).
  • Feature importances (still not sure 1. how best to return them, e.g. whether to take the average if multiple importances are captured, and 2. how to get feature importances for logistic regression -> Issue).
  • A progress bar for the calculations.
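For the first open point (multiple captured importances), one option would be to summarize per feature across CV folds. A minimal sketch, assuming importances come back as one NumPy array per fold (the variable names here are hypothetical, not the package's API):

```python
import numpy as np

# hypothetical: one importance vector per CV fold, aligned to the same genes
fold_importances = [
    np.array([0.2, 0.5, 0.3]),
    np.array([0.4, 0.3, 0.3]),
]

# average (and spread) across folds gives one summary value per feature
mean_importance = np.mean(fold_importances, axis=0)
std_importance = np.std(fold_importances, axis=0)
```

Reporting the standard deviation alongside the mean keeps the fold-to-fold variability visible instead of hiding it behind a single number.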

@antschum antschum changed the base branch from main to development December 14, 2021 16:39
@github-actions (bot)

Hi @antschum,

It looks like this pull request has been made against the theislab/augurpy master or main branch.
The master/main branch should always contain code from the latest release.
Because of this, PRs to master/main are only allowed if they come from any theislab/augurpy release or patch branch.

You do not need to close this PR, you can change the target branch to development by clicking the "Edit" button at the top of this page.

Thanks again for your contribution!

@github-actions (bot) added the "enhancement" (New feature or request) label Dec 14, 2021
@Zethson (Member) left a comment


Cool to see it coming together!

I'd suggest that you rename the Python files to ensure that they are descriptive.
calculate_AUC.py for example is not the nicest of all names.

Besides that, just many tiny requests from me. Happy to see this merged soon! Then we'll also need to ensure that we properly document all of this. See #40

augurpy/calculate_AUC.py: 6 review threads (outdated, resolved)
augurpy/cv.py: 4 review threads (outdated, resolved)
@Zethson (Member) commented Dec 24, 2021

feature importances (still not sure 1. how to best return them - eg. do I take the average if multiple importances are captured and 2. feature importances for logistic regression -> Issue.)

Mhmm. Didn't we want to look at how Augur calculates them? Did you figure something out?

antschum and others added 2 commits December 25, 2021 09:15
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
@antschum (Collaborator, Author) commented Dec 26, 2021

feature importances (still not sure 1. how to best return them - eg. do I take the average if multiple importances are captured and 2. feature importances for logistic regression -> Issue.)

Mhmm. Didn't we want to look at how Augur calculates them? Did you figure something out?

Augur just returns a huge matrix containing the raw data, like so:
columns: cell type, subsample idx, fold
dimensions: cell types x folds x n_subsamples x n genes sampled
I'll implement this for starters, but am not super happy with it. It would be cool to brainstorm on why this might be interesting to the user to begin with.
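Stored long-format, that matrix could look roughly like the following. This is only a sketch with made-up values; the column names and the pandas representation are assumptions, not what Augur or augurpy actually uses:

```python
import pandas as pd

# long format: one row per (cell type, subsample, fold, gene) combination
feature_importances = pd.DataFrame({
    "cell_type":     ["B cell", "B cell", "T cell", "T cell"],
    "subsample_idx": [0, 0, 0, 0],
    "fold":          [1, 2, 1, 2],
    "gene":          ["CD19", "CD19", "CD3E", "CD3E"],
    "importance":    [0.41, 0.38, 0.52, 0.49],
})

# one summary number per cell type and gene: average over folds/subsamples
summary = (feature_importances
           .groupby(["cell_type", "gene"])["importance"]
           .mean())
```

A long-format frame like this would let the user aggregate however they like (per gene, per cell type, per fold) instead of receiving one fixed summary.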

@Zethson (Member) commented Dec 26, 2021

I guess that this should help determine which genes contributed the most to the cell type prioritization?

@Zethson (Member) commented Dec 28, 2021

@antschum Laptop working again? ^_^
Should I have another look?

@antschum (Collaborator, Author) commented Dec 28, 2021

@antschum Laptop working again? ^_^

Hey :) Using my brother's; I should have a working one by Thursday though.

Should I have another look?

That would be great! I'll click on request.
Still missing: feature importances for logistic regression (started reading; I think it sounds more complicated than it is, but I don't fully understand it yet, and there is not a lot online) and the concordance correlation coefficient for regression models, which I just have to implement.
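For the concordance correlation coefficient, a minimal sketch of Lin's formula (the helper name is hypothetical, and this uses the sample-statistics variant; the eventual implementation may differ):

```python
import numpy as np

def concordance_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (sample-statistics variant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    covariance = np.cov(y_true, y_pred)[0, 1]
    mean_diff = y_true.mean() - y_pred.mean()
    return 2 * covariance / (y_true.var(ddof=1) + y_pred.var(ddof=1) + mean_diff ** 2)
```

Unlike plain Pearson correlation, this penalizes both location and scale shifts between predictions and ground truth, which is why it is used as the regression analogue of the AUC here.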

@Zethson (Member) commented Dec 28, 2021

@antschum think you missed two comments of mine. (You can scroll through the changed files.) But besides this it looks great! Feel free to merge it after you've addressed my comments.

Regarding the feature importances:

# logistic regression for feature importance
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = LogisticRegression()
model.fit(X, y)
# get importance: one coefficient per feature
importance = model.coef_[0]
# summarize feature importance
for i, v in enumerate(importance):
    print(f"Feature: {i}, Score: {v:.5f}")
# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()

We can just make use of something like this, no?

Also, we will need to implement some plots for the feature importances. Would you mind creating an issue for this?

@antschum (Collaborator, Author)

@antschum think you missed two comments of mine. (You can scroll through the changed files.) But besides this it looks great! Feel free to merge it after you've addressed my comments.

Ah! I'm sorry, I totally missed those. I changed the variable name and replied to the other comment.

Regarding the feature importances:

# logistic regression for feature importance
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = LogisticRegression()
model.fit(X, y)
# get importance: one coefficient per feature
importance = model.coef_[0]
# summarize feature importance
for i, v in enumerate(importance):
    print(f"Feature: {i}, Score: {v:.5f}")
# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()

We can just make use of something like this, no?

So would you suggest just using the coefficient values as feature importances for the logistic regression? In the original they referenced the Agresti method here. I have not found much on this, although in the blog post it seems they are just subtracting the mean from each coefficient and then dividing by the standard deviation, right?

Also, we will need to implement some plots for the feature importances. Would you mind creating an issue for this?

Done! #42

@Zethson (Member) commented Dec 31, 2021

Ah, wasn't aware of this. Yes, I read the blog the same way.
Basically, this part:

An alternative comparison of effects of quantitative predictors having different units uses standardized coefficients. The model is fitted to standardized predictors, replacing each xj by (xj−x¯j)/sxj. A 1-unit change in the standardized predictor is a standard deviation change in the original predictor. Then, each regression coefficient represents the effect of a standard deviation change in a predictor, controlling for the other variables. The standardized estimate for predictor xj is the unstandardized estimate β^j multiplied by sxj.
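In sklearn terms, the standardized estimate described in that passage (β̂ⱼ multiplied by the predictor's standard deviation) could be sketched like this, reusing the example dataset from above. This is only a sketch of the textbook formula, not necessarily how Augur computes it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)
model = LogisticRegression().fit(X, y)

# standardized estimate: unstandardized coefficient times the predictor's std dev,
# so coefficients become comparable across features with different units/scales
standardized_coef = model.coef_[0] * X.std(axis=0)
```

Equivalently, one could z-score the predictors before fitting; the resulting coefficients are the standardized estimates directly.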

Feel free to merge it now :)
I'd suggest to implement the other things in follow up PRs.

@antschum (Collaborator, Author)

Ah, wasn't aware of this. Yes, I read the blog the same way. Basically, this part:

An alternative comparison of effects of quantitative predictors having different units uses standardized coefficients. The model is fitted to standardized predictors, replacing each xj by (xj−x¯j)/sxj. A 1-unit change in the standardized predictor is a standard deviation change in the original predictor. Then, each regression coefficient represents the effect of a standard deviation change in a predictor, controlling for the other variables. The standardized estimate for predictor xj is the unstandardized estimate β^j multiplied by sxj.

Feel free to merge it now :) I'd suggest to implement the other things in follow up PRs.

Ah, I just read this and had already pushed; maybe you could take a quick look? I added some rough tests for the concordance correlation coefficient, so it should be more or less okay. And then we merge? My bad.
Doing the other things in follow-up PRs :)

@antschum (Collaborator, Author)

Forgot to run nox. 🤦‍♀️

@Zethson (Member) left a comment


LGTM!

tests/test_cv.py: 1 review thread (outdated, resolved)
@antschum antschum merged commit 09abf78 into development Dec 31, 2021
@Zethson Zethson deleted the feature/calculateauc branch February 21, 2022 11:13