Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for the zeros / NA discrepancy #275

Merged
merged 6 commits into from
May 22, 2024
Merged

fix for the zeros / NA discrepancy #275

merged 6 commits into from
May 22, 2024

Conversation

boopthesnoot
Copy link
Collaborator

Some of the tools' output which is used in alphapeptstats use zero meaning missing data, some use NA. Previously they were not unified so the preprocessing in _remove_na_values was wrong. Other changes are so the downstream code can work with NAs; style, and tests.

@boopthesnoot boopthesnoot requested a review from mschwoer May 17, 2024 11:19
@boopthesnoot boopthesnoot self-assigned this May 17, 2024
Copy link
Contributor

@mschwoer mschwoer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some probably dumb questions asked (I don't know this code base at all ;-))

alphastats/DataSet_Preprocess.py Outdated Show resolved Hide resolved
"""
square_sum_per_row = array.pow(2).sum(axis=1, skipna=True)

l2_norms = np.sqrt(square_sum_per_row)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't get it to work properly, I don't think it works well with NaNs

n_jobs=2,
random_state=0,
verbose=0, #  random forest takes a while print progress
imp = sklearn.ensemble.HistGradientBoostingRegressor(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the same as RandomForest ? maybe introduce this as another method ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also based on multiple trees, so I don't know what would be right. I could call it gradient boosting, but this could be less known for non-technical people, although it would be more correct

@@ -30,45 +31,58 @@ def preprocess_print_info(self):
print(pd.DataFrame(self.preprocessing_info.items()))

def _remove_na_values(self, cut_off):
if (
self.preprocessing_info.get("Missing values were removed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about this preprocessing_info, but it seems to store information in human-readable keys.
This could lead to problems (say, I access something as self.preprocessing_info.get("Missing values were removed.") (did you spot the trailing dot? ;-))
I would suggest introducing a set of string constants, e.g.
MISSING_VALUES_REMOVED = "Missing values were removed"
somewhere and access this store exclusively through them

(not now, just for the future)

alphastats/gui/pages/02_Import Data.py Outdated Show resolved Hide resolved
alphastats/DataSet_Preprocess.py Show resolved Hide resolved
@boopthesnoot boopthesnoot merged commit dc517ed into main May 22, 2024
6 checks passed
@boopthesnoot boopthesnoot deleted the na_bugfix branch May 22, 2024 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants