-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Logistic Regression class implementation & utils #41
Conversation
Pull Request Test Coverage Report for Build 3931573895
💛 - Coveralls |
@gonuke This PR should be reviewed and merged first because it includes Coveralls says that coverage dropped because the SSML functionality of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct enough, but a few suggestions for maintainability.
Some good discussion points for a S/W meeting, too.
Co-authored-by: Paul Wilson <paul.wilson@wisc.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gonuke I have addressed some of the comments from our S/W review. Here are a few lingering comments, pending a re-review from you.
scripts/utils.py
Outdated
if stratified: | ||
cv = StratifiedKFold(n_splits=n_splits, random_state=random_state, | ||
shuffle=shuffle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently no test for stratified KFold, but the only difference between this and standard KFold (which is tested) is a different scikit-learn class. I could devise a second test dataset with a few extra instances of one class (which would be balanced by StratifiedKFold
) or we could ignore testing this portion. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of # pragma: no cover
from Coveralls, I am adding to this IF...ELSE... since we are not testing StratifiedKFold.
scripts/utils.py
Outdated
Inputs: | ||
Lx: labeled feature data. | ||
Ux: unlabeled feature data. | ||
n: number of singular values to include in PCA analysis. | ||
''' | ||
|
||
pcadata = np.append(Lx, Ux, axis=0) | ||
normalizer = StandardScaler() | ||
x = normalizer.fit_transform(pcadata) | ||
logging.info(np.mean(pcadata), np.std(pcadata)) | ||
logging.info(np.mean(x), np.std(x)) | ||
|
||
pca = PCA(n_components=n) | ||
pca.fit_transform(x) | ||
logging.info(pca.explained_variance_ratio_) | ||
logging.info(pca.singular_values_) | ||
logging.info(pca.components_) | ||
|
||
principalComponents = pca.fit_transform(x) | ||
|
||
return principalComponents | ||
|
||
|
||
def _plot_pca_data(principalComponents, Ly, Uy, ax): | ||
''' | ||
Helper function for plot_pca that plots data for a given axis. | ||
Inputs: | ||
principalComponents: ndarray of shape (n_samples, n_components). | ||
Ly: class labels for labeled data. | ||
Uy: labels for unlabeled data (all labels should be -1). | ||
ax: matplotlib-axis to plot on. | ||
''' | ||
|
||
# only saving colors for binary classification with unlabeled instances | ||
col_dict = {-1: 'tab:gray', 0: 'tab:orange', 1: 'tab:blue'} | ||
|
||
for idx, color in col_dict.items(): | ||
indices = np.where(np.append(Ly, Uy, axis=0) == idx)[0] | ||
ax.scatter(principalComponents[indices, 0], | ||
principalComponents[indices, 1], | ||
c=color, | ||
label='class '+str(idx)) | ||
return ax | ||
|
||
|
||
def plot_pca(principalComponents, Ly, Uy, filename, n=2): | ||
''' | ||
A function for computing and plotting n-dimensional PCA. | ||
Inputs: | ||
principalComponents: ndarray of shape (n_samples, n_components). | ||
Ly: class labels for labeled data. | ||
Uy: labels for unlabeled data (all labels should be -1). | ||
filename: filename for saved plot. | ||
The file must be saved with extension .joblib. | ||
Added to filename if not included as input. | ||
n: number of singular values to include in PCA analysis. | ||
''' | ||
|
||
plt.rcParams.update({'font.size': 20}) | ||
|
||
alph = ["A", "B", "C", "D", "E", "F", "G", "H", | ||
"I", "J", "K", "L", "M", "N", "O", "P", | ||
"Q", "R", "S", "T", "U", "V", "W", "X", | ||
"Y", "Z"] | ||
jobs = alph[:n] | ||
|
||
# only one plot is needed for n=2 | ||
if n == 2: | ||
fig, ax = plt.subplots(figsize=(10, 8)) | ||
ax.set_xlabel('PC '+jobs[0], fontsize=15) | ||
ax.set_ylabel('PC '+jobs[1], fontsize=15) | ||
ax = _plot_pca_data(principalComponents, Ly, Uy, ax) | ||
ax.grid() | ||
ax.legend() | ||
else: | ||
fig, axes = plt.subplots(n, n, figsize=(15, 15)) | ||
for row in range(axes.shape[0]): | ||
for col in range(axes.shape[1]): | ||
ax = axes[row, col] | ||
# blank label plot | ||
if row == col: | ||
ax.tick_params( | ||
axis='both', which='both', | ||
bottom='off', top='off', | ||
labelbottom='off', | ||
left='off', right='off', | ||
labelleft='off' | ||
) | ||
ax.text(0.5, 0.5, jobs[row], horizontalalignment='center') | ||
# PCA results | ||
else: | ||
ax = _plot_pca_data(principalComponents, Ly, Uy, ax) | ||
|
||
if filename[-4:] != '.png': | ||
filename += '.png' | ||
fig.tight_layout() | ||
fig.savefig(filename) | ||
|
||
|
||
def plot_cf(testy, predy, title, filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gonuke, per our conversations from S/W, do you still concur that we can ignore testing these PCA/plotting functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - but is there a way to exclude them from the denominator of the coverage testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out, Coveralls supports adding # pragma: no cover
to blocks of code, and it will ignore it. Adding to the plotting functions above.
# since the test data used here is synthetic/toy data (i.e. uninteresting), | ||
# the trained model should be at least better than a 50-50 guess | ||
# if it was worse, something would be wrong with the ML class | ||
assert acc > 0.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some clarification to explain our "good-enough" testing of trained ML models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still some questions about what the right test for parameters passed into the LogReg initializer....
models/LogReg.py
Outdated
random_state=self.random_state | ||
) | ||
else: | ||
if all(key in params.keys() for key in keys): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is the most correct/robust way to do this. The LogisticRegression
model has defaults for these parameters, so it may be OK if some are missing. You just need to make sure they exist if you want to pass them along. Right now, you only allow 0 parameters or all 3 parameters, but maybe it's OK for just 1 or 2?
One way to manage this is with the **kwargs
object that you can pass through, perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my first time using **kwargs
but I saw a recommendation to use kwargs.pop('key', default_value)
to pull option args from the input. This system should support any combination of input parameters, including ones that are not supported. I have updated the __init__
and its relevant unit test. Let me know if you have feedback!
scripts/utils.py
Outdated
Inputs: | ||
Lx: labeled feature data. | ||
Ux: unlabeled feature data. | ||
n: number of singular values to include in PCA analysis. | ||
''' | ||
|
||
pcadata = np.append(Lx, Ux, axis=0) | ||
normalizer = StandardScaler() | ||
x = normalizer.fit_transform(pcadata) | ||
logging.info(np.mean(pcadata), np.std(pcadata)) | ||
logging.info(np.mean(x), np.std(x)) | ||
|
||
pca = PCA(n_components=n) | ||
pca.fit_transform(x) | ||
logging.info(pca.explained_variance_ratio_) | ||
logging.info(pca.singular_values_) | ||
logging.info(pca.components_) | ||
|
||
principalComponents = pca.fit_transform(x) | ||
|
||
return principalComponents | ||
|
||
|
||
def _plot_pca_data(principalComponents, Ly, Uy, ax): | ||
''' | ||
Helper function for plot_pca that plots data for a given axis. | ||
Inputs: | ||
principalComponents: ndarray of shape (n_samples, n_components). | ||
Ly: class labels for labeled data. | ||
Uy: labels for unlabeled data (all labels should be -1). | ||
ax: matplotlib-axis to plot on. | ||
''' | ||
|
||
# only saving colors for binary classification with unlabeled instances | ||
col_dict = {-1: 'tab:gray', 0: 'tab:orange', 1: 'tab:blue'} | ||
|
||
for idx, color in col_dict.items(): | ||
indices = np.where(np.append(Ly, Uy, axis=0) == idx)[0] | ||
ax.scatter(principalComponents[indices, 0], | ||
principalComponents[indices, 1], | ||
c=color, | ||
label='class '+str(idx)) | ||
return ax | ||
|
||
|
||
def plot_pca(principalComponents, Ly, Uy, filename, n=2): | ||
''' | ||
A function for computing and plotting n-dimensional PCA. | ||
Inputs: | ||
principalComponents: ndarray of shape (n_samples, n_components). | ||
Ly: class labels for labeled data. | ||
Uy: labels for unlabeled data (all labels should be -1). | ||
filename: filename for saved plot. | ||
The file must be saved with extension .joblib. | ||
Added to filename if not included as input. | ||
n: number of singular values to include in PCA analysis. | ||
''' | ||
|
||
plt.rcParams.update({'font.size': 20}) | ||
|
||
alph = ["A", "B", "C", "D", "E", "F", "G", "H", | ||
"I", "J", "K", "L", "M", "N", "O", "P", | ||
"Q", "R", "S", "T", "U", "V", "W", "X", | ||
"Y", "Z"] | ||
jobs = alph[:n] | ||
|
||
# only one plot is needed for n=2 | ||
if n == 2: | ||
fig, ax = plt.subplots(figsize=(10, 8)) | ||
ax.set_xlabel('PC '+jobs[0], fontsize=15) | ||
ax.set_ylabel('PC '+jobs[1], fontsize=15) | ||
ax = _plot_pca_data(principalComponents, Ly, Uy, ax) | ||
ax.grid() | ||
ax.legend() | ||
else: | ||
fig, axes = plt.subplots(n, n, figsize=(15, 15)) | ||
for row in range(axes.shape[0]): | ||
for col in range(axes.shape[1]): | ||
ax = axes[row, col] | ||
# blank label plot | ||
if row == col: | ||
ax.tick_params( | ||
axis='both', which='both', | ||
bottom='off', top='off', | ||
labelbottom='off', | ||
left='off', right='off', | ||
labelleft='off' | ||
) | ||
ax.text(0.5, 0.5, jobs[row], horizontalalignment='center') | ||
# PCA results | ||
else: | ||
ax = _plot_pca_data(principalComponents, Ly, Uy, ax) | ||
|
||
if filename[-4:] != '.png': | ||
filename += '.png' | ||
fig.tight_layout() | ||
fig.savefig(filename) | ||
|
||
|
||
def plot_cf(testy, predy, title, filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - but is there a way to exclude them from the denominator of the coverage testing?
@gonuke I addressed your comments, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
This introduces an ML class,
LogReg
, that can be used for supervised logistic regression. This includes typical scikit-learn-esque methods like train and predict as well as methods for hyperparameter optimization and saving the model to file.