Significant difference in model coefficients for binomial distribution? #777
Please set

There are a few more issues with your example. E.g., with:
import pandas as pd # v1.5.2
import tabmat # v3.1.13
import glum  # v2.6.0
import statsmodels.api as sm # v0.13.5
import statsmodels.formula.api as smf
# Read data.
filename = 'train_data_binomial.csv'
train_data = pd.read_csv(filename, header = 0)
# Fit glum model.
dtypes = {
    'Year': 'category',
    'Field16952': 'category',
    'Field16995': 'float64',
    'Field17024': 'float64',
    'Field17041': 'float64',
    'Field17045': 'float64',
}
pred_cols = list(dtypes.keys())
glum_train_data = train_data.astype(dtypes)  # change 2 vars to categorical
glum_model = glum.GeneralizedLinearRegressor(family = glum.BinomialDistribution(), alpha = 0, gradient_tol=1e-7)
is_GLM = True
X = tabmat.from_pandas(df = glum_train_data[pred_cols], drop_first = is_GLM)
glum_model.fit(X, train_data['Response'], store_covariance_matrix = True)
# Fit statsmodels model.
# References are base levels for categorical features.
formula = "Response~C(Year, Treatment(reference=1.0))"
formula += "+C(Field16952, Treatment(reference=1.0))"
formula += "+Field16995+Field17024+Field17041" # all numeric here
formula += "+Field17045"
sm_fam = sm.families.Binomial()
sm_model = smf.glm(formula, train_data, family = sm_fam).fit()
# Show coef values.
glum_params = pd.concat([pd.Series(glum_model.intercept_), pd.Series(glum_model.coef_)])
sm_params = sm_model.params
glum_params.index = sm_params.index
print(pd.concat([glum_params, sm_params], axis=1))
print((glum_params-sm_params).abs().max())
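One aside on the comparison at the end: assigning `glum_params.index = sm_params.index` compares coefficients positionally, which silently misaligns terms when the two packages order them differently (as noted later in this thread). A minimal sketch of a label-based comparison, with made-up coefficient names and values:

```python
import pandas as pd

# Hypothetical coefficients for the same model from two packages,
# reported in different term orders (names here are illustrative only).
a = pd.Series({"intercept": -1.2, "Year_2": 0.3, "Field16995": 0.05})
b = pd.Series({"intercept": -1.2, "Field16995": 0.05, "Year_2": 0.3})

# Positional comparison (what overwriting the index amounts to) misaligns:
positional_diff = max(abs(a.to_numpy() - b.to_numpy()))
# pandas Series arithmetic aligns by label, so this compares like with like:
label_diff = (a - b).abs().max()
print(positional_diff, label_diff)
```

Getting glum's coefficient names to use as labels would require recovering the column names from the tabmat design matrix, which is left out of this sketch.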
@jtilly, thank you for your expeditious reply; I understand your response fully, and it has helped greatly. I believe my problem then boils down to the covariance parameters. I can open a new issue for this if necessary, as the comment that follows is only tangentially related.

I have used the code below to return to a Tweedie problem; it is your code plus supplementary functions. Here, I obtain standard errors for the prediction means for both packages. In the code, I first do a manual calculation method (derived from statsmodels). I don't believe it's absolutely essential to understand all details of the functions here; I want to place emphasis on the fact that the covariance parameters for each of the two models differ.

Edit: I have been drafting this standard error and confidence interval method for #756.

import numpy as np  # v1.26.3
import pandas as pd # v1.5.2
import tabmat # v3.1.13
import glum #v2.6.0
import statsmodels.api as sm # v0.13.5
import statsmodels.formula.api as smf
from scipy import stats # v1.12.0
from patsy import dmatrix
# Read data.
filename = 'train_data_tweedie.csv'
train_data = pd.read_csv(filename, header = 0)
# Parameters for both models.
tw_param = 1.6
yr_ref = 7
# Fit glum model.
data_year, data_response = 'Year', 'Response'
cat_dtypes = {
    data_year: 'category',
    'Field16952': 'category',
    'Field16995': 'float64',
    'Field17024': 'float64',
    'Field17041': 'float64',
    'Field17041_e4': 'category',
    'Field17045': 'float64',
}
pred_cols = list(cat_dtypes.keys())
vars_no_yr = [var for var in pred_cols if var != data_year]
glum_train_data = train_data[pred_cols].astype(cat_dtypes) # change 3 vars to categorical
family = glum.TweedieDistribution(power = tw_param)
glum_model = glum.GeneralizedLinearRegressor(family = family, alpha = 0, gradient_tol = 1e-7)
is_GLM = True
X = tabmat.from_pandas(df = glum_train_data[pred_cols], drop_first = is_GLM)
glum_model.fit(X, train_data['Response'], store_covariance_matrix = True)
# Fit statsmodels model.
# References are base levels for categorical features.
formula = "Response~C(Year, Treatment(reference=1.0))"
formula += "+C(Field16952, Treatment(reference=1.0))"
formula += "+Field16995+Field17024+Field17041+C(Field17041_e4, Treatment(reference=1.0))" # numerics plus one more categorical
formula += "+Field17045"
sm_fam = sm.families.Tweedie(var_power = tw_param)
sm_model = smf.glm(formula, train_data, family = sm_fam).fit()
# Show coef values.
glum_params = pd.concat([pd.Series(glum_model.intercept_), pd.Series(glum_model.coef_)])
sm_params = sm_model.params
glum_params.index = sm_params.index # will be off by 1, b/c statsmodels doesn't keep same order
#print(pd.concat([glum_params, sm_params], axis=1))
#print((glum_params-sm_params).abs().max())
# Do predictions.
def predict_w_CI(
    model: glum._glm.GeneralizedLinearRegressor,
    vars_no_yr: list[str],
    alpha: float = 0.05,
):
    """
    Get model predictions, errors, and confidence intervals for the given
    records.

    Parameters
    ----------
    model : glum._glm.GeneralizedLinearRegressor
        The fitted glum model.
    vars_no_yr : list
        List of variables in this model, excluding the year field.
    alpha : float, default=0.05
        Significance level for the confidence intervals.

    Returns
    -------
    dt_dedup : pandas.core.frame.DataFrame
        Predictions, errors, and confidence intervals for predictions of the
        given dataset. Duplicate records have been removed to save time
        (only looking at unique combinations of the given field values).
    """
    # Create deduplicated data.
    # Read in more data. This is the entire dataset. We want to predict on a
    # deduplicated version of this.
    data_all = pd.read_csv('input_data.csv', header = 0, dtype = cat_dtypes)
    all_vars = [data_year] + vars_no_yr
    get_cols = [data_response] + all_vars
    all_data = data_all[get_cols].astype({data_year: 'category'}).copy()
    # Drop duplicate rows. Can merge to rest of orig df later.
    yr_data, response_data = all_data[data_year], all_data[data_response]
    dt_dedup = data_all[vars_no_yr].drop_duplicates().copy()
    dedup_idx = dt_dedup.index
    # Add year and response data back after dropping duplicates.
    dt_dedup.insert(0, data_year, yr_data.loc[dedup_idx])
    dt_dedup.insert(0, data_response, response_data.loc[dedup_idx])
    dt_dedup = dt_dedup.reset_index().drop('index', axis = 1)
    # Set exogs for var pred mean calculation.
    exog, res_cols, pred_exog = get_exog(dt_dedup[all_vars])
    covb = model.covariance_matrix_  # get covariance parameters
    preds = model.predict(pred_exog)  # calculate model predictions
    predicted_mean = np.array(preds)
    # Get prediction errors and CIs for each prediction.
    res = calc_SE_CIs(exog, covb, predicted_mean, alpha)
    return pd.concat([dt_dedup, res], axis = 1)

def get_exog(
    X_in: pd.core.frame.DataFrame,
    mod_type: str = 'glum',
):
    """
    Get the exog data to be input into prediction for standard errors and
    confidence intervals.

    Parameters
    ----------
    X_in : pandas.core.frame.DataFrame
        Input dataset to create exog data from.
    mod_type : str, default='glum'
        Model type; currently unused.

    Returns
    -------
    exog : numpy.ndarray
        Input prediction dataset in array format, with ordered dummy columns
        for categorical fields.
    res_cols : list[str]
        Column names of the dataframe version of the exog array.
    pred_exog : numpy.ndarray
        Prediction dataset. Only used in predict_w_CI. Same as the exog data,
        except the base year has been set to 1 for all records.
    """
    # Set exog dataframe with dummy columns for categoricals.
    all_vars = X_in.columns.tolist()
    # Exclude engineered ('_e') fields from the ordered predictor list.
    pred_cols_no_excl = [c for c in all_vars if '_e' not in c]
    exog_df = pd.get_dummies(X_in, drop_first = is_GLM)
    # Sort columns so the dataframe is in the right order.
    dummy_cols = exog_df.columns.tolist()
    yr_cols = [c for c in dummy_cols if data_year in c]
    # Remove inforce year.
    yr_cols = yr_cols[:-1]
    non_yr = [c for c in dummy_cols if data_year not in c]
    non_yr = [[c1 for c1 in non_yr if c2 in c1] for c2 in pred_cols_no_excl]
    non_yr = [sorted(c, key = len) for c in non_yr if c != []]
    non_yr = [x for xs in non_yr for x in xs]
    res_cols = yr_cols + non_yr
    exog_df = exog_df[res_cols]
    # Set years to 0 to remove effect.
    exog_df[yr_cols] = 0
    # Set base year values.
    pred_exog_df = exog_df.copy()
    pred_exog_df[data_year + '_' + str(yr_ref)] = 1
    pred_exog = np.atleast_2d(np.asarray(pred_exog_df))
    # Create exog used for error and CIs.
    exog_df.insert(0, 'intercept', 1)
    exog = np.atleast_2d(np.asarray(exog_df))
    return exog, res_cols, pred_exog

def calc_SE_CIs(
    exog: np.ndarray,
    covb: np.ndarray,
    predicted_mean: np.ndarray,
    alpha: float,
):
    """
    Do calculations for standard errors and confidence intervals.

    Parameters
    ----------
    exog : numpy.ndarray
        Input dataset to do calculations on.
    covb : numpy.ndarray
        Model covariance parameters.
    predicted_mean : numpy.ndarray
        Predicted means for the data given in exog.
    alpha : float
        Significance level for the confidence intervals.

    Returns
    -------
    res : pandas.core.frame.DataFrame
        Results dataframe with mean predictions, standard errors, and
        confidence intervals.
    """
    # Calculate variance and standard errors of the predicted mean.
    var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)
    se = np.sqrt(var_pred_mean)  # obs?
    # Get the confidence intervals using log of predicted mean and se.
    predicted_mean = np.log(predicted_mean)  # only for log link?
    q = stats.norm.ppf(1 - alpha / 2.0)  # dist_args?
    lower = predicted_mean - q * se
    upper = predicted_mean + q * se
    ci_mean = np.column_stack((lower, upper))
    # Reset predicted mean, standard errors, CIs to exponentials.
    predicted_mean = np.exp(predicted_mean)  # only for log link?
    se_mean = np.sqrt(predicted_mean**2 * var_pred_mean)
    ci = np.exp(ci_mean)  # only for log link?
    # Create dataframe of results.
    res = pd.DataFrame()
    res['mean'] = predicted_mean
    res['mean_se'] = se_mean
    res['mean_ci_lower'] = ci[:, 0]
    res['mean_ci_upper'] = ci[:, 1]
    return res
# glum predictions
dt_dedup = predict_w_CI(glum_model, vars_no_yr)
print('\nglum values:\n', dt_dedup[['mean', 'mean_se', 'mean_ci_lower', 'mean_ci_upper']])
# statsmodels predictions
dt_dedup_sm = dt_dedup[pred_cols].copy()
dt_dedup_sm[data_year] = yr_ref # set year to reference
num_dtypes = {  # need to do categoricals as int input
    data_year: 'int64',
    'Field16952': 'int64',
    'Field16995': 'float64',
    'Field17024': 'float64',
    'Field17041': 'float64',
    'Field17041_e4': 'int64',
    'Field17045': 'float64',
}
dt_dedup_sm = dt_dedup_sm.astype(num_dtypes)
summary_frame = sm_model.get_prediction(dt_dedup_sm).summary_frame(alpha = 0.05)
print('\nstatsmodels values (summary frame fn):\n', summary_frame)
# Same as manual calculation:
exog = np.atleast_2d(np.asarray(dmatrix(sm_model.model.data.design_info, dt_dedup_sm.copy())))
covb = np.array(sm_model.cov_params())
predicted_mean = np.array(summary_frame['mean'])
res = calc_SE_CIs(exog, covb, predicted_mean, alpha = 0.05)
print('\nstatsmodels values (manual calculation):\n', res)
# Results are different because covariance parameters are different between models.
print('\nglum cov params:\n', pd.DataFrame(glum_model.covariance_matrix_))
print('\nstatsmodels cov params:\n', pd.DataFrame(covb))  # covb from statsmodels above

New data files used:
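For what it's worth, the `se_mean = np.sqrt(predicted_mean**2 * var_pred_mean)` step in `calc_SE_CIs` is the delta method for a log link: the standard error of `exp(eta)` is approximately `exp(eta)` times the standard error of `eta`. A small numerical sanity check of that approximation, using made-up values not tied to either package:

```python
import numpy as np

rng = np.random.default_rng(0)

# If eta ~ Normal(eta0, se_eta**2) with small se_eta, then the standard
# deviation of exp(eta) is approximately exp(eta0) * se_eta (delta method).
# eta0 and se_eta below are arbitrary illustrative values.
eta0, se_eta = 0.5, 0.01
draws = rng.normal(eta0, se_eta, size=1_000_000)
empirical_se = np.exp(draws).std()  # Monte Carlo standard deviation
delta_se = np.exp(eta0) * se_eta    # delta-method approximation
print(empirical_se, delta_se)       # the two should agree closely
```

The approximation degrades as `se_eta` grows, which is one reason the confidence intervals are built on the linear-predictor scale first and exponentiated afterwards.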
I haven't checked your code in detail, but it's expected that standard errors from statsmodels and glum don't quite line up; we even have tests for it (Poisson / Normal / Logit): Lines 1887 to 1894 in d65ac0d
@jtilly I see. No need to worry about the details of the code too much. I think the main part is the covariance parameters. Is it expected then that the standard errors of glum differ from those of statsmodels by a fairly consistent ratio?

import seaborn as sns
col_name = 'Ratio of glum SEs to statsmodels SEs'
ratio = pd.DataFrame(dt_dedup['mean_se'] / res['mean_se'])
ratio = ratio.rename(columns = {'mean_se': col_name})
sns.histplot(data = ratio, x = col_name)
Yes, as you can see in our test suite, we're doing a degrees-of-freedom correction and statsmodels isn't.
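For intuition, assuming the correction scales the covariance matrix by n / (n - p), with n observations and p estimated parameters (this factor is an assumption for illustration, not a statement of glum's exact internals), the two sets of standard errors would differ by a constant multiplicative factor:

```python
import numpy as np

# Made-up sizes for illustration: a scale factor of n / (n - p) on the
# covariance matrix multiplies every standard error by sqrt(n / (n - p)).
n, p = 10_000, 12
uncorrected_se = 0.05
ratio = np.sqrt(n / (n - p))
corrected_se = uncorrected_se * ratio
print(ratio)         # barely above 1 when n >> p
print(corrected_se)
```

A constant ratio like this would be consistent with the narrow histogram of SE ratios shown above.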
@jtilly right, thanks. In that case, neither method is necessarily wrong, correct? Does utilizing a degrees-of-freedom correction benefit the standard error calculation in a way that the uncorrected calculation does not?
I am closing this because the original question has been answered. For general statistics questions such as "Why would one correct for degrees of freedom?", there are excellent resources elsewhere, such as Stack Overflow.
I'm converting some GLM code from statsmodels to glum, and I want to maintain the correct outputs of the models while doing so. For a Tweedie distribution, I have noticed that the model coefficients are quite similar between the packages (at least for numeric fields):

I have moved on to a test of the same dataset with a binomially distributed response. Here, I have noticed that the coefficients determined are now quite different between the packages, for both numeric and categorical fields:

Is there an explanation for how the binomial coefficients in glum could be so drastically different from statsmodels? In this case, the glum model coefficients are far closer to 0 than the statsmodels coefficients. (This has had an effect on the standard error and confidence interval calculations I have drafted in another post.)

I have included a small replicable code example below, as well as the dataset used.
Requirements:
train_data_binomial.csv (updated)