
Significant difference in model coefficients for binomial distribution? #777

Closed
enriquecardenas24 opened this issue Mar 5, 2024 · 8 comments

enriquecardenas24 commented Mar 5, 2024

I'm converting some GLM code from statsmodels to glum, and I want to make sure the model outputs stay the same in the process. For a Tweedie distribution, I have noticed that the model coefficients are quite similar between the two packages (at least for numeric fields):

[screenshots: glum vs. statsmodels coefficient tables for the Tweedie model]

I have moved on to a test of the same dataset with a binomially distributed response. Here, the fitted coefficients are quite different between the packages, for both numeric and categorical fields:

[screenshots: glum vs. statsmodels coefficient tables for the binomial model]

Is there an explanation for how the binomial coefficients in glum could be so drastically different from statsmodels? In this case, the glum model coefficients are far closer to 0 than the statsmodels coefficients. (This has had an effect on the standard error and confidence interval calculations I have drafted in another post.)

I have included a small replicable code example below, as well as the dataset used.

import pandas as pd  # v1.5.2
import tabmat  # v3.1.13

import glum  # v2.6.0
import statsmodels.api as sm  # v0.13.5
import statsmodels.formula.api as smf

# Read data.
filename = 'train_data_binomial.csv'
train_data = pd.read_csv(filename, header = 0)

# Fit glum model.
dtypes = {'Year': 'category',
          'Field16952': 'category',
          'Field16995': 'float64',
          'Field17024': 'float64',
          'Field17041': 'float64',
          'Field17041_e4': 'category',
          'Field17045': 'float64',
          }
pred_cols = list(dtypes)
glum_train_data = train_data.astype(dtypes) # change 3 vars to categorical
glum_model = glum.GeneralizedLinearRegressor(family = glum.BinomialDistribution(), link = 'log')
is_GLM = True
X = tabmat.from_pandas(df = glum_train_data[pred_cols], drop_first = is_GLM)
glum_model.fit(X, train_data['Response'], store_covariance_matrix = True)

# Fit statsmodels model.
# References are base levels for categorical features.
formula = 'Response~C(Year, Treatment(reference=7.0))'
formula += '+C(Field16952, Treatment(reference=21.0))'
formula += '+Field16995+Field17024+Field17041'          # all numeric here
formula += '+C(Field17041_e4, Treatment(reference=1.0))+Field17045'
sm_fam = sm.families.Binomial(sm.families.links.log())
sm_model = smf.glm(formula, train_data, family = sm_fam).fit()

# Show coef values.
glum_params = pd.concat([pd.Series(glum_model.intercept_), pd.Series(glum_model.coef_)])
sm_params = sm_model.params
glum_params.index = sm_params.index
print(glum_params)
print(sm_params)

Requirements:

glum==2.6.0
pandas==1.5.2
statsmodels==0.13.5
tabmat==3.1.13

train_data_binomial.csv (updated)

jtilly (Member) commented Mar 5, 2024

Please set alpha=0 to fit an unregularized GLM with glum. The default is alpha=1. Also, see the discussion in #730.
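For reference, a minimal sketch of the change (everything else as in your snippet; only the alpha argument is new):

glum_model = glum.GeneralizedLinearRegressor(
    family = glum.BinomialDistribution(),
    link = 'log',
    alpha = 0,  # the default alpha=1 applies a penalty that shrinks coefficients toward zero
)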

jtilly (Member) commented Mar 5, 2024

There are a few more issues with your example:

  • You need to align the base levels if you want to compare coefficients one-by-one. glum drops the first category.
  • You're using a log link in combination with a binomial distribution. Are you doing that on purpose? The more conventional choice is a logit link; binomial models with a log link can easily run into convergence issues.
  • If you want to match results one-for-one, you can tighten the gradient tolerance.

E.g., with gradient_tol=1e-7, I get

                                                  glum         sm
Intercept                                     -4.036252 -4.036252
C(Year, Treatment(reference=1.0))[T.2]         0.200931  0.200931
C(Year, Treatment(reference=1.0))[T.3]         1.965884  1.965884
C(Year, Treatment(reference=1.0))[T.4]         2.393245  2.393245
C(Year, Treatment(reference=1.0))[T.5]         2.670647  2.670647
C(Year, Treatment(reference=1.0))[T.6]         2.663939  2.663939
C(Year, Treatment(reference=1.0))[T.7]         2.540882  2.540882
C(Year, Treatment(reference=1.0))[T.8]         2.447370  2.447370
C(Field16952, Treatment(reference=1.0))[T.2]  -0.223762 -0.223762
C(Field16952, Treatment(reference=1.0))[T.3]   0.376726  0.376726
C(Field16952, Treatment(reference=1.0))[T.4]   0.783549  0.783549
C(Field16952, Treatment(reference=1.0))[T.5]   0.302100  0.302100
C(Field16952, Treatment(reference=1.0))[T.6]   0.651024  0.651024
C(Field16952, Treatment(reference=1.0))[T.7]  -0.351390 -0.351390
C(Field16952, Treatment(reference=1.0))[T.8]   1.067860  1.067860
C(Field16952, Treatment(reference=1.0))[T.9]   0.148517  0.148517
C(Field16952, Treatment(reference=1.0))[T.10] -0.566334 -0.566334
C(Field16952, Treatment(reference=1.0))[T.11]  0.239491  0.239491
C(Field16952, Treatment(reference=1.0))[T.12]  0.401606  0.401606
C(Field16952, Treatment(reference=1.0))[T.13]  0.844306  0.844306
C(Field16952, Treatment(reference=1.0))[T.14]  0.733198  0.733198
C(Field16952, Treatment(reference=1.0))[T.15]  1.005536  1.005536
C(Field16952, Treatment(reference=1.0))[T.16]  0.450197  0.450197
C(Field16952, Treatment(reference=1.0))[T.17]  0.381430  0.381430
C(Field16952, Treatment(reference=1.0))[T.18]  0.565800  0.565800
C(Field16952, Treatment(reference=1.0))[T.19] -0.980566 -0.980566
C(Field16952, Treatment(reference=1.0))[T.20]  0.465809  0.465809
C(Field16952, Treatment(reference=1.0))[T.21]  0.530756  0.530756
Field16995                                    -0.188947 -0.188947
Field17024                                    -0.105579 -0.105579
Field17041                                    -0.102587 -0.102587
Field17045                                    -0.005137 -0.005137
import pandas as pd  # v1.5.2
import tabmat  # v3.1.13

import glum  # v2.6.0
import statsmodels.api as sm  # v0.13.5
import statsmodels.formula.api as smf

# Read data.
filename = 'train_data_binomial.csv'
train_data = pd.read_csv(filename, header = 0)

# Fit glum model.
dtypes = {'Year': 'category',
          'Field16952': 'category',
          'Field16995': 'float64',
          'Field17024': 'float64',
          'Field17041': 'float64',
          'Field17045': 'float64',
          }
pred_cols = list(dtypes)
glum_train_data = train_data.astype(dtypes) # cast Year and Field16952 to categorical
glum_model = glum.GeneralizedLinearRegressor(family = glum.BinomialDistribution(), alpha = 0, gradient_tol=1e-7)
is_GLM = True
X = tabmat.from_pandas(df = glum_train_data[pred_cols], drop_first = is_GLM)
glum_model.fit(X, train_data['Response'], store_covariance_matrix = True)

# Fit statsmodels model.
# References are base levels for categorical features.
formula = "Response~C(Year, Treatment(reference=1.0))"
formula += "+C(Field16952, Treatment(reference=1.0))"
formula += "+Field16995+Field17024+Field17041"  # all numeric here
formula += "+Field17045"
sm_fam = sm.families.Binomial()
sm_model = smf.glm(formula, train_data, family = sm_fam).fit()

# Show coef values.
glum_params = pd.concat([pd.Series(glum_model.intercept_), pd.Series(glum_model.coef_)])
sm_params = sm_model.params
glum_params.index = sm_params.index
print(pd.concat([glum_params, sm_params], axis=1))

print((glum_params-sm_params).abs().max())

enriquecardenas24 (Author) commented Mar 6, 2024

@jtilly, thank you for your expeditious reply; I understand your response fully and it has helped greatly.

I believe my problem then boils down to the covariance parameters. I can open a new issue for this if necessary, as the comment that follows is only tangentially related. The code below returns to a Tweedie problem; it is your code plus supplementary functions. Here, I obtain standard errors for the prediction means for both glum and statsmodels. These calculations involve the models' covariance parameters.

In the code, I first do a manual calculation (derived from statsmodels) with the glum results. Then, I do the built-in calculation for statsmodels, using the summary_frame() method. Lastly, I do the manual method with statsmodels to prove that my manual calculation is correct. Please notice how the standard errors differ for glum, but are identical for statsmodels regardless of which method is used (built-in or manual).

I don't believe it's essential to understand all details of the functions here; the point of emphasis is that the covariance parameters of glum and statsmodels differ, causing the standard errors to come out larger in glum. (The covariance parameters are the only difference between the calculations.) Is there a reason the covariance parameters are different for a glum model?

Edit: I have been drafting this standard error and confidence interval method for #756.

Details
import numpy as np  # v1.26.3
import pandas as pd  # v1.5.2
import tabmat  # v3.1.13

import glum  # v2.6.0
import statsmodels.api as sm  # v0.13.5
import statsmodels.formula.api as smf
from scipy import stats  # v1.12.0
from patsy import dmatrix

# Read data.
filename = 'train_data_tweedie.csv'
train_data = pd.read_csv(filename, header = 0)

# Parameters for both models.
tw_param = 1.6
yr_ref = 7

# Fit glum model.
data_year, data_response = 'Year', 'Response'
cat_dtypes = {data_year: 'category',
              'Field16952': 'category',
              'Field16995': 'float64',
              'Field17024': 'float64',
              'Field17041': 'float64',
              'Field17041_e4': 'category',
              'Field17045': 'float64',
              }
pred_cols = list(cat_dtypes)
vars_no_yr = [var for var in pred_cols if var != data_year]
glum_train_data = train_data[pred_cols].astype(cat_dtypes) # change 3 vars to categorical
family = glum.TweedieDistribution(power = tw_param)
glum_model = glum.GeneralizedLinearRegressor(family = family, alpha = 0, gradient_tol = 1e-7)
is_GLM = True
X = tabmat.from_pandas(df = glum_train_data[pred_cols], drop_first = is_GLM)
glum_model.fit(X, train_data['Response'], store_covariance_matrix = True)

# Fit statsmodels model.
# References are base levels for categorical features.
formula = "Response~C(Year, Treatment(reference=1.0))"
formula += "+C(Field16952, Treatment(reference=1.0))"
formula += "+Field16995+Field17024+Field17041+C(Field17041_e4, Treatment(reference=1.0))"  # all numeric here
formula += "+Field17045"
sm_fam = sm.families.Tweedie(var_power = tw_param)
sm_model = smf.glm(formula, train_data, family = sm_fam).fit()

# Show coef values.
glum_params = pd.concat([pd.Series(glum_model.intercept_), pd.Series(glum_model.coef_)])
sm_params = sm_model.params
glum_params.index = sm_params.index # will be off by 1, b/c statsmodels doesn't keep the same order
#print(pd.concat([glum_params, sm_params], axis=1))
#print((glum_params-sm_params).abs().max())

# Do predictions.
def predict_w_CI(
        model: glum._glm.GeneralizedLinearRegressor,
        vars_no_yr: list[str],
        alpha: float = 0.05,
):
    """
    Get model predictions, errors, and confidence intervals for the given
    records.
    
    Parameters
    ----------
    model : glum._glm.GeneralizedLinearRegressor
        The fitted glum model.
    vars_no_yr : list
        List of variables in this model, excluding the year field.
    alpha : float, default=0.05
        Significance level for the prediction confidence intervals.
    
    Returns
    -------
    dt_dedup : pandas.core.frame.DataFrame
        Predictions, errors, and confidence intervals for predictions of the
        given dataset. Duplicate records have been removed to save time
        (only looking at unique combinations of the given field values).
    
    """

    # Create deduplicated data.
    # Read in more data. This is the entire dataset. We want to predict on a
    # deduplicated version of this.
    data_all = pd.read_csv('input_data.csv', header = 0, dtype = cat_dtypes)
    all_vars = [data_year] + vars_no_yr
    get_cols = [data_response] + all_vars
    all_data = data_all[get_cols].astype({data_year: 'category'}).copy()
    # Drop duplicate rows. Can merge to rest of orig df later.
    yr_data, response_data = all_data[data_year], all_data[data_response]
    dt_dedup = data_all[vars_no_yr].drop_duplicates().copy()
    dedup_idx = dt_dedup.index
    # Add year and response data back after dropping duplicates.
    dt_dedup.insert(0, data_year, yr_data.loc[dedup_idx])
    dt_dedup.insert(0, data_response, response_data.loc[dedup_idx])
    dt_dedup = dt_dedup.reset_index(drop = True)
    
    # Set exogs for var pred mean calculation.
    exog, res_cols, pred_exog = get_exog(dt_dedup[all_vars])
    covb = model.covariance_matrix_     # get covariance parameters
    preds = model.predict(pred_exog)    # calculate model predictions
    predicted_mean = np.array(preds)
    
    # Get prediction errors and CIs for each prediction.
    res = calc_SE_CIs(exog, covb, predicted_mean, alpha)
    
    return pd.concat([dt_dedup, res], axis = 1)


def get_exog(
        X_in: pd.core.frame.DataFrame,
        mod_type: str = 'glum',
):
    """
    Get the exog data to be input into prediction for standard errors and
    confidence intervals.
    
    Parameters
    ----------
    X_in : pandas.core.frame.DataFrame
        Input dataset to create exog data from.
    
    Returns
    -------
    exog : numpy.ndarray
        Input prediction dataset in array format, with ordered dummy columns
        for categorical fields.
    res_cols : list[str]
        Column names of the dataframe version of the exog array.
    pred_exog : numpy.ndarray
        Prediction dataset. Only used in pred_w_CIs. Same as exog data, except
        the base year has been set to 1 for all records.
    
    """
    
    # Set exog dataframe with dummy columns for categoricals.
    all_vars = X_in.columns.tolist()
    pred_cols_no_excl = [c for c in all_vars if '_e' not in c]  # drop exclusion ('_e') fields
    exog_df = pd.get_dummies(X_in, drop_first = is_GLM)
    
    # Sort columns so the dataframe is in the right order.
    dummy_cols = exog_df.columns.tolist()
    yr_cols = [c for c in dummy_cols if data_year in c]
    # Remove inforce year.
    yr_cols = yr_cols[:-1]
    non_yr = [c for c in dummy_cols if data_year not in c]
    non_yr = [[c1 for c1 in non_yr if c2 in c1] for c2 in pred_cols_no_excl]
    non_yr = [sorted(c, key = len) for c in non_yr if c != []]
    non_yr = [x for xs in non_yr for x in xs]
    res_cols = yr_cols + non_yr
    exog_df = exog_df[res_cols]
    
    # Set years to 0 to remove effect.
    exog_df[yr_cols] = 0
    # Set base year values.
    pred_exog_df = exog_df.copy()
    pred_exog_df[data_year + '_' + str(yr_ref)] = 1
    pred_exog = np.atleast_2d(np.asarray(pred_exog_df))
    # Create exog used for error and CIs.
    exog_df.insert(0, 'intercept', 1)
    exog = np.atleast_2d(np.asarray(exog_df))
    
    return exog, res_cols, pred_exog


def calc_SE_CIs(
        exog: np.ndarray,
        covb: np.ndarray,
        predicted_mean: np.ndarray,
        alpha: float,
):
    """
    Do calculations for standard errors and confidence intervals.
    
    Parameters
    ----------
    exog : numpy.ndarray
        Input dataset to do calculations on.
    covb : numpy.ndarray
        Covariance matrix of the model's coefficient estimates.
    predicted_mean : numpy.ndarray
        Predicted means for the data given in exog.
    alpha : float
        Significance level (1 minus the confidence level).
    
    Returns
    -------
    res : pandas.core.frame.DataFrame
        Results dataframe with mean predictions, standard errors, and
        confidence intervals.
    
    """
    
    # Calculate variance and standard errors of the predicted mean.
    var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)
    se = np.sqrt(var_pred_mean)                 # obs?
    # Get the confidence intervals using log of predicted mean and se.
    predicted_mean = np.log(predicted_mean)     # only for log link?
    q = stats.norm.ppf(1 - alpha / 2.0)   # dist_args?
    lower = predicted_mean - q * se
    upper = predicted_mean + q * se
    ci_mean = np.column_stack((lower, upper))
    # Reset predicted mean, standard errors, CIs to exponentials.
    predicted_mean = np.exp(predicted_mean)     # only for log link?
    se_mean = np.sqrt(predicted_mean**2 * var_pred_mean)
    ci = np.exp(ci_mean)                        # only for log link?
    # Create dataframe of results.
    res = pd.DataFrame()
    res['mean'] = predicted_mean
    res['mean_se'] = se_mean
    res['mean_ci_lower'] = ci[:, 0]
    res['mean_ci_upper'] = ci[:, 1]
    
    return res


# glum predictions
dt_dedup = predict_w_CI(glum_model, vars_no_yr)
print('\nglum values:\n', dt_dedup[['mean', 'mean_se', 'mean_ci_lower', 'mean_ci_upper']])

# statsmodels predictions
dt_dedup_sm = dt_dedup[pred_cols].copy()
dt_dedup_sm[data_year] = yr_ref  # set year to reference
num_dtypes = {data_year: 'int64',   # need to do categoricals as int input
              'Field16952': 'int64',
              'Field16995': 'float64',
              'Field17024': 'float64',
              'Field17041': 'float64',
              'Field17041_e4': 'int64',
              'Field17045': 'float64',
              }
dt_dedup_sm = dt_dedup_sm.astype(num_dtypes)
summary_frame = sm_model.get_prediction(dt_dedup_sm).summary_frame(alpha = 0.05)
print('\nstatsmodels values (summary frame fn):\n', summary_frame)
# Same as manual calculation:
exog = np.atleast_2d(np.asarray(dmatrix(sm_model.model.data.design_info, dt_dedup_sm.copy())))
covb = np.array(sm_model.cov_params())
predicted_mean = np.array(summary_frame['mean'])
res = calc_SE_CIs(exog, covb, predicted_mean, alpha = 0.05)
print('\nstatsmodels values (manual calculation):\n', res)

# Results are different because covariance parameters are different between models.
print('\nglum cov params:\n', pd.DataFrame(glum_model.covariance_matrix_))
print('\nstatsmodels cov params:\n', pd.DataFrame(covb))   # covb from statsmodels above

New data files used:
train_data_tweedie.csv
input_data.csv

jtilly (Member) commented Mar 6, 2024

I haven't checked your code in detail, but it's expected that the standard errors from statsmodels and glum don't quite line up; we even have tests for it:

Poisson / Normal / Logit:

glum/tests/glm/test_glm.py, lines 1887 to 1894 at d65ac0d:

# Here, statsmodels does not do a degree of freedom adjustment,
# so we manually add it for nonrobust and robust errors
# nonrobust
ourse = mdl.std_errors(X=X, y=y, dispersion=dispersion, robust=False)
corr = len(y) / (len(y) - X.shape[1] - int(fit_intercept))
smse = mdl_sm.fit(cov_type="nonrobust").bse * np.sqrt(corr)
np.testing.assert_allclose(ourse, smse, rtol=1e-4)
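Applied to the models in this thread, a rough sketch of the same adjustment (assuming train_data, X, sm_model, and glum_model from the earlier code are still in scope):

import numpy as np
n = len(train_data)
p = X.shape[1]                      # predictors in X; the intercept is fit separately
corr = n / (n - p - 1)              # the -1 accounts for the intercept, as in the test above
glum_se = np.sqrt(np.diag(glum_model.covariance_matrix_))
sm_se_adjusted = sm_model.bse.to_numpy() * np.sqrt(corr)
# After the correction, the two sets of standard errors should be close,
# up to the coefficient-ordering differences noted earlier in the thread.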

enriquecardenas24 (Author) commented

@jtilly I see. No need to worry too much about the code details. I think the main part is the calc_SE_CIs() function, where I use the dataset and the covariance parameters (model.covariance_matrix_) to calculate the variance and standard errors of the prediction means.
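For reference, the quantity computed there is the delta-method variance of the linear predictor: for each prediction row x, the variance is the quadratic form x @ covb @ x. A minimal single-row sketch (using the exog and covb names from the code above):

x = exog[0]                     # one prediction row, intercept column included
var_single = x @ covb @ x       # quadratic form x' Cov(beta_hat) x
se_single = var_single ** 0.5   # matches se[0] inside calc_SE_CIs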

Is it expected then that the standard errors of glum are typically greater than the statsmodels standard errors? I can see this is largely the case after plotting the ratio at the end of the previous code:

import seaborn as sns
col_name = 'Ratio of glum SEs to statsmodels SEs'
ratio = pd.DataFrame(dt_dedup['mean_se'] / res['mean_se'])
ratio = ratio.rename(columns = {'mean_se': col_name})
sns.histplot(data = ratio, x = col_name)

[screenshot: histogram of the ratio of glum SEs to statsmodels SEs, mostly above 1]

jtilly (Member) commented Mar 7, 2024

Is it expected then that the standard errors of glum are typically greater than the statsmodels standard errors?

Yes, as you can see in our test suite, we're doing a degrees-of-freedom correction and statsmodels isn't.
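To get a feel for the magnitude, a quick back-of-the-envelope sketch (n and p are illustrative values, not taken from your dataset):

n, p = 1000, 32                   # illustrative: observations and fitted parameters
inflation = (n / (n - p)) ** 0.5  # glum's correction factor relative to statsmodels
print(inflation)                  # ~1.016, i.e. glum SEs about 1.6% larger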

enriquecardenas24 (Author) commented

@jtilly right, thanks. In that case, neither method is necessarily wrong, correct? And does the degrees-of-freedom correction make the standard error calculation more accurate in a way that the statsmodels method is not?

MatthiasSchmidtblaicherQC (Contributor) commented Mar 8, 2024

I am closing this because the original question has been answered. For general statistics questions such as "Why would one correct for degrees of freedom?", there are excellent resources elsewhere, such as Stack Overflow.
