
Going through tutorial #7

Closed
Spaak opened this issue Dec 7, 2023 · 2 comments

Comments

@Spaak

Spaak commented Dec 7, 2023

(As mentioned before, I'm a reviewer for JOSS, see openjournals/joss-reviews#6037)

I'm going through the tutorial here https://jeancmaia.github.io/posts/tutorial-mcglm/tutorial_mcglm.html and wanted to highlight a few things.

  • In the smoking data example, with the simple linear regression, modelresults.pearson_residuals is all NaNs. In the step before that, you emphasize checking these residuals, so a natural thing would be to do that here as well. Likely relevant warning:
```
/opt/homebrew/Caskroom/miniconda/base/envs/mcglm3/lib/python3.9/site-packages/mcglm/mcglmcattr.py:719: RuntimeWarning: invalid value encountered in sqrt
  sqrt_mu_power = np.sqrt(mu_power)
/opt/homebrew/Caskroom/miniconda/base/envs/mcglm3/lib/python3.9/site-packages/mcglm/mcglmcattr.py:724: RuntimeWarning: invalid value encountered in log
  n=n_len, values=((mu_power * np.log(mu)) / (2 * sqrt_mu_power))
```
  • Is the pAIC metric always the one to look at when comparing different models? (A full treatment of different goodness-of-fit measures is beyond the scope, of course, but I'd recommend at least briefly describing and reflecting on pAIC if that's the one you use.) (And what's the difference between AIC, which I know, and pAIC?)
  • In the smoking data example, it would help to explain what changes when moving to the Tweedie variance model (which you specify in the code) and how that relates to Poisson (which you describe in the text). Also, what changes again when moving to poisson_tweedie?
  • Some more details would be helpful on the construction of the Z matrices in the autoregressive example. What are these, and what is mc_ma doing for us exactly? (The documentation of mc_ma is not sufficient; also the naming could be a bit more informative.)
  • In the last example (soya data), it is not clear how the multivariate model (grain, seeds, and viablePeas as dependent variables) is different from separate univariate models for each of the DVs. I think the primary value in MCGLM models (over existing generalized LM and mixed model approaches) is in properly handling multivariate responses. Therefore, this would be important to highlight also in the tutorial.
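The NaN residuals flagged in the first bullet follow from how numpy handles invalid arguments: `np.sqrt` of a negative value emits a `RuntimeWarning` and returns NaN, which then propagates into every quantity derived from it. A minimal reproduction with hypothetical values (not the tutorial's actual data):

```python
import warnings
import numpy as np

# One negative entry, e.g. produced by a faulty variance evaluation.
mu_power = np.array([4.0, -1.0, 9.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # np.sqrt of the negative entry warns and yields NaN, which then
    # contaminates every downstream residual computation.
    sqrt_mu_power = np.sqrt(mu_power)

print(sqrt_mu_power)  # [ 2. nan  3.]
print(caught[0].category.__name__)  # RuntimeWarning
```

If warnings are suppressed (as the maintainer notes below), this failure mode is easy to miss until the residuals come out all-NaN.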
@jeancmaia
Owner

jeancmaia commented Dec 8, 2023

Hey, Spaak. Thanks for the comprehensive feedback. I'll run through each topic in the order you listed.

  1. Nice catch! Because warnings were being suppressed, I hadn't noticed this. A small bug affected the variance calculation for $\mu$ when the variance function is constant. The new library version 0.2.3 includes the fix.

  2. Great question! Information criterion (IC) metrics, such as Akaike and Bayesian, are commonly used for model comparison. They rely on the likelihood function, which is not fully available for MCGLM because the model only makes second-moment assumptions. Therefore, MCGLM derives goodness-of-fit-like metrics dubbed pseudo-Akaike (pAIC) and pseudo-Bayesian (pBIC) information criteria, built on the Gaussian pseudo log-likelihood function proposed by Carey and Wang. Choosing between pAIC and pBIC depends on the problem and its context, as they differ in their penalty terms.

  3. The Tweedie distribution is a flexible family that mimics several exponential-family distributions depending on its power parameter; when the power is 1, it behaves like a Poisson model (https://en.wikipedia.org/wiki/Tweedie_distribution). Tweedie and Poisson-Tweedie have different variance functions and behave distinctly; the JOSS article elaborates on this.

  4. Z matrices play a role similar to the working correlation matrix in Generalized Estimating Equations: they specify the covariance structure within an outcome variable. The original paper elaborates on how many canonical statistical models, such as moving-average and mixed models, can be recovered from an adequate specification of the Z matrices. The auxiliary methods mc_ma and mc_mixed help produce those matrices, but one can also supply custom matrices directly. I have polished the mc_ma docstring.

  5. The key feature of the multivariate case is the estimation of correlation coefficients among the outcomes, reported in the last section of the summary: the rho parameters. Separate univariate fits cannot capture this cross-response dependence.
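To make point 4 concrete, here is an illustrative numpy sketch of the idea behind the Z matrices; this is an assumption-laden toy construction, not the exact matrices that mc_ma emits. For one outcome observed at n equally spaced positions, a moving-average(1)-style dispersion structure can be written as a linear combination of known matrices:

```python
import numpy as np

# Toy illustration (not the library's exact mc_ma output):
# Omega = tau0 * Z0 + tau1 * Z1, where Z0 (identity) carries the
# baseline dispersion, Z1 flags first-order neighbours, and the
# tau coefficients are the dispersion parameters the MCGLM fit estimates.
n = 5
Z0 = np.eye(n)
Z1 = np.eye(n, k=1) + np.eye(n, k=-1)  # ones on the first off-diagonals

tau0, tau1 = 1.0, 0.4  # hypothetical dispersion estimates
Omega = tau0 * Z0 + tau1 * Z1
print(Omega[0])
```

Swapping in different Z matrices (e.g. block-diagonal group indicators) yields other covariance structures, which is how the framework reaches mixed-model-like specifications.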

We could enhance the JOSS article with some of this discussion. Thank you!
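For reference, the variance-function distinction from point 3 can be sketched as follows. These are standard textbook formulas for the two families, not code from the mcglm package:

```python
import numpy as np

# Tweedie family: V(mu) = mu**p, recovering Gaussian (p=0), Poisson (p=1),
# Gamma (p=2), and inverse-Gaussian (p=3) variance functions.
def tweedie_variance(mu, p):
    return mu ** p

# Poisson-Tweedie family: V(mu) = mu + phi * mu**p, so even at the same
# power parameter the two models disperse differently.
def poisson_tweedie_variance(mu, p, phi):
    return mu + phi * mu ** p

mu = np.array([1.0, 2.0, 4.0])
print(tweedie_variance(mu, 1.0))  # [1. 2. 4.] -- the Poisson-like case
```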

@Spaak
Author

Spaak commented Dec 20, 2023

Thanks for the clear responses!

Pinging @AJQuinn (JOSS editor) here. I'm not entirely sure how JOSS reviews proceed after something like this. The tutorial is part of the documentation, and I gave some suggestions to improve it. Can I safely move on to the rest of the review checklist, or should I aim to check whether the tutorial and paper have indeed improved after I leave such comments? (I'd prefer moving on and leaving a second/third/etc check to either the editor or a re-review, if it comes to that. Otherwise I'm worried the review process will take ages.) Thanks :)
