---
title: "Generalized Linear Model: Using Count Data"
subtitle: Learning how to use Generalized Linear models in R
title-block-banner: true
date: "`r Sys.Date()`"
author:
  - name: Ana Bravo & Tendai Gwanzura
    affiliations:
      - Florida International University
      - Robert Stempel College of Public Health and Social Work
toc: true
number-sections: true
highlight-style: pygments
format:
  html:
    code-fold: true
    html-math-method: katex
---
## Libraries Used
```{r}
#| message: false
library(katex) # renders math equations
library(gt) # publication ready tables
library(gtsummary) # extension to publication ready tables
library(devtools) # helps with package development
library(pscl) # maximum likelihood estimation of zero-inflated models for count data
# library(MEPS) # library for MEPS data, outdated for R version 4.3.0
library(MASS) # fits negative binomial models
library(tidyverse) # for wrangling data
```
## Introduction
In this presentation, we will discuss how to use the Generalized Linear Model (GLM) with count data in the R programming language. We will show how to clean and wrangle the data, identify the columns needed for our method, discuss the model's assumptions, walk through the code used to fit the GLM, and interpret the output.
## What is a GLM?
### Let's start with a quick overview of simple linear regression
As you know, a simple linear regression has two major components: Y, your dependent outcome, and X, your independent or predictor variable. The model looks something like this:
$$Y_i = \beta_0 + \beta_1X_i + \epsilon$$ where $\beta_0$ is your intercept and $\beta_1$ is your slope. A linear model is a function that is used to fit the data. We often use this method to assess the association, or the strength of the association, between two variables of interest:
where:
- $Y_i$ is your dependent variable of interest
- $\beta_0$ is your constant term
- $\beta_1$ is your slope coefficient
- and $X_i$ is your independent variable
- $\epsilon$ is your random error that cannot be explained.
```{r}
#| message: false
# load data:
data(trees)
# rename variables:
names(trees) <- c("DBH_in","height_ft", "volume_ft3")
# simple model:
model <- lm(DBH_in ~ height_ft, data = trees)
# plot:
simple_trees <-
# data
ggplot(data = trees) +
# x and y
aes(x = height_ft, y = DBH_in) +
# labels
labs(title = "Example of Simple Association - Using Trees Data",
x = "Height in feet",
y = "Diameter in inches") +
# add points
geom_point() +
# add the lm
geom_smooth(method = "lm", color = "blue", se = FALSE) +
# add a simple theme
theme_bw()
# actual plot:
simple_trees
```
Often you will find this model written in this form:
$$ E(Y \mid X) = \beta_0 + \beta_1 X$$ where $\beta_0$ and $\beta_1$ are our coefficients that need to be estimated; the conditional expectation drops the error term $\varepsilon$, since the random error averages out to zero. Our **goal** is to see the association between an outcome and an exposure.
### Assumptions of linear regression
Before diving into the generalized models, let's quickly review the assumptions of linear regression.
- Assumption 1: for each combination of the independent variables X, Y is a random variable with a [certain probability distribution.]{style="background-color:yellow"}
- Assumption 2: the Y values are statistically independent of one another.
- Assumption 3: the mean value of Y, for each specific combination of values of X, is a linear function of X.
- Assumption 4: the variance of Y is the same for any fixed combination of values of $X_1, \dots, X_n$.
- Assumption 5: for any fixed combination of $X_1, \dots, X_n$, Y is normally distributed.
Something to note: in the simple explanation above we are assuming our Y variable is continuous (remember one of the points above, that Y holds a specific probability distribution!). So let's talk about what happens when our Y variable is a count.
### Let's talk about the generalized linear model
The term "generalized" is a big umbrella term used to describe a large class of models. Our response variable $y_i$ follows an exponential-family distribution with mean $\mu_i$, which is often a nonlinear function of the covariates. However, [McCullagh and Nelder](https://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf) considered these models linear because the covariates affect the distribution of $y_i$ only through a linear combination.
There are three major components of a GLM (we will make them concrete with a short simulated sketch after the assumptions list below):
- A random component: specifies the probability distribution of our response variable.
- A systematic component: specifies the explanatory variables $(x_1, \dots, x_n)$ in our model, through their linear combination $(\beta_0 + \beta_1 x_1 + \dots)$.
- A link function: specifies the LINK between the random and systematic components. It tells us how the expected value of the response is related to the linear combination of our explanatory variables. For example, the link function $g(\mu) = \mu$, called the identity link, models the mean directly. Or, in the case we will talk about today, the log of the mean:
$$ \log(\pmb{\mu}) = \alpha + \beta_1 x_1 + \dots + \beta_n x_n $$ Generalized linear models also follow certain assumptions:
- The data points $Y_1, Y_2, \dots, Y_n$ are independently distributed (cases are independent).
- The dependent variable $Y_i$ does not need to be normally distributed, but it typically assumes a distribution from the exponential family, such as binomial, Poisson, multinomial, or normal.
- A GLM does not assume a linear relationship between the response variable and the explanatory variables, but it does assume a linear relationship between the transformed expected response (on the link scale) and the explanatory variables.
- Homogeneity of variance does not need to be satisfied.
- Explanatory variables can be nonlinear transformations of some original variables.
- Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary least squares.
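To make these three components concrete, here is a minimal sketch on simulated data (the names `x`, `mu`, and `y_sim` are invented for illustration): the random component is the Poisson response, the systematic component is the linear predictor $0.5 + 0.8x$, and the log link ties the two together.
```{r}
#| message: false
set.seed(42)
# systematic component: one explanatory variable
x <- runif(200, min = 0, max = 2)
# link function: log(mu) equals the linear predictor
mu <- exp(0.5 + 0.8 * x)
# random component: a Poisson-distributed response with mean mu
y_sim <- rpois(200, lambda = mu)
# fitting the GLM should recover coefficients near 0.5 and 0.8
glm(y_sim ~ x, family = poisson(link = "log"))
```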
### What is a Poisson Regression?
A Poisson regression models how the mean of a discrete (count) response variable $Y$ depends on our explanatory $X$ variables. Here is a simple look at the Poisson regression:
$$ \log \lambda_i = \beta_0 + \beta_1 x_i $$ where the random component is the distribution of $Y$, Poisson with mean $\lambda$; the systematic component is the explanatory variables (your X variables, which can be continuous or categorical) entering linearly, or transformed if nonlinear; and the link function is the log link stated in the section above.
The Poisson distribution itself follows this formula:
$$ P(X = x) = {\lambda^x e^{-\lambda} \over x!}$$ where
- $P(X = x)$ is the probability of $x$ occurrences in a given interval
- $\lambda$ is the mean number of occurrences during the interval
- $x$ is the number of occurrences of interest
- $e$ is the base of the natural logarithm
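As a quick sanity check, we can compute this formula by hand and compare it with R's built-in `dpois()` function (the values $\lambda = 3$ and $x = 2$ are arbitrary, chosen only for illustration):
```{r}
# probability of x = 2 occurrences when the mean is lambda = 3
lambda <- 3
x <- 2
# by hand, using the formula above
(lambda^x * exp(-lambda)) / factorial(x)
# built-in Poisson density; should give the same value
dpois(x, lambda = lambda)
```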
An advantage of using a GLM over an ordinary linear model is that the link function gives us more flexibility in modeling, and the model is fit by maximum likelihood. Additionally, we can use inference tools like the Wald test for logistic and Poisson models.
### Important points about Poisson models
- We use this type of link function with count data. Count data can be described as, for example, the number of devices that can access the internet, the number of sex partners in a lifetime, or the number of individuals infected with a disease.
- The Poisson distribution is unimodal and skewed to the right, and the mean and the variance are the same. In other words, when counts are larger they tend to be more varied.
- As $\mu$ increases, the skew decreases and the distribution becomes more bell shaped. Our mean tends to look something like this:
$$\pmb{\mu} = exp(\alpha + \beta x) = e^\alpha (e^\beta)^x $$ where a one-unit increase in X has a multiplicative impact of $e^\beta$ on the mean. (More on this a little later in the interpretation section!)
### A few last points before we move on
As with other models in statistics, one must check the assumptions of the regression, and yes, Poisson regression has them as well. The assumptions for a Poisson regression are as follows:
- Your Y (dependent) values must be counts.
- Counts must be non-negative integers (whole numbers) $(0, 1, 2, 3, \dots, k)$. Negative values will not work here.
- Explanatory variables can be continuous, dichotomous, or ordinal.
- Our count variable must follow a **Poisson distribution**, that is, the [mean and variance should be the same.]{style="background-color:yellow"} Remember this one!
$$ \pmb{\mu} = E(X) = \lambda $$ $$ \sigma^2 = \lambda $$
When you are modeling count data, the link scale is linear, so the effects are additive on the link scale. The response scale is nonlinear (it sits in the exponent), so the effects are *multiplicative.* Makes sense? We will work through a quick numeric sketch next, and then a full example.
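Here is that quick numeric sketch (the coefficient values $\alpha = 1$ and $\beta = 0.5$ are invented purely for illustration): adding $\beta$ on the log (link) scale multiplies the mean on the response scale by $e^\beta$.
```{r}
# invented coefficients, just to illustrate the point
alpha <- 1.0
beta <- 0.5
# means at x = 1 and x = 2 on the response scale
mu_1 <- exp(alpha + beta * 1)
mu_2 <- exp(alpha + beta * 2)
# one extra unit of x multiplies the mean by e^beta
mu_2 / mu_1
exp(beta)
```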
## Example 1: Using a GLM for count responses with the MEPS 2020 data set
Let's import the `National Health Care Survey` data set from the Medical Expenditure Panel Survey (MEPS) website, located [here.](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-224) The codebook for the 2020 Full-Year Consolidated data file can be found [here](https://meps.ahrq.gov/data_stats/download_data/pufs/h224/h224cb.pdf). MEPS is a large-scale survey of families and individuals and their medical coverage across the United States.
### Import the MEPS data set
```{r}
#| message: false
#| eval: false
# previously saved clean file (for reference; the full cleaning pipeline is below)
HC2020_clean <- read_csv("/Users/anabravo/Documents/GitHub/R-health-blog/HC2020_clean.csv")
# Load data from the AHRQ MEPS website (requires the MEPS package noted above)
hc2020 <- read_MEPS(file = "h224")
# let's make a copy of this data set so we don't alter the original
hc2020_2 <- hc2020
# these variable names are upper case; let's make them lower case
names(hc2020_2) <- tolower(names(hc2020_2))
# there are about 14,000 variables; let's keep only the ones we want to look at
hc2020_subset <- hc2020_2 |>
select(dupersid,obdrv20, ertot20, rthlth31, adpain42, region31, age20x, racev1x, sex, marry31x, educyr, faminc20, empst31, mcare20)
# cleaning names:--------------------------------------------------------------
# dupersid = Person ID
# obdrv20 = number of office based physician visits in 2020
# ertot20 = emergency rooms visits in 2020
# rthlth31 = perceived health status
# adpain42 = pain level
# region31 = Census region: Northeast, Midwest, South, and West
# age20x = Age as of Dec 2020
# racev1x = race: White, Black, American Indian, Asian, multiple races
# sex = sex at birth
# marry31x = marital status
# educyr = year of ed when entered in MEPS
# faminc20 = family total income
# empst31 = employment status
# mcare20 = Covered by medicare? Yes/no
# removing the old names because the column sub-headers mess things up sometimes ---
names(hc2020_subset) <- NULL
new_names <- c("Person_ID",
"Nubr_office_visits",
"Nubr_emergency_visits",
"Health_status",
"Pain_Level",
"Region",
"Age_as_Dec2020",
"Race",
"Is_male",
"Is_married",
"Education_lvl",
"Total_fam_incom",
"Is_employed",
"Covered_by_MediCar")
names(hc2020_subset) <- new_names
# this data set has codes for NA, will switch to NA in R ------------------------
HC2020_clean <- HC2020_clean |>
mutate(Pain_Level = na_if(Pain_Level, -15)) |>
mutate(Health_status = na_if(Health_status, -8)) |>
mutate(Region = na_if(Region, -1)) |>
mutate(Age_as_Dec2020 = na_if(Age_as_Dec2020, -1)) |>
mutate(Is_married = na_if(Is_married,-8 )) |>
mutate(Is_employed = na_if(Is_employed, -15)) |>
mutate(Education_lvl = na_if(Education_lvl, -15)) |>
mutate(Covered_by_MediCar = na_if(Covered_by_MediCar, -1))
# also, for ease of interpretation i will dichotomize some variables ------------
HC2020_clean <- HC2020_clean |>
mutate(Is_married = ifelse(Is_married == 1, 1, 0)) |>
mutate(Is_employed = ifelse(Is_employed == 1, 1, 0)) |>
mutate(Is_male = ifelse(Is_male == 1 , 1, 0 )) |>
mutate(Covered_by_MediCar = ifelse(Covered_by_MediCar == 1, 1, 0))
# i also want to transform some variable using case when -----------------------
HC2020_clean <- HC2020_clean |>
mutate(Health_status = case_when(
Health_status == 1 ~ "Good",
Health_status == 2 ~ "Good",
Health_status == 3 ~ "Fair",
Health_status == 4 ~ "Fair",
Health_status == 5 ~ "Poor"
))
# build HC2020_clean_2 from the renamed subset ---------------------------------
HC2020_clean_2 <- hc2020_subset |>
# change na values
mutate(Pain_Level = na_if(Pain_Level, -15)) |>
mutate(Health_status = na_if(Health_status, -8)) |>
mutate(Region = na_if(Region, -1)) |>
mutate(Age_as_Dec2020 = na_if(Age_as_Dec2020, -1)) |>
mutate(Is_married = na_if(Is_married,-8 )) |>
mutate(Is_employed = na_if(Is_employed, -15)) |>
mutate(Education_lvl = na_if(Education_lvl, -15)) |>
mutate(Covered_by_MediCar = na_if(Covered_by_MediCar, -1)) |>
# change married status to binary
mutate(Is_married = ifelse(Is_married == 1, 1, 0)) |>
mutate(Is_employed = ifelse(Is_employed == 1, 1, 0)) |>
mutate(Is_male = ifelse(Is_male == 1 , 1, 0 )) |>
mutate(Covered_by_MediCar = ifelse(Covered_by_MediCar == 1, 1, 0)) |>
#change healthoutcome categorical for ease of inerpretation
mutate(Health_status = case_when(
Health_status == 1 ~ "Good",
Health_status == 2 ~ "Good",
Health_status == 3 ~ "Fair",
Health_status == 4 ~ "Fair",
Health_status == 5 ~ "Poor"
))
# write this data set ---------------------------------------------------------
write_csv(
HC2020_clean_2,
file = "/Users/anabravo/Documents/GitHub/R-health-blog/HC2020_clean_final.csv"
)
```
Let's say I want to take a sneak peek at this data set. I will use the `gt` package to look at the first several columns of the data set, and I will make the table interactive using the `opt_interactive()` function in the `gt` package.
```{r}
#| message: false
HC2020_clean_2 <- read_csv("/Users/anabravo/Documents/GitHub/R-health-blog/HC2020_clean_final.csv")
HC2020_clean_2 |>
select(Person_ID, Nubr_office_visits, Health_status, Pain_Level, Region, Is_employed, Is_married, Is_male) |>
#head(10) |>
gt() |>
tab_header(
title = "Brief view of HC 2020 data set",
subtitle = "2020 Data from MEPS website"
) |>
opt_interactive(
use_search = TRUE,
use_highlight = TRUE,
use_compact_mode = TRUE,
use_resizers = TRUE,
use_text_wrapping = FALSE,
pagination_type = "jump"
)
```
## Some quick data exploration
Remember what I said earlier about the mean and variance being the same? EDA is important because it can tell us more about our variable of interest. For this particular question, we are interested in the number of times an event has occurred; in this case, the number of times someone pays a visit to an office-based doctor. This data set follows everyone for exactly one year.
- I am interested in seeing the number of physician office visits in the last year (we can also call this our Y variable)
- We are interested in seeing how self-perceived health status (Good, Fair, or Poor), sex (1 = male, 0 = female), and marital status (1 = married, 0 = not married/other categories) are associated with the number of doctor visits (our X variables).
Because we are looking at count data, it is important to check the mean and variance of our Y variable. If the mean and variance are not the same, we might need a different model, or some sort of *transformation*; transforming a variable can also help with normality.
### Let's look at the outcome variable of interest
```{r}
mean(HC2020_clean_2$Nubr_office_visits, na.rm = TRUE)
```
The mean of our outcome variable is about 2.79,
```{r}
var(HC2020_clean_2$Nubr_office_visits, na.rm = TRUE)
```
and the variance is about 34.59, so the mean and variance are clearly not the same. We will need to address this, but as an example, let's first run a generalized linear model without doing any kind of transformation to the data.
### Checking the distribution of the data
```{r}
#| message: false
hist(
x = HC2020_clean_2$Nubr_office_visits,
main = "Distrubtion of Number of Office visits",
xlab = "Office visits within a year"
)
```
I might also be interested in the general distribution of responses for our Y variable of interest:
```{r}
# I'm just interested in the frequency
HC2020_clean_2 |>
group_by(Nubr_office_visits) |>
summarise(n = n()) |>
mutate(Freq = n / sum(n), Freq = scales::percent(Freq)) |>
gt() |>
opt_interactive(
use_search = TRUE,
use_highlight = TRUE,
use_compact_mode = TRUE,
use_resizers = TRUE,
use_text_wrapping = FALSE,
pagination_type = "jump"
)
```
We might need to check whether we have overdispersion in this model. A quick way to check is by dividing the residual deviance by the degrees of freedom. If this value is much greater than one, then we should probably consider a negative binomial distribution.
## How to build a GLM with a Poisson Distribution
The basic call to fit a GLM looks something like this: `glm(counts ~ outcome + treatment, data = data_set, family = "poisson")` which includes:
- `counts`: your Y variable
- `data`: where you're pulling the data from
- `~`: the formula operator, which works like an equals sign
- `outcome + treatment`: your X variables
- `family`: the link or distribution you are following
*We are interested in measuring number of visits to a doctors office as a function of perceived health, sex assigned at birth, and marital status.*
```{r}
#| code-fold: false
# build first model
PoissonModel1 <- glm(Nubr_office_visits ~ Health_status + Is_male + Is_married,
data = HC2020_clean_2, family = "poisson")
# check summary
summary(PoissonModel1)
```
This summary gives us the null deviance (the analogue of the total sum of squares) and the residual deviance (the analogue of the sum of squared errors, i.e., the unexplained deviance). We also get the model coefficients for health status, sex assigned at birth, and marital status, and an AIC of `199315` (although AIC is most useful when comparing model fits).
Let's divide the residual deviance by the degrees of freedom.
```{r}
#| message: false
PoissonModel1$deviance / PoissonModel1$df.residual
```
The value we got was about 5.56, far greater than one, so we should consider a negative binomial regression.
Additionally, this raw output is not very readable, so in a moment we will feed the model into `gtsummary` so it looks a little better.
### Displaying Poisson Model 1 with gtsummary
```{r}
#| message: true
PoissonModel1 |>
tbl_regression(exponentiate = TRUE) |>
bold_levels() |>
bold_p(t = .1)
```
## Interpreting Rate Ratio
For this particular data set, we are interested in the number of doctor visits over a year. The coefficient for $\beta_1$ is -0.579 and is statistically significant, meaning that perceived health influences the rate of doctor visits. Because it is negative, we can work out by how much: $e^{-0.579} = 0.56$. This is our rate ratio, the multiplicative effect on doctor visits for those with perceived good health compared to fair health. Because the rate ratio is \< 1, perceived good health acts as a protective factor, and we can interpret this as:
[**Perceived good health is associated with a 44% reduction (0.56 - 1 = -0.44) in doctor visits in a year.**]{style="background-color:yellow"}
Let's interpret perceived poor health status compared to fair health. The coefficient here is $\beta$ = 0.74 and is statistically significant. Because it is positive, we can again work out by how much and in which direction: $e^{0.74} = 2.12$.
[**In other words, those who perceive themselves to be in poor health have 2.12 times the doctor visits of those in fair health. Or we can also say that perceived poor health is associated with a 112% increase (2.12 - 1 = 1.12) in doctor visits.**]{style="background-color:yellow"}
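These rate ratios and their confidence intervals can also be pulled straight from the fitted model object; here is a minimal sketch, assuming `PoissonModel1` from the chunk above is still in the workspace:
```{r}
#| eval: false
# exponentiate the coefficients to get rate ratios
exp(coef(PoissonModel1))
# 95% confidence intervals, also on the rate-ratio scale
exp(confint(PoissonModel1))
```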
## How to account for overdispersion
Remember how we noted earlier that the model we fit is most likely overdispersed? We should try to account for that. Overdispersion means there is more variation than a Poisson regression model allows. Some ways to remediate this are:
- selecting a different random-component distribution that can account for the extra variation (like the negative binomial)
- using a nonlinear model or a generalized linear mixed model to handle random effects.
In this case we will go with the negative binomial distribution.
## Negative Binomial Distribution
A negative binomial regression can be used for overdispersed count data, that is, as mentioned, when the variance exceeds the mean. It can be considered a generalization of the Poisson regression: it has the same mean structure as a Poisson regression, but an extra parameter to model the overdispersion. Because of this extra parameter, the confidence intervals for a negative binomial model may be a bit wider than those from a plain Poisson regression model.
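Concretely, where the Poisson model forces the variance to equal the mean, the negative binomial adds a dispersion parameter $\theta$ (this is the parameterization used by `MASS::glm.nb()`):
$$ Var(Y) = \mu + \frac{\mu^2}{\theta} $$
so as $\theta \rightarrow \infty$ the extra term vanishes and we are back at the Poisson model.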
> *A note about zero-inflated regression models: this type of model attempts to account for an excess of zeros. Zero-inflated models estimate two models at the same time, one for the counts and one for the excess zeros.*
> *We can also compare our model to a zero-inflated model to check the fit:*
```{r}
#| eval: true
#| message: false
# the negative binomial is probably the better fit
PoissonModel2_NB <- glm.nb(Nubr_office_visits ~ Health_status + Is_male + Is_married, data = HC2020_clean_2)
# fit model with log link
PoissonModel3_ZI <- zeroinfl(Nubr_office_visits ~ Health_status + Is_male + Is_married, data = HC2020_clean_2, dist = "poisson")
#summary analysis
summary(PoissonModel3_ZI)
# comparing AIC scores of the zero-inflated and negative binomial models
AIC(PoissonModel2_NB, PoissonModel3_ZI)
```
As suspected, the negative binomial model `PoissonModel2_NB` seems to be the better fit (lower AIC).
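If we wanted a formal test rather than an eyeball comparison of AIC values, the `pscl` package also provides `vuong()` for comparing non-nested count models; a sketch, assuming the Poisson and zero-inflated model objects from above are still available:
```{r}
#| eval: false
# Vuong non-nested test: plain Poisson fit vs. zero-inflated Poisson fit
vuong(PoissonModel1, PoissonModel3_ZI)
```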
Let's go with a negative binomial regression analysis. In this case we will need to utilize the `MASS` package to estimate negative binomial regression parameters and account for the dispersion in this data set.
```{r}
PoissonModel2_NB <- glm.nb(Nubr_office_visits ~ Health_status + Is_male + Is_married, data = HC2020_clean_2)
summary(PoissonModel2_NB)
```
Let's check the dispersion of this negative binomial model:
```{r}
PoissonModel2_NB$deviance / PoissonModel2_NB$df.residual
```
A dispersion value of 1.00 is much better than 5.56. The AIC for this model is `112172`, which is also much better than the original Poisson fit from model 1. Let's feed this model into `gtsummary`:
```{r}
PoissonModel2_NB |>
tbl_regression(exponentiate = TRUE) |>
bold_levels() |>
bold_p(t = .1)
```
## Interpreting males and those who are married in this data set
Our reference group is coded 0, so for the variables `Is_male` and `Is_married` the reference groups are female and not married, respectively. When you are fitting a model, it is very important to double-check your reference group, because different statistical software may default to different reference groups.
- **Is_male:** $e^{-0.32} = 0.72$; comparing males to females, identifying as male is associated with a 28% reduction (0.72 - 1 = -0.28) in doctor visits.
- **Is_married:** $e^{0.32} = 1.38$; comparing married to non-married individuals, being married is associated with a 38% increase (1.38 - 1 = 0.38) in doctor visits. Or, the married group has 1.38 times the doctor visits of the non-married group.
## Conclusion
Today you learned the basics of linear regression; the benefits of a generalized linear model and its assumptions; what a Poisson regression is and the assumptions that need to be satisfied to run it; how to build a GLM with a Poisson (log) link; how to clean data from MEPS; how to visualize counts in this data; how to account for overdispersion; and how to interpret rate ratios.
## References
1. McCullagh, P., and J. A. Nelder. 2019. "An Outline of Generalized Linear Models," January, 21--47. [https://doi.org/10.1201/9780203753736-2](https://www.taylorfrancis.com/chapters/mono/10.1201/9780203753736-2/outline-generalized-linear-models-mccullagh-nelder)
2. Nelder, J. A., and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Journal of the Royal Statistical Society. Series A (General) 135 (3): 370. [https://doi.org/10.2307/2344614.](https://www.jstor.org/stable/2344614?origin=crossref)
3. Nguyen, M. (2023). "A Guide on Data Analysis". Made on Bookdown. <https://bookdown.org/mike/data_analysis/poisson-regression.html>
4. Bruin, J. (2011). Negative Binomial Regression, R Data Analysis Examples. <https://stats.oarc.ucla.edu/r/dae/negative-binomial-regression/>
5. Introduction to GLMs, STAT 504 Analysis of Discrete Data. Pennsylvania State University <https://online.stat.psu.edu/stat504/lesson/6/6.1>