---
title: "Life Expectancy Prediction"
author: Vipul Gharde, Chaitanya Bachhav
output: pdf_document
date: "2022-12-05"
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Abstract
We have implemented a Linear Regression model to predict the life expectancy of the human population. The processed dataset contains ~1650 observations of ~20 variables related to life expectancy and health factors for 193 countries, provided by the Global Health Observatory (GHO) data repository under the World Health Organization (WHO). Several model building techniques, including Forward Selection, Backward Elimination, and Stepwise Regression, were used to obtain candidate models, which were then evaluated with K-Fold Cross Validation to select the model with the lowest RMSE. Our best model has an adjusted R-squared of 0.8239, an RMSE of 3.71, and an MAE of 2.85; it satisfies the normality assumption and shows no issues with multicollinearity among the variables.
# Introduction
Life expectancy is an estimate of the expected average number of years of life (or a person's age at death) for individuals who were born into a particular population. It is one of the most used summary indicators for the overall health of a population. Its levels and trends direct health policies, and researchers try to identify the determining risk factors to assess and forecast future developments. [@luy2020life]
The goal of this project is to build a Linear Regression model that can predict the life expectancy of the human population based on factors such as the average Body Mass Index (BMI), a country's Gross Domestic Product (GDP), alcohol consumption, and immunization coverage among 1-year-olds for vaccines such as Hepatitis B, Polio, and Diphtheria. We also aim to derive insights into which factors are significant in determining a higher or lower life expectancy of the human population.
# Materials and Methods
## Software and Packages Used
We have used R with the RStudio Integrated Development Environment for our analysis and for building the Linear Regression models. We have also used the R packages `corrplot`, `ggplot2`, `car`, `olsrr`, and `caret` to aid in our analysis and model building.
## Dataset
The data related to life expectancy and health factors for 193 countries is taken from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO). Its corresponding economic data was collected from the United Nations website for a period of 16 years (2000-2015).
The dataset is available at https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who.
```{r, include=FALSE}
library(corrplot)
library(ggplot2)
library(car)
library(olsrr)
library(caret)
```
\newpage
```{r, include=FALSE}
life = read.csv('Life Expectancy Data.csv') # Load Dataset
life.master = life # For Backup
```
`Life Expectancy Data.csv` contains the following fields:
+ `Country` - Country Observed.
+ `Year` - Year Observed.
+ `Status` - Developed or Developing status.
+ `Life.expectancy` - Life Expectancy in age.
+ `Adult.Mortality` - Adult Mortality Rates on both sexes (probability of dying between 15-60 years/1000 population).
+ `infant.deaths` - Number of Infant Deaths per 1000 population.
+ `Alcohol` - Alcohol recorded per capita (15+) consumption (in litres of pure alcohol).
+ `percentage.expenditure` - Expenditure on health as a percentage of Gross Domestic Product per capita (%).
+ `Hepatitis.B` - Hepatitis B (HepB) immunization coverage among 1-year-olds (%).
+ `Measles` - Number of reported Measles cases per 1000 population.
+ `BMI` - Average Body Mass Index of entire population.
+ `under.five.deaths` - Number of under-five deaths per 1000 population.
+ `Polio` - Polio (Pol3) immunization coverage among 1-year-olds (%).
+ `Total.expenditure` - General government expenditure on health as a percentage of total government expenditure (%).
+ `Diphtheria` - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%).
+ `HIV.AIDS` - Deaths per 1000 live births due to HIV/AIDS (0-4 years).
+ `GDP` - Gross Domestic Product per capita (in USD).
+ `Population` - Population of the country.
+ `thinness..1.19.years` - Prevalence of thinness among children and adolescents for Age 10 to 19 (%).
+ `thinness.5.9.years` - Prevalence of thinness among children for Age 5 to 9 (%).
+ `Income.composition.of.resources` - Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
+ `Schooling` - Number of years of Schooling (years).
In total, there are 2938 observations of 22 variables with 20 of them being numerical and 2 categorical (`Country` and `Status`).
We are using `Life.expectancy` to predict the life expectancy of the human population with the given independent variables in the dataset.
For data cleaning, we have dropped every observation with a missing value in any of its columns. This shrinks our dataset to 1649 observations.
```{r, include=FALSE}
life = na.omit(life)
```
We have plotted a boxplot and a histogram for all the numerical variables in the dataset. For categorical variables, we have plotted a barplot indicating the counts of each category of the variable. This can be viewed in Appendix A of the report.
## Feature Selection
We have removed some of the variables for building the model due to the reasons mentioned below:
`Country` - Contains too many levels with no additional information to predict `Life.expectancy`.\
`Year` - Contains time series data with no additional information to predict `Life.expectancy`.
```{r, include=FALSE}
life = life[, !(names(life) %in% c('Country', 'Year'))]
```
We have also transformed `Hepatitis.B`, `Polio`, and `Diphtheria` for building the model, since the range between their minimum values and their 1st quartiles is too wide. We have converted each of them into a categorical variable with two levels: '<90% Covered' and '>=90% Covered'.
```{r, include=FALSE}
life$Hepatitis.B = ifelse(life$Hepatitis.B < 90, '<90% Covered', '>=90% Covered')
life$Polio = ifelse(life$Polio < 90, '<90% Covered', '>=90% Covered')
life$Diphtheria = ifelse(life$Diphtheria < 90, '<90% Covered', '>=90% Covered')
```
This leaves us with 1649 observations of 20 variables with 16 of them being numerical and 4 categorical (`Status`, `Hepatitis.B`, `Polio`, and `Diphtheria`).\
```{r}
summary(life)
```
\newpage
## Correlations
We want to look at the correlation matrix to see which variables are correlated with the target variable, and also to check whether any independent variable is correlated with another independent variable. Correlated independent variables can negatively impact a model's performance, so if any such pair is found, we keep only one of them in the model.
Since the number of variables is moderately large, we have plotted a correlation plot of the dataset rather than printing the correlation matrix itself. The colors and their shades make it easy to see which pairs of variables are correlated.
```{r, echo=FALSE}
# Identify the numeric columns and plot their correlation matrix
life_nums = unlist(lapply(life, is.numeric), use.names = FALSE)
corrplot(
  cor(life[, life_nums]),
  method = 'number',
  tl.cex = 0.5,
  number.cex = 0.33,
  cl.cex = 0.5
)
```
There are a few takeaways from this correlation plot:
* `Life.expectancy` has a strong positive correlation with `Income.composition.of.resources` and `Schooling`.
* `Life.expectancy` has a negative correlation with `Adult.Mortality`, which makes sense since if the adult mortality rate is high, then obviously the life expectancy will be low.
* `Life.expectancy` has a very weak correlation with `Measles` and `Population`.
* There is a very strong correlation between `infant.deaths` and `under.five.deaths`, `percentage.expenditure` and `GDP`, and `thinness..1.19.years` and `thinness.5.9.years`, indicating multicollinearity between them. Therefore, we have removed `under.five.deaths`, `percentage.expenditure`, and `thinness.5.9.years` for building the model.
```{r, include=FALSE}
life = life[, !(
names(life) %in% c(
'under.five.deaths',
'percentage.expenditure',
'thinness.5.9.years'
)
)]
```
\newpage
The scatterplots of `Life.expectancy` against `Income.composition.of.resources` and against `Schooling` show a positive trend: life expectancy increases as the values of these independent variables increase.
```{r, echo=FALSE}
ggplot(data = life,
       aes(x = Income.composition.of.resources, y = Life.expectancy)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = 'lm')
ggplot(data = life, aes(x = Schooling, y = Life.expectancy)) + geom_point() +
geom_smooth(formula = y ~ x, method = 'lm')
```
\newpage
The scatterplot of `Life.expectancy` against `Adult.Mortality` shows a negative trend: life expectancy decreases as the value of this independent variable increases.
```{r, echo=FALSE}
ggplot(data = life, aes(x = Adult.Mortality, y = Life.expectancy)) + geom_point() +
geom_smooth(formula = y ~ x, method = 'lm')
```
\newpage
The scatterplots of `Life.expectancy` against `Measles` and against `Population` appear to show a negative trend in life expectancy as these variables increase. However, since the bulk of the data falls in the lower range of their values, there are some high leverage and high influence points that appear to pull the regression line downward. This is clearly evident in the scatterplot of `Life.expectancy` against `Population`.
```{r, echo=FALSE}
ggplot(data = life, aes(x = Measles, y = Life.expectancy)) + geom_point() +
geom_smooth(formula = y ~ x, method = 'lm')
ggplot(data = life, aes(x = Population, y = Life.expectancy)) + geom_point() +
geom_smooth(formula = y ~ x, method = 'lm')
```
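As a supplementary illustration (an assumption on our part, not part of the original analysis), the high leverage and high influence points mentioned above could be flagged with base R's `hatvalues()` and `cooks.distance()` on a simple model of `Life.expectancy` against `Population`; the cutoffs $2p/n$ and $4/n$ used below are common rules of thumb.
```{r, eval=FALSE}
# Sketch: flag high leverage / high influence observations for a simple
# model of Life.expectancy on Population (illustrative only).
lmod_pop = lm(Life.expectancy ~ Population, data = life)

n = nrow(life)
p = length(coef(lmod_pop))

lev = hatvalues(lmod_pop)       # leverage of each observation
cd  = cooks.distance(lmod_pop)  # Cook's distance of each observation

high_leverage  = which(lev > 2 * p / n)  # rule-of-thumb cutoff 2p/n
high_influence = which(cd > 4 / n)       # rule-of-thumb cutoff 4/n

length(high_leverage)
length(high_influence)
```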
\newpage
## Model Building
We now build a Linear Regression Model using all the remaining variables to predict the life expectancy of the human population. We will set the level of $\alpha$ to be 0.05 throughout the analysis.
```{r}
lmod_all = lm(Life.expectancy ~ ., data = life)
summary(lmod_all)
```
There are a few takeaways from this full model:
* The p-value of the model is less than 2.2e-16, which is well below 0.05, indicating that the model is significant.
* The adjusted R-squared value of the model is 0.8243, indicating that about 82.43% of the variability observed in the target variable can be explained by the independent variables in the model, which is quite a good result and can possibly be improved even further with model selection.
* `Adult.Mortality`, `Alcohol`, `BMI`, `HIV.AIDS`, `GDP`, `Income.composition.of.resources` and `Schooling` are the most significant variables with p-value < 0.05.
* From the model we can interpret that `Income.composition.of.resources` has a strong positive effect on life expectancy.
* From the model we can interpret that `StatusDeveloping`, `Adult.Mortality`, `infant.deaths`, `Alcohol`, `HIV.AIDS`, and `thinness..1.19.years` may have a negative effect on life expectancy.
* A peculiar result we can interpret from the model is that `Hepatitis.B>=90% Covered` may also have a negative effect on life expectancy.
We now generate candidate models using different techniques: the Forward Selection method, the Backward Elimination method, and the Stepwise Regression method.
### Forward Selection Method
The Forward Selection method involves building a model starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.
For building the model using Forward Selection method, we have used the default p-to-enter value of 0.3.
```{r}
ols_step_forward_p(lmod_all)
```
The Forward Selection method included all the variables in the model.
### Backward Elimination Method
The Backward Elimination method involves building a model starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.
For building the model using Backward Elimination method, we have used the default p-to-remove value of 0.3.
```{r}
ols_step_backward_p(lmod_all)
```
The Backward Elimination method did not eliminate any variables from the model.
\newpage
### Stepwise Regression Method
The Stepwise Regression method is a combination of the above two methods, starting with no variables in the model and testing at each step for variables to be included or excluded.
For building the model using Stepwise Regression method, we have used the default p-to-enter value of 0.1 and the default p-to-remove value of 0.3.
```{r}
ols_step_both_p(lmod_all)
```
A total of 12 variables were included in the model built using Stepwise Regression method.
\newpage
In summary, the variables chosen by the methods are indicated in the following tables (x denotes that the variable was chosen by the corresponding method):
| Model Selection Method | `Status` | `Adult.Mortality` | `infant.deaths` | `Alcohol` |
|-----------------------:|:--------:|:-----------------:|:---------------:|:---------:|
| Forward Selection | x | x | x | x |
| Backward Elimination | x | x | x | x |
| Stepwise Regression | x | x | | x |
| Model Selection Method | `Hepatitis.B` | `Measles` | `BMI` | `Polio` | `Total.expenditure` |
|-----------------------:|:-------------:|:---------:|:-----:|:-------:|:-------------------:|
| Forward Selection | x | x | x | x | x |
| Backward Elimination | x | x | x | x | x |
| Stepwise Regression | x | | x | | x |
| Model Selection Method | `Diphtheria` | `HIV.AIDS` | `GDP` | `Population` |
|-----------------------:|:------------:|:----------:|:-----:|:------------:|
| Forward Selection | x | x | x | x |
| Backward Elimination | x | x | x | x |
| Stepwise Regression | x | x | x | |
| Model Selection Method | `thinness..1.19.years` | `Income.composition.of.resources` | `Schooling` |
|-----------------------:|:----------------------:|:----------------------------------:|:-----------:|
| Forward Selection | x | x | x |
| Backward Elimination | x | x | x |
| Stepwise Regression | x | x | x |
Both the Forward Selection method and Backward Elimination method have chosen the same set of variables.
### K-Fold Cross Validation
We are left with 2 candidate models: the full model and the model built using the Stepwise Regression method. To decide between them, we ran K-Fold Cross Validation on both models and picked the one with the lower mean RMSE value as our final model. We chose K = 5.
In K-Fold Cross-Validation, the original sample of the dataset is randomly partitioned into K equal sized subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times, with each of the K subsamples used exactly once as the validation data. The K results are then averaged to produce a single estimation, which in our case is the mean RMSE value.
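Written as a formula (our notation), the reported cross-validation error is the average of the per-fold errors:
$$\overline{\mathrm{RMSE}} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{RMSE}_k,$$
where $\mathrm{RMSE}_k$ is the root-mean-square error computed on the $k$-th held-out fold.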
```{r}
# Define training control
set.seed(13245)
train.control = trainControl(method = 'cv', number = 5)
```
Cross-Validation for the full model:
```{r}
# Train the model
CV_all = train(
Life.expectancy ~ .,
data = life,
method = 'lm',
trControl = train.control
)
```
\newpage
```{r}
# Summarize the results
CV_all
```
Cross-Validation for model chosen by Stepwise Regression method:
```{r}
# Train the model
CV_stepwise = train(
  Life.expectancy ~ Schooling + HIV.AIDS + Adult.Mortality +
    Income.composition.of.resources + BMI + GDP + Diphtheria + Alcohol +
    thinness..1.19.years + Status + Hepatitis.B + Total.expenditure,
  data = life,
  method = 'lm',
  trControl = train.control
)
```
```{r}
# Summarize the results
CV_stepwise
```
Results of 5-Fold Cross-Validation on the 2 models:
```{r}
rbind(CV_all$results, CV_stepwise$results)
```
Since the model chosen by the Stepwise Regression method (`CV_stepwise` above) has a lower mean RMSE value, we have selected it as our final model.
# Results
Out of the 2 candidate models, we have picked the model chosen by the Stepwise Regression method to be our final model. This decision was based on the fact that the Stepwise Regression model had the lowest mean RMSE value when evaluated with K-Fold Cross Validation (K = 5).
```{r}
lmod_final = lm(
  Life.expectancy ~ Schooling + HIV.AIDS + Adult.Mortality +
    Income.composition.of.resources + BMI + GDP + Diphtheria + Alcohol +
    thinness..1.19.years + Status + Hepatitis.B + Total.expenditure,
  data = life
)
summary(lmod_final)
```
The final model contains 12 variables: `Schooling`, `HIV.AIDS`, `Adult.Mortality`, `Income.composition.of.resources`, `BMI`, `GDP`, `Diphtheria>=90% Covered`, `Alcohol`, `thinness..1.19.years`, `Status`, `Hepatitis.B>=90% Covered`, and `Total.expenditure`.
\newpage
There are a few takeaways from this final model:
* The p-value of the model is less than 2.2e-16, which is well below 0.05, indicating that the model is significant.
* The adjusted R-squared value of the model is 0.8239, indicating that about 82.39% of the variability observed in the target variable can be explained by the independent variables in the model.
* Nearly all the variables in the model are significant, with p-values < 0.05.
* From the model we can interpret that `Income.composition.of.resources` has a strong positive effect on life expectancy.
* From the model we can interpret that `HIV.AIDS`, `Adult.Mortality`, `Alcohol`, `thinness..1.19.years`, and `StatusDeveloping` may have a negative effect on life expectancy.
* `Hepatitis.B>=90% Covered` may also have a negative effect on life expectancy, the same observation we had seen previously on the full model.
## Model Error Estimation
We have primarily used the R-squared, the adjusted R-squared, the root-mean-square error (RMSE), and the mean absolute error (MAE) as the metrics for evaluating our models.
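For reference, these metrics can be written as follows (our notation, with $y_i$ the observed life expectancy, $\hat{y}_i$ the fitted or predicted value, $\bar{y}$ the mean of the observed values, $n$ the number of observations, and $p$ the number of predictors):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert,$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}.$$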
The estimates for the final model are derived from the results of K-Fold Cross Validation (K = 5), and are summarized in the following table:
| Metric | Estimate |
|----------:|:---------:|
| R-squared | 0.8219568 |
| RMSE | 3.711009 |
| MAE | 2.846318 |
## Model Adequacy Checking
To make sure our final model behaves as expected in terms of inference and prediction, we need to test that the assumptions made in building the Linear Regression model are not broken. These assumptions are:
1. The relationship between the response $y$ and the regressors is linear, at least approximately.
2. The error term $\epsilon$ has zero mean.
3. The error term $\epsilon$ has constant variance $\sigma^2$.
4. The errors are uncorrelated.
5. The errors are normally distributed. [@montgomery2021introduction]
We will be testing these assumptions by checking whether the residuals of the model are normally distributed, whether the plot of residuals against fitted values shows any pattern, and whether there are any anomalies in the Q-Q plot.
```{r, echo=FALSE}
hist(lmod_final$residuals)
```
Most of the residuals are concentrated around the center of the histogram and the distribution is roughly symmetric, suggesting that they are approximately normally distributed.
```{r, echo=FALSE}
plot(lmod_final, which = c(1:2))
```
There is no obvious observable pattern in the above plots, and so we conclude that our final model is appropriate.
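As a complementary, more formal check of the normality assumption (a sketch on our part, not part of the original analysis), the Shapiro-Wilk test could be applied to the residuals; note that with roughly 1650 observations the test is very sensitive and may flag even minor departures from normality.
```{r, eval=FALSE}
# Sketch: formal normality test on the residuals of the final model.
# With large samples this test tends to reject for small deviations from normality.
shapiro.test(residuals(lmod_final))
```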
We also check for multicollinearity among the variables in the model, using the Variance Inflation Factor (VIF) to quantify it.
```{r}
vif(lmod_final)
```
A VIF > 10 implies serious problems with multicollinearity.\
Since the VIF for all of the predictors is less than 10, there seems to be no issue with multicollinearity.
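For reference, the VIF of the $j$-th predictor is
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the coefficient of determination obtained by regressing the $j$-th predictor on all the other predictors in the model.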
\newpage
# Discussion
We have implemented a Linear Regression model that predicts the life expectancy of the human population using 12 variables related to life expectancy and health factors. We have shown that several of these independent variables have some correlations with life expectancy, suggesting that fitting a Linear Regression model to this dataset would be appropriate. Our model also passes the normality assumption and has no issues with the multicollinearity of the variables.
While all the independent variables in our model are significant, the inferences made from this model should be taken with a grain of salt - we do not believe that an increase in Hepatitis B (HepB) immunization coverage among 1-year-olds should be associated with a declining life expectancy in a country. This calls for further investigation.
We had planned to include the All Possible Regressions method and the Best Subsets Regression method for building our candidate models, but since the computational time for running these methods was too high on our machines, we have omitted these methods in our analysis.
Due to the limited amount of time for this analysis, we have not explored Polynomial Regression or the inclusion of interaction terms in our models. We believe that, in particular, a squared term for `BMI` would be appropriate, since it is well known that a BMI that is too low or too high poses health risks. It is also possible to convert this numerical variable into categorical values, for example, 'Underweight' for BMI < 18.5, 'Healthy' for 18.5 $\le$ BMI < 25, 'Overweight' for 25 $\le$ BMI < 30, and 'Obese' for BMI $\ge$ 30. A sketch of both ideas is shown below.
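Both ideas could be sketched as follows (illustrative only; the quadratic term and the variable name `BMI.category` are our own assumptions rather than models fitted in this report):
```{r, eval=FALSE}
# Sketch: add a quadratic BMI term to the final model's formula.
lmod_bmi2 = lm(
  Life.expectancy ~ Schooling + HIV.AIDS + Adult.Mortality +
    Income.composition.of.resources + BMI + I(BMI^2) + GDP + Diphtheria +
    Alcohol + thinness..1.19.years + Status + Hepatitis.B + Total.expenditure,
  data = life
)

# Sketch: convert BMI into the categories described above.
life$BMI.category = cut(
  life$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c('Underweight', 'Healthy', 'Overweight', 'Obese'),
  right = FALSE  # intervals are [lower, upper), matching BMI < 18.5, etc.
)
```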
We are curious to know how our model compares to other regression techniques such as KNN Regression, Support Vector Regression, Decision Tree Regression, and also Random Forest Regression. We have left this for future work.
Though we believe that the variables we have in our dataset are relevant in predicting the life expectancy of the human population, we think that having more relevant variables like sex, exercise, smoking, and environmental pollution would be even more helpful for inference and prediction. It is also always beneficial to collect more observations of data related to life expectancy and health factors for building our model.
# Literature Cited
<div id="refs"></div>
\newpage
# Appendix A - Variable Distribution Plots
```{r, echo=FALSE}
barplot(
table(life.master$Country),
main = 'Barplot of Country',
xlab = 'Country',
ylab = 'Frequency',
cex.names = 0.2,
las = 2
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Year, main = 'Year', ylab = 'Year')
hist(life.master$Year, main = 'Histogram of Year', xlab = 'Year')
```
```{r, echo=FALSE}
barplot(
table(life.master$Status),
main = 'Barplot of Status',
xlab = 'Status',
ylab = 'Frequency'
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Life.expectancy,
main = 'Life Expectancy',
ylab = 'Age')
hist(life.master$Life.expectancy,
main = 'Histogram of Life Expectancy',
xlab = 'Age')
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Adult.Mortality,
main = 'Adult Mortality Rate',
ylab = 'Mortality Rates on Both Sexes')
hist(
life.master$Adult.Mortality,
main = 'Histogram of Adult Mortality Rate',
xlab = 'Mortality Rates on Both Sexes',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$infant.deaths, main = 'Infant Deaths', ylab = 'Deaths / 1000 Population')
hist(life.master$infant.deaths, main = 'Histogram of Infant Deaths', xlab = 'Deaths / 1000 Population')
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Alcohol, main = 'Alcohol Consumption', ylab = 'Litres')
hist(
life.master$Alcohol,
main = 'Histogram of Alcohol Consumption',
xlab = 'Litres',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$percentage.expenditure,
main = 'Health Expenditure',
ylab = '% of GDP per Capita')
hist(
life.master$percentage.expenditure,
main = 'Histogram of Health Expenditure',
xlab = '% of GDP per Capita',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(
life.master$Hepatitis.B,
main = 'Hepatitis B (HepB) Immunization',
ylab = '% Coverage Among 1-Year-Olds',
cex.main = 0.9
)
hist(
life.master$Hepatitis.B,
main = 'Histogram of HepB Immunization',
xlab = '% Coverage Among 1-Year-Olds',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Measles, main = 'Measles', ylab = 'Reported Cases / 1000 Population')
hist(life.master$Measles, main = 'Histogram of Measles', xlab = 'Reported Cases / 1000 Population')
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$BMI, main = 'Average BMI', ylab = 'BMI')
hist(life.master$BMI, main = 'Histogram of Average BMI', xlab = 'BMI')
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$under.five.deaths,
main = 'Under-Five Deaths',
ylab = 'Deaths / 1000 Population')
hist(
life.master$under.five.deaths,
main = 'Histogram of Under-Five Deaths',
xlab = 'Deaths / 1000 Population',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Polio, main = 'Polio (Pol3) Immunization', ylab = '% Coverage Among 1-Year-Olds')
hist(
life.master$Polio,
main = 'Histogram of Pol3 Immunization',
xlab = '% Coverage Among 1-Year-Olds',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(
life.master$Total.expenditure,
main = 'General Government Health Expenditure',
ylab =
'% of Total Government Expenditure',
cex.main = 0.8
)
hist(
life.master$Total.expenditure,
main = 'Histogram of General Government Health Expenditure',
xlab =
'% of Total Government Expenditure',
cex.main = 0.7
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Diphtheria, main = 'DTP3 Immunization', ylab = '% Coverage Among 1-Year-Olds')
hist(
life.master$Diphtheria,
main = 'Histogram of DTP3 Immunization',
xlab = '% Coverage Among 1-Year-Olds',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$HIV.AIDS, main = 'HIV/AIDS (0-4 Years)', ylab = 'Deaths / 1000 Live Births')
hist(
life.master$HIV.AIDS,
main = 'Histogram of HIV/AIDS (0-4 Years)',
xlab = 'Deaths / 1000 Live Births',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$GDP, main = 'GDP per Capita (in USD)', ylab = 'GDP per Capita (in USD)')
hist(life.master$GDP,
main = 'Histogram of GDP per Capita (in USD)',
xlab = 'GDP per Capita (in USD)',
cex.main = 0.9)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Population, main = 'Country Population', ylab = 'Country Population')
hist(
life.master$Population,
main = 'Histogram of Country Population',
xlab = 'Country Population',
cex.main = 0.9
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(
life.master$thinness..1.19.years,
main = 'Prevalence of Thinness (10-19 Years)',
ylab = '%',
cex.main = 0.9
)
hist(
life.master$thinness..1.19.years,
main = 'Histogram of Prevalence of Thinness (10-19 Years)',
xlab = '%',
cex.main = 0.7
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(
life.master$thinness.5.9.years,
main = 'Prevalence of Thinness (5-9 Years)',
ylab = '%',
cex.main = 0.9
)
hist(
life.master$thinness.5.9.years,
main = 'Histogram of Prevalence of Thinness (5-9 Years)',
xlab =
'%',
cex.main = 0.7
)
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Income.composition.of.resources,
main = 'HDI',
ylab = 'HDI')
hist(life.master$Income.composition.of.resources,
main = 'Histogram of HDI',
xlab = 'HDI')
```
```{r, echo=FALSE}
par(mfrow = c(1, 2))
boxplot(life.master$Schooling, main = 'Schooling', ylab = 'Years')
hist(life.master$Schooling, main = 'Histogram of Schooling', xlab = 'Years')
```