A repo showing how to perform multiple linear regression analysis in R. It uses a sample heart disease dataset to analyse the relationship between heart disease, biking and smoking.
- Get the summary of the heart.data dataset to check that it has been read correctly.
summary(heart.data)
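The steps in this README assume that heart.data already exists in your R session. A minimal sketch of loading it follows, assuming the data lives in a CSV file with the columns biking, smoking and heart.disease. The file path and the values here are made up: the example writes a small simulated file to a temporary location so that it runs on its own, and you would point csv.path at your own copy of the data instead.

```r
# Simulated stand-in for the real file (assumption: three numeric columns).
set.seed(1)
demo <- data.frame(
  biking  = runif(20, 1, 75),
  smoking = runif(20, 0.5, 30)
)
demo$heart.disease <- 15 - 0.2 * demo$biking + 0.178 * demo$smoking +
  rnorm(20, sd = 0.65)

# Hypothetical path: replace with the location of your own CSV file.
csv.path <- tempfile(fileext = ".csv")
write.csv(demo, csv.path, row.names = FALSE)

# Read the file back in and inspect it.
heart.data.demo <- read.csv(csv.path, header = TRUE)
str(heart.data.demo)      # column names and types
summary(heart.data.demo)  # min, quartiles, mean and max per column
```

If all three columns show up as numeric with a sensible range, the file has most likely been read correctly.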
- Independence of Observations - Using the cor() function to check the relationship between our independent variables.
cor(heart.data$biking, heart.data$smoking)
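As a rough illustration of how this check can be read, the sketch below runs the same call on simulated predictors (sim.data is a hypothetical stand-in, not the real dataset). A correlation close to 0 suggests the two predictors are not strongly related, so both can reasonably go into the same model. cor.test() is one way to attach a confidence interval to the estimate.

```r
set.seed(42)
# Simulated stand-in for the two predictors (assumption, not the real data).
sim.data <- data.frame(
  biking  = runif(500, 1, 75),
  smoking = runif(500, 0.5, 30)
)

# Pearson correlation between the two predictors.
r <- cor(sim.data$biking, sim.data$smoking)
r

# Same estimate, with a 95% confidence interval and a p-value.
ct <- cor.test(sim.data$biking, sim.data$smoking)
ct$conf.int

# Correlation matrix, handy when there are more than two predictors.
cor(sim.data)
```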
- Normality - Using the hist() function to test whether our dependent variable follows a normal distribution.
hist(heart.data$heart.disease)
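A histogram is only a visual check. As a complementary sketch, a normal Q-Q plot and a Shapiro-Wilk test can be run on the same variable. The vector sim.hd below is simulated and only stands in for heart.data$heart.disease.

```r
set.seed(42)
# Simulated stand-in for heart.data$heart.disease (assumption, not the real data).
sim.hd <- rnorm(500, mean = 10, sd = 4.5)

# Histogram: look for a roughly bell-shaped distribution.
hist(sim.hd, main = "Simulated heart disease rates", xlab = "Heart disease (%)")

# Q-Q plot: points close to the reference line suggest an approximately normal shape.
qqnorm(sim.hd)
qqline(sim.hd)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(sim.hd)
sw$p.value
```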
- Linearity - Checking two scatterplots: one for biking and heart disease, and one for smoking and heart disease.
plot(heart.disease ~ biking, data=heart.data)
plot(heart.disease ~ smoking, data=heart.data)
- Fitting the linear model to check whether there is a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns.
heart.disease.lm<-lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)
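The printed summary contains everything needed for reporting, but the individual numbers can also be pulled out of the fitted model. The sketch below does this on simulated data: sim.data and sim.lm are hypothetical stand-ins, generated from the rounded coefficients reported at the end of this README, so the estimates will be close to, but not identical with, those from the real dataset.

```r
set.seed(42)
n <- 500
# Simulated stand-in for heart.data (assumption, not the real data).
sim.data <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim.data$heart.disease <- 15 - 0.2 * sim.data$biking +
  0.178 * sim.data$smoking + rnorm(n, sd = 0.65)

sim.lm <- lm(heart.disease ~ biking + smoking, data = sim.data)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef.table <- coef(summary(sim.lm))
coef.table

# 95% confidence intervals for the intercept and both slopes.
confint(sim.lm, level = 0.95)

# Proportion of variance in heart disease explained by the model.
summary(sim.lm)$r.squared
```

The Estimate and Std. Error columns are the values that are reported in a results statement as the effect size and its standard error.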
- Before proceeding with data visualization, we need to ensure that our model fits the homoscedasticity assumption of the linear model.
par(mfrow=c(2,2))
plot(heart.disease.lm)
par(mfrow=c(1,1))
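The diagnostic plots are a visual check: the residuals should be spread evenly around zero with no clear pattern. As a numeric complement, here is a minimal sketch of the studentized (Koenker) version of the Breusch-Pagan test written in base R. It is not part of the original analysis, and sim.data and sim.lm are simulated stand-ins for the real data and model.

```r
set.seed(42)
n <- 500
# Simulated stand-in for heart.data (assumption, not the real data).
sim.data <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim.data$heart.disease <- 15 - 0.2 * sim.data$biking +
  0.178 * sim.data$smoking + rnorm(n, sd = 0.65)
sim.lm <- lm(heart.disease ~ biking + smoking, data = sim.data)

# Auxiliary regression: squared residuals on the same predictors.
sim.data$res.sq <- residuals(sim.lm)^2
aux.lm <- lm(res.sq ~ biking + smoking, data = sim.data)

# n * R^2 is approximately chi-squared with df = number of predictors (2 here).
bp.stat <- n * summary(aux.lm)$r.squared
bp.p <- pchisq(bp.stat, df = 2, lower.tail = FALSE)

# A large p-value means no evidence against constant residual variance.
c(statistic = bp.stat, p.value = bp.p)
```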
- Plotting the relationship between biking and heart disease at different levels of smoking. Smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.
- Creating a new dataframe with the information needed to plot the model - This will not create anything new in your console, but you should see a new data frame appear in the Environment tab. Click on it to view it.
plotting.data<-expand.grid(
biking = seq(min(heart.data$biking), max(heart.data$biking), length.out=30),
smoking=c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking)))
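To see what expand.grid() does, here is a small sketch with made-up values: it returns one row for every combination of the vectors it is given.

```r
# Small grid: 3 biking values crossed with 2 smoking values (made-up numbers).
demo.grid <- expand.grid(
  biking  = c(10, 40, 70),
  smoking = c(5, 25)
)

demo.grid        # biking varies fastest, smoking slowest
nrow(demo.grid)  # 3 * 2 = 6 rows
```

By the same logic, plotting.data has 30 biking values crossed with 3 smoking values, which gives 90 rows.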
- Predicting the values of heart disease based on our linear model - Saving our ‘predicted y’ values as a new column in the dataset we've created
plotting.data$predicted.y <- predict.lm(heart.disease.lm, newdata=plotting.data)
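For a model with no transformed terms, the predicted values are simply the intercept plus each coefficient multiplied by its predictor. The sketch below checks this on simulated data: sim.data, sim.lm and new.towns are hypothetical names and values, not the real dataset.

```r
set.seed(42)
n <- 500
# Simulated stand-in for heart.data (assumption, not the real data).
sim.data <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim.data$heart.disease <- 15 - 0.2 * sim.data$biking +
  0.178 * sim.data$smoking + rnorm(n, sd = 0.65)
sim.lm <- lm(heart.disease ~ biking + smoking, data = sim.data)

# Two hypothetical towns to predict for.
new.towns <- data.frame(biking = c(10, 40), smoking = c(5, 25))
pred <- predict(sim.lm, newdata = new.towns)

# The same predictions computed by hand from the fitted coefficients.
b <- coef(sim.lm)
by.hand <- b["(Intercept)"] + b["biking"] * new.towns$biking +
  b["smoking"] * new.towns$smoking

# TRUE when both approaches agree up to floating point tolerance.
all.equal(unname(pred), unname(by.hand))
```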
- Rounding the smoking numbers to two decimal values - This will make the legend easier to read later on.
plotting.data$smoking <- round(plotting.data$smoking, digits = 2)
- Changing the smoking variable into a factor - This allows us to plot the relationship between biking and heart disease at each of the three levels of smoking we chose.
plotting.data$smoking <- as.factor(plotting.data$smoking)
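- Plotting the original data - Install the ggplot2 package if it is not already installed:
install.packages("ggplot2")
- Then load the package:
library(ggplot2)
- Lastly, create the scatterplot: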
heart.plot <- ggplot(heart.data, aes(x=biking, y=heart.disease)) +
geom_point()
heart.plot
- Adding the regression lines
heart.plot <- heart.plot +
geom_line(data=plotting.data, aes(x=biking, y=predicted.y, color=smoking), size=1.25)
heart.plot
- Making the graph ready for publication
heart.plot <-
heart.plot +
theme_bw() +
labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
x = "Biking to work (% of population)",
y = "Heart disease (% of population)",
color = "Smoking \n (% of population)")
heart.plot
- Adding our regression model to the graph
heart.plot + annotate(geom="text", x=30, y=1.75, label="heart disease = 15 + (-0.2*biking) + (0.178*smoking)")
- In our survey of 500 towns, we found significant relationships between the frequency of biking to work and the frequency of heart disease, and between the frequency of smoking and the frequency of heart disease (p < 0.001 for both). Specifically, we found a 0.2% decrease (± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a 0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in smoking.
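As a worked example of reading these numbers, the sketch below plugs values into the rounded regression equation reported above. The function name rate.from.equation and the input values (30% biking, 10% smoking) are made up for illustration.

```r
# Rounded regression equation from the results above.
# rate.from.equation is a hypothetical helper, not part of the original analysis.
rate.from.equation <- function(biking, smoking) {
  15 + (-0.2 * biking) + (0.178 * smoking)
}

# A town where 30% bike to work and 10% smoke: 15 - 6 + 1.78 = 10.78
rate.from.equation(biking = 30, smoking = 10)

# Effect of one more percentage point of biking, smoking held constant: -0.2
rate.from.equation(31, 10) - rate.from.equation(30, 10)

# Effect of one more percentage point of smoking, biking held constant: 0.178
rate.from.equation(30, 11) - rate.from.equation(30, 10)
```

Because the model has no interaction term, these one-unit effects are the same whatever starting values are chosen.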