Joseph Burks, Sandali Wijeratne, Thin Han 2024-05-07
The data set is from FiveThirtyEight (538), a news website owned by ABC News that focuses on opinion poll analysis, economics, and politics. The data set is a collection of demographic and economic information per state, such as the percentage of non-white residents, the percentage of non-citizens, and median household income, as well as hate crime statistics. The dataset can be obtained here: https://github.com/fivethirtyeight/data/tree/master/hate-crimes
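The data can also be read directly from the FiveThirtyEight GitHub repository instead of a local copy; the sketch below assumes the usual raw.githubusercontent.com path for that repository.
# Hedged alternative: read the CSV straight from the FiveThirtyEight repository
hate_crimes <- readr::read_csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv"
)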
library(tidyverse)
library(factoextra)
library(cluster)
hate_crimes <- readr::read_csv("C:/Users/josep/Downloads/hate_crimes.csv")
head(hate_crimes)
## # A tibble: 6 × 12
## state median_household_inc…¹ share_unemployed_sea…² share_population_in_…³
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 42278 0.06 0.64
## 2 Alaska 67629 0.064 0.63
## 3 Arizona 49254 0.063 0.9
## 4 Arkansas 44922 0.052 0.69
## 5 Californ… 60487 0.059 0.97
## 6 Colorado 60940 0.04 0.8
## # ℹ abbreviated names: ¹median_household_income, ²share_unemployed_seasonal,
## # ³share_population_in_metro_areas
## # ℹ 8 more variables: share_population_with_high_school_degree <dbl>,
## # share_non_citizen <dbl>, share_white_poverty <dbl>, gini_index <dbl>,
## # share_non_white <dbl>, share_voters_voted_trump <dbl>,
## # hate_crimes_per_100k_splc <dbl>, avg_hatecrimes_per_100k_fbi <dbl>
summary_stats <- summarytools::descr(hate_crimes, round.digits = 2, transpose = TRUE)
summarytools::view(summary_stats, method = "render")
## Non-numerical variable(s) ignored: state
Variable | Mean | Std.Dev | Min | Q1 | Median | Q3 | Max | MAD | IQR | CV | Skewness | SE.Skewness | Kurtosis | N.Valid | Pct.Valid
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
avg_hatecrimes_per_100k_fbi | 2.37 | 1.71 | 0.27 | 1.28 | 1.99 | 3.20 | 10.95 | 1.37 | 1.89 | 0.72 | 2.49 | 0.34 | 10.08 | 50 | 98.04
gini_index | 0.45 | 0.02 | 0.42 | 0.44 | 0.45 | 0.47 | 0.53 | 0.02 | 0.03 | 0.05 | 0.95 | 0.33 | 2.13 | 51 | 100.00
hate_crimes_per_100k_splc | 0.30 | 0.25 | 0.07 | 0.14 | 0.23 | 0.36 | 1.52 | 0.15 | 0.21 | 0.83 | 2.64 | 0.35 | 9.29 | 47 | 92.16
median_household_income | 55223.61 | 9208.48 | 35521.00 | 48060.00 | 54916.00 | 60730.00 | 76165.00 | 8619.84 | 12062.00 | 0.17 | 0.19 | 0.33 | -0.58 | 51 | 100.00
share_non_citizen | 0.05 | 0.03 | 0.01 | 0.03 | 0.04 | 0.08 | 0.13 | 0.02 | 0.05 | 0.57 | 0.56 | 0.34 | -0.77 | 48 | 94.12
share_non_white | 0.32 | 0.16 | 0.06 | 0.19 | 0.28 | 0.42 | 0.81 | 0.16 | 0.22 | 0.52 | 0.66 | 0.33 | 0.05 | 51 | 100.00
share_population_in_metro_areas | 0.75 | 0.18 | 0.31 | 0.63 | 0.79 | 0.90 | 1.00 | 0.19 | 0.26 | 0.24 | -0.62 | 0.33 | -0.45 | 51 | 100.00
share_population_with_high_school_degree | 0.87 | 0.03 | 0.80 | 0.84 | 0.87 | 0.90 | 0.92 | 0.04 | 0.06 | 0.04 | -0.37 | 0.33 | -1.13 | 51 | 100.00
share_unemployed_seasonal | 0.05 | 0.01 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.01 | 0.02 | 0.22 | 0.03 | 0.33 | -0.79 | 51 | 100.00
share_voters_voted_trump | 0.49 | 0.12 | 0.04 | 0.41 | 0.49 | 0.58 | 0.70 | 0.12 | 0.16 | 0.24 | -0.92 | 0.33 | 2.13 | 51 | 100.00
share_white_poverty | 0.09 | 0.02 | 0.04 | 0.07 | 0.09 | 0.10 | 0.17 | 0.01 | 0.02 | 0.27 | 0.61 | 0.33 | 0.68 | 51 | 100.00
# Histograms of the 11 numeric variables (columns 2 through 12)
par(mfrow = c(3, 3))
colnames <- dimnames(hate_crimes)[[2]]
for (i in 2:12) {
  hist(as.numeric(unlist(hate_crimes[, i])), main = colnames[i], col = "cyan", xlab = colnames[i])
}
# Bar chart of each numeric variable by state, sorted by value
for (i in 2:12) {
  print(ggplot(hate_crimes) +
    aes(x = reorder(state, as.numeric(unlist(hate_crimes[, i]))),
        y = as.numeric(unlist(hate_crimes[, i]))) +
    geom_col(fill = "blue") +
    coord_flip() +
    labs(title = paste(colnames[i], "by State"),
         x = "State", y = colnames[i]))
}
missing.values <- hate_crimes |>
gather(key = "key", value = "val") |>
mutate(is.missing = is.na(val)) |>
group_by(key, is.missing) |>
summarise(num.missing = n()) |>
filter(is.missing == T) |>
select(-is.missing) |>
arrange(desc(num.missing))
## `summarise()` has grouped output by 'key'. You can override using the `.groups`
## argument.
missing.values |>
ggplot() +
geom_bar(aes(x=key, y = num.missing), stat = "identity",fill = "orange") +
labs(x = "variable", y = "number of missing values", title="Number of missing values") +
theme(axis.text.x = element_text(angle = 45, hjust =1))
Luckily, the data set does not contain many missing values, and many of them appear to be the result of human error. 538's cited source for the share of non-citizens, the Kaiser Family Foundation, does report the percentage of non-citizens for the missing states, so those values can be added to the data set. Similarly, all of the missing hate-crimes-per-100k values from the Southern Poverty Law Center are in fact reported: in the four states with missing data the SPLC actually recorded 0 incidents of hate crime, although this is likely because its data collection relied on people reporting directly to the organization. The last missing data point is the result of Hawaii not sharing hate crime information with the FBI, meaning that the data are not missing completely at random. Since Hawaii is the only state with missing FBI data, we decided to drop the state. We will also drop DC, since it is not a state and including it drastically changes the results.
no_na_crimes <- hate_crimes
# Fill in the three missing share_non_citizen values (rows 20, 25, 42, i.e.
# Maine, Mississippi, and South Dakota in the alphabetical ordering) using the
# Kaiser Family Foundation figures
no_na_crimes[20,6] <- 0.01
no_na_crimes[25,6] <- 0.01
no_na_crimes[42,6] <- 0.03
# Replace the remaining NAs with 0 (the SPLC reported zero incidents in those states)
no_na_crimes[is.na(no_na_crimes)] <- 0
# Hawaii (row 12) did not report hate crimes to the FBI, so restore that value to NA
no_na_crimes[12,12] <- NA
# Drop the District of Columbia (row 9)
no_na_crimes <- no_na_crimes[-c(9),]
missing.values2 <- no_na_crimes |>
gather(key = "key", value = "val") |>
mutate(is.missing = is.na(val)) |>
group_by(key, is.missing) |>
summarise(num.missing = n()) |>
filter(is.missing == T) |>
select(-is.missing) |>
arrange(desc(num.missing))
## `summarise()` has grouped output by 'key'. You can override using the `.groups`
## argument.
missing.values2
## # A tibble: 1 × 2
## # Groups: key [1]
## key num.missing
## <chr> <int>
## 1 avg_hatecrimes_per_100k_fbi 1
lm_crimes <- lm(avg_hatecrimes_per_100k_fbi ~ . -state , data = no_na_crimes)
summary(lm_crimes)
##
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ . - state, data = no_na_crimes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9975 -0.5946 0.1179 0.6704 2.4025
##
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -2.224e+01  1.545e+01  -1.440   0.1581
## median_household_income                   3.864e-05  4.113e-05   0.939   0.3535
## share_unemployed_seasonal                 9.279e+00  2.249e+01   0.413   0.6822
## share_population_in_metro_areas          -6.511e-01  1.558e+00  -0.418   0.6784
## share_population_with_high_school_degree  9.709e+00  1.145e+01   0.848   0.4019
## share_non_citizen                         2.284e+01  1.010e+01   2.261   0.0295 *
## share_white_poverty                       1.065e+01  1.362e+01   0.782   0.4392
## gini_index                                2.414e+01  1.445e+01   1.670   0.1031
## share_non_white                          -3.846e+00  2.330e+00  -1.651   0.1071
## share_voters_voted_trump                  3.173e+00  2.921e+00   1.086   0.2841
## hate_crimes_per_100k_splc                 1.518e+00  1.153e+00   1.317   0.1956
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 38 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3524, Adjusted R-squared: 0.1819
## F-statistic: 2.068 on 10 and 38 DF, p-value: 0.05261
The only significant predictor is share_non_citizen, with a p-value of 0.0295 and a slope of 22.84. The R^2 is about 0.3524 and was obtained from regressing the FBI hate crime rate on all other variables. This R^2 is very small, and stepwise variable selection can only remove predictors, so it would lower the (already small) multiple R^2, even if it improved the adjusted R^2.
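To illustrate this, the hedged sketch below runs AIC-based stepwise selection on the complete cases; it is not part of the original analysis, so the selected model may differ from what is described here.
# Hedged sketch: stepwise (AIC) selection, refit on complete cases
complete_crimes <- tidyr::drop_na(no_na_crimes)
full_fit <- lm(avg_hatecrimes_per_100k_fbi ~ . - state, data = complete_crimes)
step_fit <- step(full_fit, direction = "both", trace = 0)
summary(step_fit)$r.squared      # multiple R^2 can only stay the same or shrink as predictors are dropped
summary(step_fit)$adj.r.squared  # adjusted R^2 may improve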
plot(lm_crimes)
The Q-Q plot looks mostly alright, though there is some deviation from the line in the right tail. The residual plot has a U-shape, indicating that the assumption of linearity or of homoscedasticity may be violated.
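For a more formal check of the constant-variance assumption, the Breusch-Pagan test is one option; this is a hedged sketch using the lmtest package and is not part of the original analysis.
# Hedged sketch: Breusch-Pagan test for heteroscedasticity (requires the lmtest package)
library(lmtest)
bptest(lm_crimes)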
plot(avg_hatecrimes_per_100k_fbi ~ share_non_citizen, data = no_na_crimes)
There is no obvious pattern in avg_hatecrimes_per_100k_fbi vs. share_non_citizen, so polynomial regression would likely not improve on linear regression. None of the variables appear to be good predictors of hate crimes, indicating that regression is likely not a good technique here, at least with hate crimes as the response.
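As a quick check of that claim, the hedged sketch below fits a quadratic in share_non_citizen; it is illustrative only and not part of the original analysis.
# Hedged sketch: add a quadratic term in share_non_citizen
quad_fit <- lm(avg_hatecrimes_per_100k_fbi ~ share_non_citizen + I(share_non_citizen^2),
               data = no_na_crimes)
summary(quad_fit)$r.squared  # compare with the linear model's R^2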
# Drop the remaining incomplete row (Hawaii) before PCA
no_na_crimes <- drop_na(no_na_crimes)
# Standardize the numeric columns and take the eigendecomposition of their
# covariance matrix (the correlation matrix of the original variables)
scaled_crimes <- apply(no_na_crimes[,2:12], 2, scale)
crimes_cov <- cov(scaled_crimes)
crimes_eigen <- eigen(crimes_cov)
str(crimes_eigen)
## List of 2
## $ values : num [1:11] 3.757 3.342 1.388 0.799 0.481 ...
## $ vectors: num [1:11, 1:11] -0.0471 0.3417 0.4118 -0.3466 0.4089 ...
## - attr(*, "class")= chr "eigen"
According to Kaiser's rule, the first three principal components are enough, since they are the only ones with eigenvalues greater than 1.
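Kaiser's rule can be checked with a single line:
# Number of eigenvalues greater than 1 (Kaiser's rule)
sum(crimes_eigen$values > 1)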
phi <-crimes_eigen$vectors[,1:3]
colnames(phi) <- c("PC1","PC2","PC3")
pc1 <- scaled_crimes %*% phi[,1]
pc2 <- scaled_crimes %*% phi[,2]
pc3 <- scaled_crimes %*% phi[,3]
PC <- data.frame(State = no_na_crimes$state, pc1, pc2, pc3)
head(PC)
## State pc1 pc2 pc3
## 1 Alabama 0.447603881 -3.1780209 0.1609234
## 2 Alaska -0.797816329 1.2190309 -1.7994691
## 3 Arizona 2.376928173 -0.1527681 0.1960283
## 4 Arkansas 0.012088819 -2.6144792 -0.6038205
## 5 California 4.296824306 0.8709089 -0.3089248
## 6 Colorado -0.004830473 1.8753252 0.2237174
results <- princomp(scaled_crimes, fix_sign = FALSE)
fviz_pca_biplot(results)
The biggest contributions to the first principal component are positive contributions from median_household_income and both hate crime statistics, and a negative contribution from share_white_poverty. This indicates that the first principal component is a combined measure of the hate crime rate and average household economic status. The biggest contributions to the second component come from share_non_white, share_non_citizen, and share_population_in_metro_areas. Together the first two principal components explain nearly 65 percent of the variability in the data.
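The loadings behind this interpretation can be inspected directly; the short sketch below simply labels the retained eigenvectors with the variable names and is not part of the original analysis.
# Label and round the first three loading vectors for easier reading
loadings_tbl <- phi
rownames(loadings_tbl) <- colnames(scaled_crimes)
round(loadings_tbl, 2)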
ggplot(PC, aes(pc1, pc2)) +
modelr::geom_ref_line(h = 0) +
modelr::geom_ref_line(v = 0) +
geom_text(aes(label = State), size = 3) +
xlab("First Principal Component") +
ylab("Second Principal Component") +
ggtitle("First Two Principal Components of USArrests Data")
West Virginia appears to have below-average household economic status and hate crime rate, as well as a below-average share of non-white residents and non-citizens. Maryland has an above-average economy and above-average hate crime rates, percentage of non-citizens, and urbanization.
PVE <- crimes_eigen$values/sum(crimes_eigen$values)
round(PVE,3)
## [1] 0.342 0.304 0.126 0.073 0.044 0.032 0.027 0.021 0.012 0.011 0.009
crimes_eigen$values
## [1] 3.75668189 3.34231967 1.38757129 0.79936774 0.48120612 0.34985911
## [7] 0.30176244 0.23161315 0.13017449 0.12371693 0.09572716
The first 3 principal components explain roughly 77 percent of the variability of the data and are the only PCs with eigenvalues greater than 1, so 3 principal components are enough to represent the data set while significantly reducing its dimension.
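The cumulative proportions provide a quick check of that figure:
# Cumulative proportion of variance explained by the principal components
round(cumsum(PVE), 3)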
# Re-read the data with state names as row names (column indices shift down by
# one, so share_non_citizen is now column 5) and re-apply the KFF fill-ins
hate_crimes <- read.csv("C:/Users/josep/Downloads/hate_crimes.csv", header = TRUE, row.names = "state")
hate_crimes[20,5] <- 0.01
hate_crimes[25,5] <- 0.01
hate_crimes[42,5] <- 0.03
#no_na_crimes <- drop_na(no_na_crimes)
#no_na_crimes
#scaled_crimes <- scale(no_na_crimes)
head(scaled_crimes)
## median_household_income share_unemployed_seasonal
## [1,] -1.3890165 1.0081757
## [2,] 1.4616133 1.3933656
## [3,] -0.6045901 1.2970681
## [4,] -1.0917081 0.2377958
## [5,] 0.6585208 0.9118782
## [6,] 0.7094590 -0.9177739
## share_population_in_metro_areas share_population_with_high_school_degree
## [1,] -0.5772940 -1.3769565
## [2,] -0.6323279 1.3265294
## [3,] 0.8535865 -0.7664919
## [4,] -0.3021247 -1.2897472
## [5,] 1.2388236 -1.8130026
## [6,] 0.3032478 0.7160648
## share_non_citizen share_white_poverty gini_index share_non_white
## [1,] -0.9944839 1.1214842 1.0905787 0.35147457
## [2,] -0.3447544 -1.3954345 -1.7190090 0.83563431
## [3,] 1.6044341 -0.1369752 0.1353189 1.31979406
## [4,] -0.3447544 1.1214842 0.3038942 -0.27101653
## [5,] 2.5790284 -0.1369752 1.0343870 2.14978219
## [6,] 0.3049751 -0.9759481 0.2477024 0.07481186
## share_voters_voted_trump hate_crimes_per_100k_splc
## [1,] 1.30002752 -0.72949195
## [2,] 0.27589009 -0.63259858
## [3,] -0.03135115 -0.19103327
## [4,] 0.99278629 -1.03681376
## [5,] -1.77238479 -0.02602561
## [6,] -0.64583361 0.70315910
## avg_hatecrimes_per_100k_fbi
## [1,] -0.3224643
## [2,] -0.4475386
## [3,] 1.0205224
## [4,] -1.1054414
## [5,] 0.1717623
## [6,] 0.5115387
scaled_crimes <- scaled_crimes[-9,]
We use a plot of the total within-cluster sum of squares (WSS) versus the number of clusters to find a suitable number of clusters for our data set.
fviz_nbclust(scaled_crimes, kmeans, nstart = 25, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)
From the graph, we conclude that the best number of clusters is four. Therefore, we apply the k-means algorithm to the data set with four clusters. The following print-out shows the cluster assignments.
set.seed(314)
km.res <- kmeans(scaled_crimes,4,nstart=25)
print(km.res$cluster)
## [1] 3 1 2 3 2 4 4 4 2 1 2 1 1 1 3 3 1 4 4 4 4 3 1 1 1 2 1 2 2 2 3 1 1 3 4 1 2 3
## [39] 1 3 2 1 1 4 4 3 1 1
We can visualize the clusters in two-dimensional space as follows.
fviz_cluster(km.res,
data = scaled_crimes,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)
We apply one more method, k-medoids (PAM) clustering, to find clusters in our data set.
fviz_nbclust(scaled_crimes, pam, method = "silhouette") +
theme_classic()
Using the average silhouette width method, we get the same result: four clusters is best for this data set.
pam.res <- pam(scaled_crimes, 4)
kmedoid.df <- cbind(scaled_crimes, cluster = pam.res$cluster)
fviz_cluster(pam.res,
data = kmedoid.df,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)
From k-medoids clustering, South Carolina, Wisconsin, Virginia, and Washington come out as the medoids.
med_sil <- eclust(scaled_crimes, "pam", k = 4, hc_metric = "euclidean",
hc_method = "ward.D2", graph = FALSE)
fviz_silhouette(med_sil, palette = "jco",
ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 10 0.32
## 2 2 18 0.21
## 3 3 9 0.23
## 4 4 11 0.16
The average silhouette width for k-medoids is 0.23, which is very small and indicates that clustering may not be the best way to explain the structure in the data.
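That overall width can also be read directly from the clustering object; a one-line sketch, assuming the silinfo component attached by eclust.
# Overall average silhouette width of the k-medoids solution
med_sil$silinfo$avg.width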
row.names(PC) <- (PC$State)
PC <- subset(PC, select = -c(State))
pc.dist <- dist(PC[,1:2], method = "euclidean")
as.matrix(pc.dist)[1:6,1:6]
## Alabama Alaska Arizona Arkansas California Colorado
## Alabama 0.0000000 4.570026 3.588098 0.7122167 5.586621 5.073559
## Alaska 4.5700258 0.000000 3.458444 3.9181304 5.106521 1.029344
## Arizona 3.5880980 3.458444 0.000000 3.4135739 2.175756 3.128248
## Arkansas 0.7122167 3.918130 3.413574 0.0000000 5.523304 4.489836
## California 5.5866207 5.106521 2.175756 5.5233041 0.000000 4.417362
## Colorado 5.0735593 1.029344 3.128248 4.4898363 4.417362 0.000000
pc.hc <- hclust(d = pc.dist, method = "ward.D2")
fviz_dend(pc.hc, cex = 0.5)
Using Ward's method, there seem to be four main clusters of states.
grp <- cutree(pc.hc, k = 4)
fviz_cluster(list(data = PC[,1:2], cluster = grp),
palette = c("blue","orange","red","pink"),
ellipse.type = "convex",
repel = TRUE,
show.clust.cent = FALSE, ggtheme = theme_minimal())
res.coph <- cophenetic(pc.hc)
cor(pc.dist,res.coph)
## [1] 0.653878
The correlation between the cophenetic distances and the original distances is around 0.65, which is not especially high, so the clustering solution may not accurately reflect the data.
res.hc2 <- hclust(pc.dist,method = "average")
cor(pc.dist, cophenetic(res.hc2))
## [1] 0.684254
fviz_dend(res.hc2, cex = 0.5)
The correlation between the cophenetic distances from average linkage and the original distances is around 0.68, which is larger than the one from Ward's method, although it is still not that large. Also, the dendrogram obtained with average linkage does not have as clear a place to "cut" the tree.
fviz_nbclust(PC[,1:2], kmeans, nstart = 25, method = "wss")
The plot of WSS as a function of k indicates that 4 is the optimal number of clusters for k-means.
set.seed(42)
km.res <- kmeans(PC, 4, nstart = 25)
print(km.res)
## K-means clustering with 4 clusters of sizes 10, 11, 10, 18
##
## Cluster means:
## pc1 pc2 pc3
## 1 2.5411565 -0.1175661 -0.5308790
## 2 0.7258250 2.0951723 0.6852901
## 3 0.2082508 -2.7048290 0.3607785
## 4 -1.9710082 0.2876142 -0.3242881
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 3 4 1 3 1
## Colorado Connecticut Delaware Florida Georgia
## 2 2 2 1 1
## Idaho Illinois Indiana Iowa Kansas
## 4 1 4 4 4
## Kentucky Louisiana Maine Maryland Massachusetts
## 3 3 4 2 2
## Michigan Minnesota Mississippi Missouri Montana
## 2 2 3 4 4
## Nebraska Nevada New Hampshire New Jersey New Mexico
## 4 1 4 2 1
## New York North Carolina North Dakota Ohio Oklahoma
## 1 3 4 4 3
## Oregon Pennsylvania Rhode Island South Carolina South Dakota
## 2 4 1 3 4
## Tennessee Texas Utah Vermont Virginia
## 3 1 4 4 2
## Washington West Virginia Wisconsin Wyoming
## 2 3 4 4
##
## Within cluster sum of squares by cluster:
## [1] 16.77693 34.56740 27.81483 53.21098
## (between_SS / total_SS = 67.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The clusters obtained explain roughly 67.5 percent of the variation in the principal component scores (between_SS / total_SS). This is a decent amount; however, it is important to remember that the clustering was done on the principal component scores rather than the original variables, and the three retained components themselves explain only about 77 percent of the original variance.
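That percentage can be read straight from the kmeans object:
# Proportion of total variation in the PC scores explained by the clustering
km.res$betweenss / km.res$totss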
fviz_cluster(km.res, data = PC[,1:2], xlab = "PC1 (34.2%)", ylab = "PC2 (30.4%)")
The clusters do a good job of separating observations on the plane spanned by the first two principal components. However, the first two principal components do not explain enough of the variance in the data to justify clustering with only them in mind.
mean_sil <- eclust(PC[,1:2], "kmeans", k = 4, hc_metric = "euclidean",
hc_method = "ward.D2", graph = FALSE)
fviz_silhouette(mean_sil, palette = "jco",
ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 11 0.40
## 2 2 10 0.46
## 3 3 18 0.41
## 4 4 10 0.52
The average silhouette width is 0.44, and no cluster contains points with negative silhouette widths, i.e., none appear to be assigned "incorrectly" according to this metric. Still, the width is not very large, suggesting that the clusters may not be well separated.
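These figures can be double-checked from the clustering object; a small sketch, assuming the silinfo component that eclust attaches to its result.
# Overall average width and count of negative (possibly misassigned) silhouette values
mean_sil$silinfo$avg.width
sum(mean_sil$silinfo$widths[, "sil_width"] < 0)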
# Performing k-medoids clustering on the principal component scores
set.seed(123)
k <- 4
kmedoids_clusters <- pam(PC[,1:2], k = k)
#Clusters
fviz_cluster(kmedoids_clusters, data = PC[,1:2], geom = "point",
             stand = FALSE, ellipse.type = "convex", ellipse = TRUE,
             repel = TRUE)
The clusters provide a good amount of separation, and they are similar to the clusters obtained through k-means, although a few points are assigned to different clusters.
PC_with_clusters <- cbind(PC[,1:2], Cluster = as.factor(kmedoids_clusters$clustering))
head(PC_with_clusters)
## pc1 pc2 Cluster
## Alabama 0.447603881 -3.1780209 1
## Alaska -0.797816329 1.2190309 2
## Arizona 2.376928173 -0.1527681 3
## Arkansas 0.012088819 -2.6144792 1
## California 4.296824306 0.8709089 3
## Colorado -0.004830473 1.8753252 2
med_sil <- eclust(PC[,1:2], "pam", k = 4, hc_metric = "euclidean",
hc_method = "ward.D2", graph = FALSE)
fviz_silhouette(med_sil, palette = "jco",
ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 10 0.46
## 2 2 13 0.33
## 3 3 10 0.52
## 4 4 16 0.46
The k-medoids method does not "misclassify" any points and has an average silhouette width of 0.44, the same as the width obtained through k-means.
library("clustertend")
## Package `clustertend` is deprecated. Use package `hopkins` instead.
set.seed(123)
hopkins(PC, n = nrow(PC)-1)
## Warning: Package `clustertend` is deprecated. Use package `hopkins` instead.
## $H
## [1] 0.4299043
The Hopkins statistic is fairly close to 0.5 for the principal component scores, indicating that the data likely contain no real cluster structure.
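Since clustertend is deprecated, the same check can be repeated with its suggested replacement; this is a hedged sketch assuming the hopkins package and its hopkins() function, and values near 0.5 again indicate no apparent cluster structure.
# Hedged sketch: Hopkins statistic via the replacement package suggested by the warning
library(hopkins)
hopkins(PC, m = nrow(PC) - 1)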
fviz_dist(dist(PC), show_labels = FALSE) +
labs(title = "PCs")
The VAT (visual assessment of cluster tendency) plot suggests there could be two clusters in the data, although this is a subjective reading.
library("NbClust")
nb <- NbClust(PC, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 5 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
According to the majority of indices, 3 is the best number of clusters, with 5 as the next most frequently proposed.
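The vote counts above can be tabulated directly from the NbClust object; a short sketch, assuming the Best.nc component of the returned list.
# Tally how many indices proposed each number of clusters
table(nb$Best.nc["Number_clusters", ])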
library(clValid)
## Warning: package 'clValid' was built under R version 4.3.3
clmethods <- c("hierarchical", "kmeans", "pam")
intern <- clValid(PC, nClust = 2:6,
clMethods = clmethods, validation = "internal")
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 10.8762 16.1933 21.2274 24.0313 27.9492
## Dunn 0.1609 0.1970 0.2596 0.2813 0.2813
## Silhouette 0.2950 0.3356 0.3515 0.3175 0.3101
## kmeans Connectivity 12.6040 21.8472 21.2274 24.0313 27.9492
## Dunn 0.0542 0.1191 0.2596 0.2813 0.2813
## Silhouette 0.2936 0.3374 0.3515 0.3175 0.3101
## pam Connectivity 22.1060 18.9738 22.5385 34.5937 40.7294
## Dunn 0.0886 0.1904 0.0874 0.0874 0.0874
## Silhouette 0.2752 0.3392 0.3498 0.3001 0.2795
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 10.8762 hierarchical 2
## Dunn 0.2813 hierarchical 5
## Silhouette 0.3515 hierarchical 4
The three metrics do not agree on the best approach. According to connectivity, hierarchical clustering with 2 clusters is best; according to the Dunn index, hierarchical clustering with 5 clusters is best; and according to silhouette, hierarchical clustering with 4 clusters is best.
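The optimal-scores table can also be pulled out programmatically with clValid's optimalScores() accessor:
# Best score, method, and number of clusters for each validation measure
optimalScores(intern)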
The three statistical methods we used in this report had varying levels of success in representing patterns in the data. Multiple linear regression resulted in a very small R^2, indicating that there is very little linear relationship between hate crimes per 100k and the rest of the variables. Principal component analysis was much more successful: the first three principal components were able to explain roughly 77 percent of the variance in the data, effectively reducing the dimension of the data. Clustering had little success. Clustering on the original data yielded groupings with small average silhouette widths. Clustering with respect to the first two principal components had larger average silhouette widths, but the results must be taken with a grain of salt since the first two principal components only explain roughly 65 percent of the variability in the data. Overall, principal component analysis by itself seems to be the most effective method for representing this data set.