-
Notifications
You must be signed in to change notification settings - Fork 0
/
Gun Violence US.Rmd
933 lines (690 loc) · 44.3 KB
/
Gun Violence US.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
---
title: "Gun violence in the US"
output:
rmdformats::downcute:
use_bookdown: TRUE
lightbox: TRUE
gallery: TRUE
fig_caption: TRUE
code_folding: hide
---
```{r, include = FALSE }
knitr::opts_chunk$set(
echo = TRUE, warning = FALSE
)
```
# Executive summary
The US is one of the few countries in which the right to bear arms is constitutionally protected. A consequence of this is that the level of gun voilence is very high compared to other countries. After analyzing the dataset, which was downloaded from <a href="https://www.gunviolencearchive.org//"target="_blank">gunviolencearchive.org</a>, my main findings are:
- The trend is unfortunately upward. The number of incidents increased year-on-year from 2013 to 2017. The number of incidents in Q1 2018 is lower that this number in Q1 2017, which can be seen as a cautiously positive sign.
-On average, the number of incidents is lower in autumn/winter than in spring/summer. Weekends are more dangerous than weekdays, but this was obvious to me. The most dangerous dates are January 1st (New Year’s Day), and July 4th (Independence Day, with many incidents also happening the day after, July 5th).
- I enriched the data with population figures from the US Census Bureau. After having done that, Alaska came out as the (to me surprising) state with the relatively highest number of incidents. Besides Alaska, Louisiana and Delaware are also states with high average incident numbers. The safest state regarding the relative number of incidents is Hawaii. However, Alaska has a relatively low ratio of victims (number of people killed + injured) per incident. This seems to be caused by a high percentage of “non-shooting incidents” (I assume this mostly means incidents where people have only threatened to shoot). When ranking the states on the number of victims relative to the population size, Louisiana is the most dangerous state. In this statistic, Illinois jumps to second place, which is due to the state having the highest ratio of victims per incident.
- Of the big cities with at least 600,000 inhabitants, Baltimore, Washington DC, and Chicago have the highest relative numbers of incidents. Big cities are generally more dangerous than the country side regarding gun violence, as the incident rates are (much) higher.
- The most severe incident by far was the shooting at the Strip in Las Vegas (from the Mandalay Bay hotel), with a total of 470 people that were killed or injured.
- The incident characteristics offer a wealth of information. Altogether 110 categories were used to describe the incidents. On a high level, it for instance explains the low victims/incidents ratio of Alaska. However, categories such as terrorism, gang, and drug involvement were also recorded.
- I have included a map that summarizes drug related incidents. It turns out that Baltimore has many drug related incidents when compared to the size of the city (Internet quote: In a city of 645,000, the Baltimore Department of Health estimates there are 60,000 drug addicts, with as many as 48,000 of them hooked on heroin).
- Regarding gang related accidents, Chicago is the city with by far the most incidents.
- 30 incidents were labeled as “Terrorism involvement”, and I displayed those on a map too.
One of my goals of this analysis was to learn how to make interactive maps in R, and the my first interactive ```Leaflet``` maps are included in this version. In addition, I have also converted some ```ggplots``` to ```plotly``` objects, which enables interactive showing of labels with exact information.
# Introduction
Gun Violence Archive (GVA) is a not for profit corporation formed in 2013 to provide free online public access to information about gun-related violence in the United States. GVA states that America is an exceptional country when it comes to guns. It’s one of the few countries in which the right to bear arms is constitutionally protected. But America’s relationship with guns is unique in another crucial way: Among developed nations, the US is far and away the most homicidal. For instance, America has six times as many firearm homicides as Canada, and nearly 16 times as many as Germany.
The data in this dataset was downloaded from <a href="http://www.gunviolencearchive.org//"target="_blank">gunviolencearchive.org</a>. The CSV file contains data for all recorded gun violence incidents in the US between January 2013 and March 2018. The data consist of 240k incidents, with detailed information about each incident.
# Loading and Exploring Data
## Loading libraries required and reading the data into R
Loading R packages used besides base R.
```{r loadlib, echo=T, results='hide', message=F, warning=F}
library(knitr)
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(stringr)
library(gridExtra)
library(scales)
library(lubridate)
library(ggrepel)
library(leaflet)
library(rgdal)
library(splitstackshape)
```
Loading data and adding the large, missing Las Vegas incident in 2017 manually. The dataset poster explains why he took this incident out in this discussion topic: <a href="https://www.kaggle.com/jameslko/gun-violence-data/discussion/55307//"target="_blank">Las Vegas incident missing</a>.
```{r}
gun <-
as.tibble(data.table::fread(
str_c("gun-violence-data_01-2013_03-2018.csv"),
header = TRUE,
stringsAsFactors = FALSE,
na.strings = c("NA", "")
))
lasvegas <-
list(
999999,'2017-10-01','Nevada','Las Vegas','Mandalay Bay 3950 Blvd S',
59,411,'https://en.wikipedia.org/wiki/2017_Las_Vegas_shooting',
'https://en.wikipedia.org/wiki/2017_Las_Vegas_shooting',
NA,NA,NA,NA,NA,36.095,'Mandalay Bay Hotel',-115.171667,47,
'Route 91 Harvest Festiva; concert, open fire from 32nd floor. 47 guns seized; TOTAL:59 kill, 441 inj, number shot TBD,girlfriend Marilou Danley POI',
NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
)
gun <- rbind(gun, lasvegas)
```
## Data size and structure
```{r}
glimpse(gun)
```
Many variables need to be reformatted before they can be used. I chose to do this by variable as soon as I need the variable for a particular exploration.
# Data exploration
## Comparing number of incidents by year, quarter, month, and weekday
First of all, I need to reformat the Date variable.
```{r}
gun$date <- ymd(gun$date)
str(gun$date)
```
```{r}
summary(gun$date)
```
### By year
It seems as if most incidents in 2013 were not recorded. In addition, 2018 is not a full year of data as the latest recorded incident was on May 31st 2018. As this means exactly one quarter, I am interested if the trend is still upward if I only look at the Q1s of 2014-2018. I will look into this in the next section.
```{r}
gun$year <- year(gun$date) #extract year from date using the lubridate year() function
gun %>%
ggplot(aes(x = as.factor(year))) +
geom_bar(stat = 'count', fill = 'red') +
scale_y_continuous(labels = comma) +
geom_label(stat = "count", aes(label = ..count.., y = ..count..))
```
### By quarter; what does Q1 2018 tell us?
There seems to be some sort of ‘seasonality’ with Q1 and Q4 generally having lower numbers of incidents than Q2 and Q3. The second graph shows that Q1 2018 holds less incidents than Q1 2017. This could be seen as a cautiously positive sign. However, one should realize that this number is still very high when compared to other countries (relatively).
```{r}
gun$quarter <- quarter(gun$date) #extract Quarters from date
q1 <-
gun %>% filter(year != 2013) %>%
select(year, quarter) %>% group_by(year) %>%
count(quarter) %>%
ggplot(aes(x = as.factor(quarter), y = n)) +
geom_bar(stat = 'identity', fill = 'red') +
scale_y_continuous(labels = comma) +
facet_grid(. ~ year) + labs(x ='Quarter', y = 'Number of incidents')
q2 <-
gun %>% filter(year != 2013 & quarter == 1) %>%
select(year, quarter) %>%
group_by(year) %>%
count(quarter) %>%
ggplot(aes(x = as.factor(year), y = n)) +
geom_bar(stat = 'identity', fill ='red') +
scale_y_continuous(labels = comma) +
labs(x = 'Incidents in Q1 of each year', y ='Number of incidents')
grid.arrange(q1, q2)
```
### By month
The analysis of quarters shows that more incidents occur in the warmer spring and summer periods. This seems worth diving into a little deeper. In order to compare months I will exclude 2018, as only the first three months have been recorded.
The most visible ‘seasonality’ effect seems to me that the colder months seem to have less incidents. November, December, and February are the 3 months with the lowest number of incidents (February also only has 28 days of course). The exception seems to be January, which is worth investigating later on. My first idea is that possibly incidents on new years eve contribute to January having a high number of incidents.
The other peak is the July/August period. I think that the fact that many people go on holidays in this period is the most likely explanation.
```{r}
gun$month <- month(gun$date, label=TRUE)
#only taking the complete years 2014-2017
plotly::ggplotly(gun %>%
filter(year!=c(2013, 2018)) %>%
count(month) %>%
ggplot(aes(x=month, y=n)) +
geom_bar(stat='identity', fill='red') +
scale_y_continuous(labels=comma) +
labs(x='Month', y='Number of incidents', title='Incidents by Month'))
```
**Hovering over the bars will show a label with the exact number for each month**
#### Dates with most incidents
As I mentioned new years eve as a possible explanation of the high January numbers, while the colder month Nov-Feb generally seem to hold less incidents, I wanted to check what the dates with most incidents are.
```{r}
gun$day <- day(gun$date)
gun <- gun %>% mutate(date2=paste(month, day))
kable(gun %>%
filter(year!=c(2013, 2018)) %>%
count(date2) %>% top_n(10) %>%
arrange(desc(n)) %>%
rename(date=date2, "total number of incidents"=n))
```
The numbers above are actually the totals of 4 dates, as they are aggregated over 4 years (for instance: 1-1-2014, 1-1-2015, 1-1-2016, 1-1-2017). With the average being 618 per date (the total number of incidents in 2014-2017 divided by 365 calendar days; 225598/365), there are not many dates that really stand out. Most of the dates in this top 10 seem ‘ordinary’ days in July/August. However, January 1st indeed partially explains the higher incidents numbers in January. In addition, independence day (July 4th) is also dangerous when it comes down to gun incidents. I assume that the high number on July 5th is due to people continuing to celebrate after midnight.
### By weekday
Weekends (Saturday and Sunday) are more dangerous than working days. My best guess is that this is due to most people not having to work on these days, and also very likely due to nightlife related violence. Violence happening on Friday nightlife will likely happen after midnight to a large extend, and therefore be recorded as event on Saturdays.
```{r}
gun$weekday <- wday(gun$date, label=TRUE)
gun %>%
count(weekday) %>%
ggplot(aes(x=weekday, y=n)) +
geom_bar(stat='identity', fill='red') +
scale_y_continuous(labels=comma) +
labs(x='Weekday', y='Number of incidents', title='Incidents by Weekday')
```
## Comparing number of incidents and victims by location
### Incidents by State
First of all, I have to convert State into a factor variable. As I will also analyze city_or_county later on, I will convert them both at the same time.
```{r}
gun[, c('state', 'city_or_county')] <- lapply(gun[, c('state', 'city_or_county')], as.factor)
str(gun$state)
```
```{r}
str(gun$city_or_county)
```
Below, I am displaying the absolute numbers of incidents by state. However, these numbers should be related to the numbers of inhabitants to get a good view of the relative numbers of incidents. For instance, California is a state with a very large population. Therefore, it is no surprise to see California as the number two in absolute numbers.
```{r, warning=FALSE}
plotly::ggplotly(
gun %>%
count(state) %>%
ggplot(aes(x=reorder(state, n), y=n, fill=n, text=state)) +
geom_bar(stat='identity', fill='red') +
coord_flip() +
labs(x='', y='Number of incidents'),tooltip=c("text", "y"))
```
**Hovering over the bars shows a label with the absolute Number of Incidents for a State**
#### Incidents relative to the State population size
In order be able to relate the numbers above to the population sizes, I had to look for a good source. I found what I needed on the website of the US Census Bureau.
```{r}
fileUrl <-
"https://www2.census.gov/programs-surveys/popest/datasets/2010-2017/state/asrh/scprc-est2017-18+pop-res.csv"
download.file(fileUrl, destfile = "C:/Users/onur-/Downloads/Homepage/Portfolio/Gun violence in the US/censusData.csv")
statesPop <- read_csv("censusData.csv")
statesPop <- statesPop %>%
select(NAME, POPESTIMATE2017) %>%
filter(!NAME %in% c("United States", "Puerto Rico", "Puerto Rico Commonwalth")) %>%
rename(state = NAME) %>%
mutate(state = as.factor(state))
```
Below, I am creating a small “states” dataframe, that eventually holds the number of incidents relative to the population of each state (per 100,000 inhabitants), and the tables shows the first 6 states in alphabetical order.
```{r}
incidentsByState <-
gun %>%
group_by(state) %>%
summarize(stateIncidents = n())
incidentsByState <-
left_join(incidentsByState, statesPop, by = "state")
incidentsByState$Per100000 <-
round((
incidentsByState$stateIncidents / incidentsByState$POPESTIMATE2017
) * 100000)
kable(head(incidentsByState))
```
In the figure below, you can see that the enriched data, which are related to the population of each state, paint a very different picture. As the numbers of incidents are related to the population sizes, these numbers now represent ‘real’ gun danger levels. To show this visually, I have used color codes. Red means a high danger level in terms of relative numbers of incidents, and yellow means that a state is relatively safe.
Now, Alaska, Louisiana and Delaware are showing the highest relative incident numbers. Hawaii seems the safest state to live in (regarding gun related incidents, earthquakes and volcano eruptions are a different story :-)), and the large state of California drops from second state in terms of absolute incidents to a low position when corrected for the large population.
```{r}
plotly::ggplotly(
incidentsByState %>%
filter(state!="District of Columbia") %>%
ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
geom_bar(stat='identity') +
coord_flip() +
labs(x='', y='Incidents per 100,000 inhabitants') +
scale_fill_gradient(low="yellow", high="red") +
theme(legend.position="none"),tooltip=c("text", "y"))
```
**Hovering over the bars shows a label with the relative Number of Incidents for a State**
Note: actually the District of Columbia came out as the state with the highest incidents rate. In all honesty, the state “District of Columbia (DoC)” did not ring any bells for me. However, a quick internet search told me that it is not really a state but actually a “federal district” not related to any state and roughly equal to the city Washington DC (see also <a href="https://www.huffingtonpost.com/2012/12/17/dc-gun-violence-statistics_n_2316944.html//"target="_blank">Huffington Post</a>, <a href="https://www.tripsavvy.com/is-the-district-of-columbia-a-state-1038984//"target="_blank">Tripsavvy</a>). As these incidents have to assigned anyway, I understand that DoC is labeled as a state in the data. However, since Washington DC is already dealt with as a city in section 4.2.3.1, I decided to take it out of the graph above.
Alaska being the state with relatively most incidents came as a surprise to me! Of course, I wanted to find out where this high level of gun violence in Alaska might come from. The following link explains why that the data in this dataset are not lying: <a href="http://www.businessinsider.com/the-state-where-youre-most-likely-to-be-killed-by-a-gun-is-one-of-the-most-beautiful-places-on-earth-2015-6?international=true&r=US&IR=T//"target="_blank">Here’s why Alaska’s gun problem is so bad</a>. In summary, Alaska has very liberal gun laws, very high gun possession as a consequence (60% of homes), and also a very high level of suicides (80% of firearm deaths are self-inflicted).
To get an idea on how these relative figures are composed, I am displaying the underlying numbers of a few states below. As you can see, DoC and Alaska have few inhabitants but a relatively high number of incidents. Hawaii just has low numbers of incidents, and the large state of California comes out with a relatively low number of incidents when taking the population into account.
```{r message=FALSE, warning=FALSE}
kable(incidentsByState %>%
filter(state %in% c('District of Columbia', 'Alaska', 'California', 'Hawaii')))
```
#### Is the high Alaska incidents rate statistically significant?
Heads or Tails raised the question if the higher figure for Alaska was significant giving the relatively small population size. To find that out, I added a 2 proportions z-test.
```{r}
incidents_alaska <- incidentsByState$stateIncidents[incidentsByState$state=='Alaska']
incidents_restUS <- sum(incidentsByState$stateIncidents[incidentsByState$state!='Alaska'])
n_alaska <- incidentsByState$POPESTIMATE2017[incidentsByState$state=='Alaska']
n_restUS <- sum(incidentsByState$POPESTIMATE2017[incidentsByState$state!='Alaska'])
prop.test(x=c(incidents_alaska, incidents_restUS), n=c(n_alaska, n_restUS))
```
The p-value of this test is very small, so we can conclude that the incidents rate in Alaska is significantly higher than the rate of the rest of the US. The sample estimates should be read as: prop1 is the Alaska estimate. If we multiply this number by 100,000 we are getting the 182 incidents per 100,000 inhabitants. The rest of the US has an average of 73 incidents per 100,000 (prop 2 * 100,000).
The confidence interval should be read as follows:
- the point estimate of the difference is 182-73=109
- with 95% confidence the real difference should be between 99 and 118
- the fact that this confidence interval does not include 0 also means that the difference is significant.
Since the statistical margins of error are reasonably small, about plus or minus 10 for a small state like Alaska, I am not going to include any further statistical test in this analysis.
#### An interactive map of incidents by state with Leaflet
As mentioned in the Executive Summary, one of my goals when taking on this dataset was to learn how to create interactive maps. I downloaded the spatial data from the US Census bureau. Below, I am using the package ```rgdal``` to read these spatial data.
```{r}
fileUrl2 <-
"http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip"
download.file(fileUrl2,
destfile = "C:/Users/onur-/Downloads/Homepage/Portfolio/Gun violence in the US/us.zip")
path3 <- unzip("us.zip")
states <-
readOGR(dsn = getwd(),
layer = "cb_2017_us_state_500k",
encoding = "UTF-8")
```
```{r}
class(states)
```
This spatial polygons dataframe also contains data elements, as you can see below.
```{r}
kable(head(states@data)) #please notice the different notation using `@` instead of the usual '$'
```
With this datastructure, a straightforward join (joining the Per100000 column) with additional data is not possible. However, I found out that an easy workaround is to extract data elements from the object, join this small dataframe with the info that I want to display on the map (the incidents per 100,000 inhabitants), and add this column in the ‘states’ object.
```{r}
addPer100k <- data.frame(id=states$GEOID, name=states$NAME)
addPer100k <- left_join(addPer100k, incidentsByState %>% select(state, Per100000) %>% rename(name=state), by="name")
addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0
states$per100k <- addPer100k$Per100000
```
Now it is time to create the map. Please feel free to hover over the states (this shows a popup with state name and the exact number of incidents per 100,000 inhabitants), and zoom in and out. By zooming out, you will be able to see Alaska and Hawaii. By zooming in, you can also find the District of Columbia (between Maryland and Virginia).
```{r}
bins <- c(0, 50, 75, 100, 150, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)
state_popup <- paste0(
"<strong>State: </strong>",
states$NAME,
"<br><strong>Incidents per 100,000 inhabitants </strong>",
states$per100k
) %>% lapply(htmltools::HTML)
leaflet(data = states) %>%
setView(lng = -96, lat = 37.8, zoom = 4) %>%
addProviderTiles("MapBox",
options = providerTileOptions(
id = "mapbox.light",
accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN')
)) %>%
addPolygons(
fillColor = ~ pal(per100k),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE
),
label = state_popup,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
) %>%
addLegend(
pal = pal,
values = ~ per100k,
opacity = 0.7,
title = "Incidents",
position = "bottomright"
)
```
**Please hover over a state to show the details in a popup**
### Victims by State
First of all, I will create a new variable that adds up the number of killed and injured people for each incident.
```{r}
gun$victims <- gun$n_killed + gun$n_injured
```
#### Severity of Incidents
As you can see in the table above, Alaska actually has a relatively low number of victims per incident (0.44). However, as shown in the graph, Louisiana also ranks high with regards to this ratio.
```{r}
VictimsByState <-
gun %>% group_by(state) %>% summarize(
sumVic = sum(victims),
sumInj = sum(n_injured),
sumDeath = sum(n_killed),
PercDeath = round(sumDeath / sumVic, 2),
sumIncidents = n(),
vicPerInc = round(sumVic / sumIncidents, 2)
)
head(VictimsByState)
```
```{r}
VictimsByState %>%
filter(vicPerInc>0.8) %>%
ggplot(aes(x=reorder(state, -vicPerInc), y=vicPerInc)) +
geom_bar(stat='identity', fill='red') +
labs(x='State', y='Victims per incidents') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
#### Victims relative to the state population sizes
In the graph below, which again uses the color codes to show danger levels, you can see that Louisiana is the most dangerous state when it comes down to the number of victims relative to the population size. This is no surprise, as the state has a high relative number of incidents and also ranks high regarding the victims per incident ratio. Alaska drops to 9th place, which is due to the relatively low number of victims per incident.
```{r}
VictimsByState <- left_join(VictimsByState, statesPop, by = "state")
VictimsByState$Per100000 <-
round((VictimsByState$sumVic / VictimsByState$POPESTIMATE2017) * 100000)
plotly::ggplotly(
VictimsByState %>%
filter(state != "District of Columbia") %>%
ggplot(aes(
x = reorder(state, Per100000),
y = Per100000,
fill = Per100000,
text = state
)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(x = '', y = 'Victims per 100,000 inhabitants') +
scale_fill_gradient(low ="yellow", high = "red") +
theme(legend.position = "none"),tooltip = c("text", "y")
)
```
**Hovering over the bars shows a label with the relative Number of Victims for a State**
#### An interactive map of victims by state with Leaflet
Making a leaflet of these figures is now easy, as I can reuse most code. All I need to do is replace the incident figures by the victim numbers.
```{r}
addPer100k <- addPer100k %>% select(id, name)
addPer100k <-
left_join(addPer100k,
VictimsByState %>% select(state, Per100000) %>% rename(name = state),
by = "name")
addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0
states$per100k <- addPer100k$Per100000
bins <- c(0, 25, 50, 75, 100, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)
state_popup <- paste0(
"<strong>State: </strong>",
states$NAME,
"<br><strong>Victims per 100,000 inhabitants </strong>",
states$per100k
) %>% lapply(htmltools::HTML)
leaflet(data = states) %>%
setView(lng = -96, lat = 37.8, zoom = 4) %>%
addProviderTiles("MapBox",
options = providerTileOptions(
id = "mapbox.light",
accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN')
)) %>%
addPolygons(
fillColor = ~ pal(per100k),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE
),
label = state_popup,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"
)
) %>%
addLegend(
pal = pal,
values = ~ per100k,
opacity = 0.7,
title = "Victims",
position = "bottomright"
)
```
**Please hover over a state to show the details in a popup**
### Incidents by city
It seems likely that a lot of violence occurs in the big cities of the states. For instance, I expected to see a lot incidents in the big cities of Illinois (Chicago) and Louisiana (New Orleans). The biggest city in Alaska (Anchorage) is unlikely to appear in the list of cities with the highest absolute numbers of incidents as the city is relatively small.
```{r}
incidentsByCity <-
gun %>%
select(city_or_county, state) %>%
rename(city=city_or_county) %>%
group_by(city, state) %>%
summarize(cityIncidents=n())
```
New York City did not seem to be recorded as a whole in the dataset. It turns out that the boroughs are recorded separately.
```{r}
incidentsByCity[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York',]
```
Before I continue, I am adding New York as a city.
```{r}
sumNewYork <- sum(incidentsByCity$cityIncidents[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York'])
NewYork <- data.frame(city='New York', state='New York', cityIncidents=sumNewYork)
incidentsByCity <- as.tibble(rbind(as.data.frame(incidentsByCity), NewYork))
```
```{r}
incidentsByCity %>%
top_n(20, wt=cityIncidents) %>%
ggplot(aes(x=reorder(city, cityIncidents), y=cityIncidents)) +
geom_bar(stat='identity', fill='red') +
labs(x='City', y='Number of incidents') +
coord_flip()
```
Although I expected Chicago to have a lot of incidents, I am still surprised to see that it holds the ‘number one spot’ by a huge margin! In the next section, I will relate those absolute numbers to the population by city.
#### Incidents relative to the City population size
```{r}
fileUrl3 <-
"https://raw.githubusercontent.com/Haykall/superhero-API/master/uscitiesv1.4.csv"
download.file(fileUrl3,
destfile = "C:/Users/onur-/Downloads/Homepage/Portfolio/Gun violence in the US/uscitys.csv")
citiesPop <- read_csv("uscitys.csv", col_types = cols())
citiesPop <-
citiesPop %>%
select(city, state_name, population_proper) %>%
rename(state =state_name, population = population_proper) %>%
filter(population > 600000)
```
As this is a large file, I decided to only focus on the bigger cities. Selecting all cities with at least 600,000 inhabitant just includes Baltimore; the number 2 in absolute numbers of incidents.
```{r}
citiesPop <-left_join(citiesPop, incidentsByCity, by=c("city", "state"))
citiesPop$Per100000 <- round((citiesPop$cityIncidents/citiesPop$population)*100000)
citiesPop$citystate <- str_c(citiesPop$city, " - " ,citiesPop$state)
incidentsByState <- incidentsByState %>% rename(state_avg=Per100000)
citiesPop <- left_join(citiesPop, incidentsByState %>% select(state, state_avg), by="state")
citiesPop1 <-
citiesPop %>%
select(citystate, Per100000, state, state_avg) %>%
rename(city_avg=Per100000)
gathercols <- c("city_avg", "state_avg")
CitiesStatesLong <- tidyr::gather(citiesPop1, city_or_state, per100k, gathercols)
citiesTop20 <-
CitiesStatesLong %>%
filter(city_or_state=='city_avg') %>%
arrange(per100k) %>%
top_n(20, wt=per100k)
Top20names <- citiesTop20$citystate
CitiesStatesLong <- CitiesStatesLong[CitiesStatesLong$citystate %in% Top20names,]
```
As you can see, the city averages of the cities with high incident numbers are much higher than their associated state averages. At the bottom of this list (not displayed) a few cities have lower numbers than the state , but generally cities are more dangerous than the country side.
```{r}
ggplot(
CitiesStatesLong, aes(x=factor(citystate), y=per100k, fill=city_or_state)) +
geom_bar(stat="identity", position = position_dodge2(reverse=TRUE, padding=0.1)) +
coord_flip() +
scale_fill_manual(values = c("state_avg"="yellow", "city_avg"="red")) +
scale_x_discrete(limits=Top20names) +
labs(y='Incidents per 100,000 inhabitants', x="")
```
## Incidents with highest numbers of victims
Below, I am displaying the 13 incidents with the highest numbers of victims. Although any single victim is one too many, the shooting in Las Vegas was by far the worst incident with over 470 victims.
```{r}
Top10 <-
gun %>%
select(incident_id, date, n_killed, n_injured, victims, location_description, city_or_county, state, latitude, longitude) %>%
rename(Incident_Id=incident_id, Date=date, Killed=n_killed, Injured=n_injured, Victims=victims, Location=location_description, City=city_or_county) %>%
arrange(desc(Victims)) %>%
top_n(n=13, wt=Victims)
```
```{r}
kable(Top10 %>% select(-longitude, -latitude))
```
### An interactive map of the incidents with highest numbers of victims
```{r}
TopMap <- Top10 %>% select(latitude, longitude, Victims, City, Location)
labels <- paste0("<strong>City: </strong>", TopMap$City,
"<br><strong>Location: </strong>", TopMap$Location,
"<br><strong>Victims </strong>", TopMap$Victims) %>% lapply(htmltools::HTML)
leaflet(TopMap) %>%
setView(lng=-96, lat=37.8, zoom=4) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(~longitude, ~latitude, color = "red", radius=~sqrt(Victims), label = labels)
```
**Please hover over an incident to show a label with the City, Location and the Number of Victims**
## Analyzing the Incident Characteristics
As you can see below, the ‘incident_characteristics` field actually holds a lot in information. It seems as if the person who entered the info had the choice to select multiple prescribed categories. These are separated by ’||’ in the incident_characteristics column. The first two incidents are both describes by 4 categories, but this varies among the data.
```{r}
head(gun$incident_characteristics,2)
```
The function ```cSplit``` of package ```splitstackshape``` works perfectly to split this column and store all characteristics as separate observations in a new dataframe. Besides the incident_id, I am also keeping the state and city, as I also want to see the specifics for for instance Alaska and Baltimore.
```{r}
#replacing "||" with "|" as both separators are used
gun$incident_characteristics <- gsub("\\|\\|", "|", gun$incident_characteristics)
IncCharac <- splitstackshape::cSplit(
gun %>%
select(incident_id, state, city_or_county, incident_characteristics), 'incident_characteristics', sep = '|', direction="long")
numCat <- round(nrow(IncCharac)/nrow(gun),1)
cat('On average, there are', numCat, 'incident categories specified per incident')
```
Below, you can see that the first two incidents are now split nicely into 8 observations.
```{r}
kable(head(IncCharac,8))
```
### Incident categories in the US
As you can see, there are lots of different categories. These do not necessarily all involve shots being fired. I for instance assume that ‘Non-Shooting Incident’ means that people have just threatened to shoot, or possible have used their gun as a striking weapon.
```{r}
IncCharac %>%
count(incident_characteristics) %>%
top_n(30, wt=n) %>%
ggplot(aes(x=reorder(incident_characteristics, n), y=n)) +
geom_bar(stat='identity', fill='red') +
coord_flip() +
labs(x='Incident Category', y='number of incidents')
```
As the first 4 categories seem overall categories, I will display numbers on these 4 categories separately in the remainder of this section.
```{r}
overallCats <- c("Shot - Wounded/Injured", "Shot - Dead (murder, accidental, suicide)", "Non-Shooting Incident", "Shots Fired - No Injuries")
```
```{r}
TableOverallCats <-
IncCharac %>%
filter(incident_characteristics %in% overallCats) %>%
count(incident_characteristics)
cat('For', round((sum(TableOverallCats$n)/nrow(gun))*100), 'percent of incidents, an overall category is specified')
```
In my opinion, this is close enough to 100%. I am going to assume that the distribution over those 4 categories for the remaining 5% of incidents is roughly the same.
#### Comparing the main incident categories by state
First of all, I am going to see if there are any significant differences when comparing those categories for the three states with the highest incident rates with the distribution within the US overall.
```{r}
coloursShot <- c("Shot - Wounded/Injured"="orange", "Shot - Dead (murder, accidental, suicide)"="red", "Non-Shooting Incident"="green", "Shots Fired - No Injuries"="yellow")
#creating a function to vary the x-axis scale (next plot uses same graph with different scale)
usCats <- function(fixedX=0.5){
IncCharac %>% filter(incident_characteristics %in% overallCats) %>%
count(incident_characteristics) %>%
ggplot(aes(x=reorder(incident_characteristics, n), y=n/sum(n), fill=factor(incident_characteristics))) +
geom_bar(stat='identity', width = 0.5) + scale_fill_manual(values = coloursShot) +
theme(legend.position="none") + coord_flip(ylim = c(0, fixedX)) + labs(x="", y='US overall') +
scale_y_continuous(labels=percent)
}
#creating a function to create plots by state
stateCats <- function(stateName){
IncCharac %>% filter(state==stateName & incident_characteristics %in% overallCats) %>%
count(incident_characteristics) %>%
ggplot(aes(x=reorder(incident_characteristics, n), y=n/sum(n), fill=factor(incident_characteristics))) +
geom_bar(stat='identity', width = 0.5)+ scale_fill_manual(values = coloursShot) +
theme(legend.position="none") + coord_flip(ylim = c(0, 0.5)) + labs(x="", y=stateName) +
scale_y_continuous(labels=percent)
}
usOverallCats <- usCats()
alaskaCats <- stateCats('Alaska')
delawareCats <- stateCats('Delaware')
louisianaCats <- stateCats('Louisiana')
grid.arrange(usOverallCats, alaskaCats, delawareCats, louisianaCats, ncol=1)
```
Regarding the states with relatively most incidents, we can say that compared to the US averages:
- In Louisiana, incidents are more severe with both higher percentages of incidents with people wounded and dead
- In Delaware, there were more incidents with people wounded, but also less incidents with people killed
- In Alaska, incidents were less severe with less incidents with wounded people and more incidents with no-shooting
#### Comparing the main incident categories by city
```{r}
#creating a function to create plots by city
cityCats <- function(cityName){
IncCharac %>%
filter(city_or_county==cityName & incident_characteristics %in% overallCats) %>%
count(incident_characteristics) %>%
ggplot(aes(x=reorder(incident_characteristics, n), y=n/sum(n), fill=factor(incident_characteristics))) +
geom_bar(stat='identity', width = 0.5) +
scale_fill_manual(values = coloursShot) +
theme(legend.position="none") +
coord_flip(ylim = c(0, 0.8)) +
labs(x="", y=cityName) +
scale_y_continuous(labels=percent)
}
usOverallCats <- usCats(0.8)
baltimoreCats <- cityCats('Baltimore')
washingtonCats <- cityCats('Washington')
chicagoCats <- cityCats('Chicago')
grid.arrange(usOverallCats, baltimoreCats, washingtonCats, chicagoCats, ncol=1)
```
Regarding the cities with relatively most incidents, we can say that:
- In Baltimore the percentage of incidents with people wounded is high (20% higher than US average). In addition, there are very few incidents with shots fired and people not injured.
- In Washington DC, despite having a high average of incidents per 100,000 incidents, incidents are on average less severe than the US average. Relatively, there are more Non-Shooting incidents, and less incidents with deadly victims.
- In Chicago, the percentage of incidents with wounded people is very high. On the other hand, the percentage of incidents with deadly victims is a bit lower than the US average.
### Opportunities offered by the smaller incident categories
The incident categories offer a wealth of information, and actually 110 categories are recorded in the database. Below, I am displaying the first 10 categories alphabetically.
```{r}
CatTable <- IncCharac %>% count(incident_characteristics)
kable(CatTable[1:10,])
```
#### Sub categories of the deadly incidents
The overall category “Shot - Dead (murder, accidental, suicide)” for instance consists of multiple sub categories.
```{r}
Dead <- c("Shot - Dead (murder, accidental, suicide)", "Accidental Shooting - Death", "Murder/Suicide", "Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location)", "Suicide^", "Attempted Murder/Suicide (one variable unsuccessful)")
CatTable %>% filter(incident_characteristics %in% Dead)
```
I assume that murder/suicide means that the person who committed suicide first killed other people. If I have the time, I will check if this list of sub categories is complete, and include a visualization. Another sub category that is worth investigating is “Suicide”, as for instance the article on Alaska suggests a high suicide rate in this state.
#### Terrorism, gang involvement, and drug involvement
Other sub categories that can be visualized well with leaflet maps are terrorism, gang, and drugs involvement.
```{r}
Involvement <- c("Terrorism Involvement", "Drug involvement", "Gang involvement")
kable(CatTable %>% filter(incident_characteristics %in% Involvement))
```
**Drug involvement:**
I chose to first make a map of drug involvement, as this forces me to use clusting of incidents to make the map readable. After having experimented with zoom levels, I decided to zoom in on the Baltimore/Washington DC area, as especially Baltimore had a high number drug related incidents. A quick internet search learned me that this is correct, see also Baltimore Is the U.S. <a href="https://abcnews.go.com/US/story?id=92699&page=1/" target="_blank">Heroin Capital</a>.
Of course, you can zoom out but this has the disadvantage that areas become so large that multiple cities are aggregated. Swiping through the map while keeping the zoom level the same gives the best overview on the “city” level in my opinion. Clicking on the clusters automatically zooms in on that particular area, and breaks down a cluster into smaller clusters. You can continue to do so until you see all the individual incidents.
```{r message=FALSE, warning=FALSE}
Drugs <- IncCharac %>% filter(incident_characteristics=="Drug involvement")
Drugs <- left_join(Drugs, gun %>% select(incident_id, longitude, latitude, location_description, victims), by="incident_id")
labels <- paste0("<strong>City: </strong>", Drugs$city_or_county,
"<br><strong>Location: </strong>", Drugs$location_description,
"<br><strong>Victims </strong>", Drugs$victims) %>% lapply(htmltools::HTML)
leaflet(Drugs) %>%
setView(lng=-76.6, lat=39.3, zoom=8) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron", group="Light map") %>%
addProviderTiles("Esri.NatGeoWorldMap", group= "Green map") %>%
addScaleBar %>%
addMarkers(~longitude, ~latitude,
label = labels,
clusterOptions = markerClusterOptions()) %>%
addLayersControl(baseGroups = c("Green map", "Light map"), options = layersControlOptions(collapsed = FALSE))
```
**Gang involvement:**
The map in this section is zoomed in on Chicago, as this city has by far the most incidents with gang involvement. Please also have a look at this article: <a href="https://www.therichest.com/rich-list/most-shocking/the-most-gang-infested-cities-in-america/" target="_blank">The 6 Most Gang Infested Cities in America</a>
> In recent years Chicago has made a dramatic leap to overcome Los Angeles as America’s gang capital with a staggering 150,000 gang members and become one of the most violent cities in the country.
Clicking on the clusters automatically zooms in on that particular area, and breaks down a cluster into smaller clusters. You can continue to do so until you see all the individual incidents.
```{r message=FALSE, warning=FALSE}
Gangs <- IncCharac %>% filter(incident_characteristics=="Gang involvement")
Gangs <- left_join(Gangs, gun %>% select(incident_id, longitude, latitude, location_description, victims), by="incident_id")
labels <- paste0("<strong>City: </strong>", Gangs$city_or_county,
"<br><strong>Location: </strong>", Gangs$location_description,
"<br><strong>Victims </strong>", Gangs$victims) %>% lapply(htmltools::HTML)
leaflet(Gangs) %>%
setView(lng=-87.6, lat=41.9, zoom=6) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron", group="Light map") %>%
addProviderTiles("Esri.NatGeoWorldMap", group= "Green map") %>%
addScaleBar %>%
addMarkers(~longitude, ~latitude,
label = labels,
clusterOptions = markerClusterOptions()) %>%
addLayersControl(baseGroups = c("Green map", "Light map"), options = layersControlOptions(collapsed = FALSE))
```
**Terrorism**
The 30 terrorism incidents are shown on the map below. Hovering over an incident shows a label with the city, location, and the number of victims.
```{r message=FALSE, warning=FALSE}
Terror <- IncCharac %>% filter(incident_characteristics=="Terrorism Involvement")
Terror <- left_join(Terror, gun %>% select(incident_id, longitude, latitude, location_description, victims), by="incident_id")
labels <- paste0("<strong>City: </strong>", Terror$city_or_county,
"<br><strong>Location: </strong>", Terror$location_description,
"<br><strong>Victims </strong>", Terror$victims) %>% lapply(htmltools::HTML)
leaflet(Terror) %>%
setView(lng=-96, lat=37.8, zoom=4) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(~longitude, ~latitude, color = "red", radius=~sqrt(victims), label = labels)
```