Gun Violence US.Rmd


---
title: "Gun violence in the US"
output: 
  rmdformats::downcute:
    use_bookdown: TRUE 
    lightbox: TRUE
    gallery: TRUE
    fig_caption: TRUE
    code_folding: hide
---



```{r, include = FALSE }
knitr::opts_chunk$set(
  echo = TRUE, warning = FALSE
)
```


# Executive summary

The US is one of the few countries in which the right to bear arms is constitutionally protected. A consequence of this is that the level of gun voilence is very high compared to other countries. After analyzing the dataset, which was downloaded from <a href="https://www.gunviolencearchive.org//"target="_blank">gunviolencearchive.org</a>, my main findings are:

- The trend is unfortunately upward. The number of incidents increased year-on-year from 2013 to 2017. The number of incidents in Q1 2018 is lower that this number in Q1 2017, which can be seen as a cautiously positive sign.
-On average, the number of incidents is lower in autumn/winter than in spring/summer. Weekends are more dangerous than weekdays, but this was obvious to me. The most dangerous dates are January 1st (New Year’s Day), and July 4th (Independence Day, with many incidents also happening the day after, July 5th).

- I enriched the data with population figures from the US Census Bureau. After having done that, Alaska came out as the (to me surprising) state with the relatively highest number of incidents. Besides Alaska, Louisiana and Delaware are also states with high average incident numbers. The safest state regarding the relative number of incidents is Hawaii. However, Alaska has a relatively low ratio of victims (number of people killed + injured) per incident. This seems to be caused by a high percentage of “non-shooting incidents” (I assume this mostly means incidents where people have only threatened to shoot). When ranking the states on the number of victims relative to the population size, Louisiana is the most dangerous state. In this statistic, Illinois jumps to second place, which is due to the state having the highest ratio of victims per incident.

- Of the big cities with at least 600,000 inhabitants, Baltimore, Washington DC, and Chicago have the highest relative numbers of incidents. Big cities are generally more dangerous than the country side regarding gun violence, as the incident rates are (much) higher.

- The most severe incident by far was the shooting at the Strip in Las Vegas (from the Mandalay Bay hotel), with a total of 470 people that were killed or injured.

- The incident characteristics offer a wealth of information. Altogether 110 categories were used to describe the incidents. On a high level, it for instance explains the low victims/incidents ratio of Alaska. However, categories such as terrorism, gang, and drug involvement were also recorded.
    - I have included a map that summarizes drug related incidents. It turns out that Baltimore has many drug related incidents when compared to the size of the city (Internet quote: In a city of 645,000, the Baltimore Department of Health estimates there are 60,000 drug addicts, with as many as 48,000 of them hooked on heroin).
    - Regarding gang related accidents, Chicago is the city with by far the most incidents.
    - 30 incidents were labeled as “Terrorism involvement”, and I displayed those on a map too.
    
One of my goals of this analysis was to learn how to make interactive maps in R, and the my first interactive ```Leaflet``` maps are included in this version. In addition, I have also converted some ```ggplots``` to ```plotly``` objects, which enables interactive showing of labels with exact information.


# Introduction

Gun Violence Archive (GVA) is a not for profit corporation formed in 2013 to provide free online public access to information about gun-related violence in the United States. GVA states that America is an exceptional country when it comes to guns. It’s one of the few countries in which the right to bear arms is constitutionally protected. But America’s relationship with guns is unique in another crucial way: Among developed nations, the US is far and away the most homicidal. For instance, America has six times as many firearm homicides as Canada, and nearly 16 times as many as Germany.

The data in this dataset was downloaded from <a href="http://www.gunviolencearchive.org//"target="_blank">gunviolencearchive.org</a>. The CSV file contains data for all recorded gun violence incidents in the US between January 2013 and March 2018. The data consist of 240k incidents, with detailed information about each incident.


# Loading and Exploring Data


## Loading libraries required and reading the data into R

Loading R packages used besides base R.

```{r loadlib, echo=T, results='hide', message=F, warning=F}
library(knitr)
library(dplyr)
library(readr)
library(ggplot2)
library(tibble)
library(stringr)
library(gridExtra)
library(scales)
library(lubridate)
library(ggrepel)
library(leaflet)
library(rgdal)
library(splitstackshape)
```

Loading data and adding the large, missing Las Vegas incident in 2017 manually. The dataset poster explains why he took this incident out in this discussion topic: <a href="https://www.kaggle.com/jameslko/gun-violence-data/discussion/55307//"target="_blank">Las Vegas incident missing</a>.


```{r}
gun <-
    as.tibble(data.table::fread(
        str_c("gun-violence-data_01-2013_03-2018.csv"),
        header = TRUE,
        stringsAsFactors = FALSE,
        na.strings = c("NA", "")
    ))

lasvegas <-
    list(
        999999,'2017-10-01','Nevada','Las Vegas','Mandalay Bay 3950 Blvd S',
        59,411,'https://en.wikipedia.org/wiki/2017_Las_Vegas_shooting',
        'https://en.wikipedia.org/wiki/2017_Las_Vegas_shooting',
        NA,NA,NA,NA,NA,36.095,'Mandalay Bay Hotel',-115.171667,47,
        'Route 91 Harvest Festiva; concert, open fire from 32nd floor. 47 guns seized; TOTAL:59 kill, 441 inj, number shot TBD,girlfriend Marilou Danley POI',
        NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
    )

gun <- rbind(gun, lasvegas)
```


## Data size and structure

```{r}
glimpse(gun)
```

Many variables need to be reformatted before they can be used. I chose to do this by variable as soon as I need the variable for a particular exploration.


# Data exploration

## Comparing number of incidents by year, quarter, month, and weekday

First of all, I need to reformat the Date variable.

```{r}
gun$date <- ymd(gun$date)
str(gun$date)
```
```{r}
summary(gun$date)
```

### By year

It seems as if most incidents in 2013 were not recorded. In addition, 2018 is not a full year of data as the latest recorded incident was on May 31st 2018. As this means exactly one quarter, I am interested if the trend is still upward if I only look at the Q1s of 2014-2018. I will look into this in the next section.

```{r}
gun$year <- year(gun$date) #extract year from date using the lubridate year() function


gun %>%
    ggplot(aes(x = as.factor(year))) +
    geom_bar(stat = 'count', fill = 'red') +
    scale_y_continuous(labels = comma) +
    geom_label(stat = "count", aes(label = ..count.., y = ..count..))

```


### By quarter; what does Q1 2018 tell us?

There seems to be some sort of ‘seasonality’ with Q1 and Q4 generally having lower numbers of incidents than Q2 and Q3. The second graph shows that Q1 2018 holds less incidents than Q1 2017. This could be seen as a cautiously positive sign. However, one should realize that this number is still very high when compared to other countries (relatively).

```{r}
gun$quarter <- quarter(gun$date) #extract Quarters from date

q1 <-
    gun %>% filter(year != 2013) %>% 
    select(year, quarter) %>% group_by(year) %>% 
    count(quarter) %>%
    ggplot(aes(x = as.factor(quarter), y = n)) + 
    geom_bar(stat = 'identity', fill = 'red') +
    scale_y_continuous(labels = comma) + 
    facet_grid(. ~ year) + labs(x ='Quarter', y = 'Number of incidents')

q2 <-
    gun %>% filter(year != 2013 & quarter == 1) %>% 
    select(year, quarter) %>%
    group_by(year) %>% 
    count(quarter) %>%
    ggplot(aes(x = as.factor(year), y = n)) + 
    geom_bar(stat = 'identity', fill ='red') +
    scale_y_continuous(labels = comma) + 
    labs(x = 'Incidents in Q1 of each year', y ='Number of incidents')

grid.arrange(q1, q2)
```


### By month

The analysis of quarters shows that more incidents occur in the warmer spring and summer periods. This seems worth diving into a little deeper. In order to compare months I will exclude 2018, as only the first three months have been recorded.

The most visible ‘seasonality’ effect seems to me that the colder months seem to have less incidents. November, December, and February are the 3 months with the lowest number of incidents (February also only has 28 days of course). The exception seems to be January, which is worth investigating later on. My first idea is that possibly incidents on new years eve contribute to January having a high number of incidents.

The other peak is the July/August period. I think that the fact that many people go on holidays in this period is the most likely explanation.

```{r}
gun$month <- month(gun$date, label=TRUE)

#only taking the complete years 2014-2017
plotly::ggplotly(gun %>% 
                     filter(year!=c(2013, 2018)) %>% 
                     count(month) %>%
                     ggplot(aes(x=month, y=n)) + 
                     geom_bar(stat='identity', fill='red') +
                     scale_y_continuous(labels=comma) +
                     labs(x='Month', y='Number of incidents', title='Incidents by Month'))
```
**Hovering over the bars will show a label with the exact number for each month**


#### Dates with most incidents

As I mentioned new years eve as a possible explanation of the high January numbers, while the colder month Nov-Feb generally seem to hold less incidents, I wanted to check what the dates with most incidents are.


```{r}
gun$day <- day(gun$date)
gun <- gun %>% mutate(date2=paste(month, day))
kable(gun %>% 
          filter(year!=c(2013, 2018)) %>% 
          count(date2) %>% top_n(10) %>% 
          arrange(desc(n)) %>% 
          rename(date=date2, "total number of incidents"=n))
```

The numbers above are actually the totals of 4 dates, as they are aggregated over 4 years (for instance: 1-1-2014, 1-1-2015, 1-1-2016, 1-1-2017). With the average being 618 per date (the total number of incidents in 2014-2017 divided by 365 calendar days; 225598/365), there are not many dates that really stand out. Most of the dates in this top 10 seem ‘ordinary’ days in July/August. However, January 1st indeed partially explains the higher incidents numbers in January. In addition, independence day (July 4th) is also dangerous when it comes down to gun incidents. I assume that the high number on July 5th is due to people continuing to celebrate after midnight.


### By weekday

Weekends (Saturday and Sunday) are more dangerous than working days. My best guess is that this is due to most people not having to work on these days, and also very likely due to nightlife related violence. Violence happening on Friday nightlife will likely happen after midnight to a large extend, and therefore be recorded as event on Saturdays.

```{r}
gun$weekday <- wday(gun$date, label=TRUE)

gun %>% 
    count(weekday) %>%
    ggplot(aes(x=weekday, y=n)) + 
    geom_bar(stat='identity', fill='red') +
    scale_y_continuous(labels=comma) +
    labs(x='Weekday', y='Number of incidents', title='Incidents by Weekday')
```


## Comparing number of incidents and victims by location


### Incidents by State

First of all, I have to convert State into a factor variable. As I will also analyze city_or_county later on, I will convert them both at the same time.

```{r}
gun[, c('state', 'city_or_county')] <- lapply(gun[, c('state', 'city_or_county')], as.factor)

str(gun$state)
```
```{r}
str(gun$city_or_county)
```

Below, I am displaying the absolute numbers of incidents by state. However, these numbers should be related to the numbers of inhabitants to get a good view of the relative numbers of incidents. For instance, California is a state with a very large population. Therefore, it is no surprise to see California as the number two in absolute numbers.

```{r, warning=FALSE}
plotly::ggplotly(
    gun %>% 
        count(state) %>%
        ggplot(aes(x=reorder(state, n), y=n, fill=n, text=state)) +
        geom_bar(stat='identity', fill='red') + 
        coord_flip() +
        labs(x='', y='Number of incidents'),tooltip=c("text", "y"))
```
**Hovering over the bars shows a label with the absolute Number of Incidents for a State**


#### Incidents relative to the State population size

In order be able to relate the numbers above to the population sizes, I had to look for a good source. I found what I needed on the website of the US Census Bureau.

```{r}
fileUrl <-
    "https://www2.census.gov/programs-surveys/popest/datasets/2010-2017/state/asrh/scprc-est2017-18+pop-res.csv"
download.file(fileUrl, destfile = "C:/Users/onur-/Downloads/Homepage/Portfolio/Gun violence in the US/censusData.csv")

statesPop <- read_csv("censusData.csv")
statesPop <- statesPop %>%
    select(NAME, POPESTIMATE2017) %>%
    filter(!NAME %in% c("United States", "Puerto Rico", "Puerto Rico Commonwalth")) %>%
    rename(state = NAME) %>%
    mutate(state = as.factor(state))
```

Below, I am creating a small “states” dataframe, that eventually holds the number of incidents relative to the population of each state (per 100,000 inhabitants), and the tables shows the first 6 states in alphabetical order.

```{r}
incidentsByState <-
    gun %>% 
    group_by(state) %>% 
    summarize(stateIncidents = n())

incidentsByState <-
    left_join(incidentsByState, statesPop, by = "state")

incidentsByState$Per100000 <-
    round((
        incidentsByState$stateIncidents / incidentsByState$POPESTIMATE2017
    ) * 100000)

kable(head(incidentsByState))
```

In the figure below, you can see that the enriched data, which are related to the population of each state, paint a very different picture. As the numbers of incidents are related to the population sizes, these numbers now represent ‘real’ gun danger levels. To show this visually, I have used color codes. Red means a high danger level in terms of relative numbers of incidents, and yellow means that a state is relatively safe.

Now, Alaska, Louisiana and Delaware are showing the highest relative incident numbers. Hawaii seems the safest state to live in (regarding gun related incidents, earthquakes and volcano eruptions are a different story :-)), and the large state of California drops from second state in terms of absolute incidents to a low position when corrected for the large population.

```{r}
plotly::ggplotly(
    incidentsByState %>% 
        filter(state!="District of Columbia") %>%
        ggplot(aes(x=reorder(state, Per100000), y=Per100000, fill=Per100000, text=state)) +
        geom_bar(stat='identity') + 
        coord_flip() +
        labs(x='', y='Incidents per 100,000 inhabitants') + 
        scale_fill_gradient(low="yellow", high="red") +
        theme(legend.position="none"),tooltip=c("text", "y"))
```
**Hovering over the bars shows a label with the relative Number of Incidents for a State**

Note: actually the District of Columbia came out as the state with the highest incidents rate. In all honesty, the state “District of Columbia (DoC)” did not ring any bells for me. However, a quick internet search told me that it is not really a state but actually a “federal district” not related to any state and roughly equal to the city Washington DC (see also <a href="https://www.huffingtonpost.com/2012/12/17/dc-gun-violence-statistics_n_2316944.html//"target="_blank">Huffington Post</a>, <a href="https://www.tripsavvy.com/is-the-district-of-columbia-a-state-1038984//"target="_blank">Tripsavvy</a>). As these incidents have to assigned anyway, I understand that DoC is labeled as a state in the data. However, since Washington DC is already dealt with as a city in section 4.2.3.1, I decided to take it out of the graph above.

Alaska being the state with relatively most incidents came as a surprise to me! Of course, I wanted to find out where this high level of gun violence in Alaska might come from. The following link explains why that the data in this dataset are not lying: <a href="http://www.businessinsider.com/the-state-where-youre-most-likely-to-be-killed-by-a-gun-is-one-of-the-most-beautiful-places-on-earth-2015-6?international=true&r=US&IR=T//"target="_blank">Here’s why Alaska’s gun problem is so bad</a>. In summary, Alaska has very liberal gun laws, very high gun possession as a consequence (60% of homes), and also a very high level of suicides (80% of firearm deaths are self-inflicted).

To get an idea on how these relative figures are composed, I am displaying the underlying numbers of a few states below. As you can see, DoC and Alaska have few inhabitants but a relatively high number of incidents. Hawaii just has low numbers of incidents, and the large state of California comes out with a relatively low number of incidents when taking the population into account.


```{r message=FALSE, warning=FALSE}
kable(incidentsByState %>%
        filter(state %in% c('District of Columbia', 'Alaska', 'California', 'Hawaii')))
```


#### Is the high Alaska incidents rate statistically significant?

Heads or Tails raised the question if the higher figure for Alaska was significant giving the relatively small population size. To find that out, I added a 2 proportions z-test.

```{r}
incidents_alaska <- incidentsByState$stateIncidents[incidentsByState$state=='Alaska']
incidents_restUS <- sum(incidentsByState$stateIncidents[incidentsByState$state!='Alaska'])
n_alaska <- incidentsByState$POPESTIMATE2017[incidentsByState$state=='Alaska']
n_restUS <- sum(incidentsByState$POPESTIMATE2017[incidentsByState$state!='Alaska'])

prop.test(x=c(incidents_alaska, incidents_restUS), n=c(n_alaska, n_restUS))
```
The p-value of this test is very small, so we can conclude that the incidents rate in Alaska is significantly higher than the rate of the rest of the US. The sample estimates should be read as: prop1 is the Alaska estimate. If we multiply this number by 100,000 we are getting the 182 incidents per 100,000 inhabitants. The rest of the US has an average of 73 incidents per 100,000 (prop 2 * 100,000).

The confidence interval should be read as follows:

- the point estimate of the difference is 182-73=109
- with 95% confidence the real difference should be between 99 and 118
- the fact that this confidence interval does not include 0 also means that the difference is significant.

Since the statistical margins of error are reasonably small, about plus or minus 10 for a small state like Alaska, I am not going to include any further statistical test in this analysis.


#### An interactive map of incidents by state with Leaflet

As mentioned in the Executive Summary, one of my goals when taking on this dataset was to learn how to create interactive maps. I downloaded the spatial data from the US Census bureau. Below, I am using the package ```rgdal``` to read these spatial data.

```{r}
fileUrl2 <-
    "http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip"
download.file(fileUrl2,
              destfile = "C:/Users/onur-/Downloads/Homepage/Portfolio/Gun violence in the US/us.zip")
path3 <- unzip("us.zip")
states <-
    readOGR(dsn = getwd(),
            layer = "cb_2017_us_state_500k",
            encoding = "UTF-8")
```

```{r}
class(states)
```

This spatial polygons dataframe also contains data elements, as you can see below.

```{r}
kable(head(states@data)) #please notice the different notation using `@` instead of the usual '$'
```

With this datastructure, a straightforward join (joining the Per100000 column) with additional data is not possible. However, I found out that an easy workaround is to extract data elements from the object, join this small dataframe with the info that I want to display on the map (the incidents per 100,000 inhabitants), and add this column in the ‘states’ object.

```{r}
addPer100k <- data.frame(id=states$GEOID, name=states$NAME)
addPer100k <- left_join(addPer100k, incidentsByState %>% select(state, Per100000) %>% rename(name=state), by="name")
addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0
states$per100k <- addPer100k$Per100000
```

Now it is time to create the map. Please feel free to hover over the states (this shows a popup with state name and the exact number of incidents per 100,000 inhabitants), and zoom in and out. By zooming out, you will be able to see Alaska and Hawaii. By zooming in, you can also find the District of Columbia (between Maryland and Virginia).

```{r}
bins <- c(0, 50, 75, 100, 150, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)

state_popup <- paste0(
    "<strong>State: </strong>",
    states$NAME,
    "<br><strong>Incidents per 100,000 inhabitants </strong>",
    states$per100k
) %>% lapply(htmltools::HTML)

leaflet(data = states) %>%
    setView(lng = -96, lat = 37.8, zoom = 4) %>%
    addProviderTiles("MapBox",
                     options = providerTileOptions(
                         id = "mapbox.light",
                         accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN')
                     )) %>%
    addPolygons(
        fillColor = ~ pal(per100k),
        weight = 2,
        opacity = 1,
        color = "white",
        dashArray = "3",
        fillOpacity = 0.7,
        highlight = highlightOptions(
            weight = 5,
            color = "#666",
            dashArray = "",
            fillOpacity = 0.7,
            bringToFront = TRUE
        ),
        label = state_popup,
        labelOptions = labelOptions(
            style = list("font-weight" = "normal", padding = "3px 8px"),
            textsize = "15px",
            direction = "auto"
        )
    ) %>%
    addLegend(
        pal = pal,
        values = ~ per100k,
        opacity = 0.7,
        title = "Incidents",
        position = "bottomright"
    )
```
**Please hover over a state to show the details in a popup**


### Victims by State

First of all, I will create a new variable that adds up the number of killed and injured people for each incident.

```{r}
gun$victims <- gun$n_killed + gun$n_injured
```


#### Severity of Incidents

As you can see in the table above, Alaska actually has a relatively low number of victims per incident (0.44). However, as shown in the graph, Louisiana also ranks high with regards to this ratio.

```{r}
VictimsByState <-
    gun %>% group_by(state) %>% summarize(
        sumVic = sum(victims),
        sumInj = sum(n_injured),
        sumDeath = sum(n_killed),
        PercDeath = round(sumDeath / sumVic, 2),
        sumIncidents = n(),
        vicPerInc = round(sumVic / sumIncidents, 2)
    )
head(VictimsByState)
```

```{r}
VictimsByState %>% 
    filter(vicPerInc>0.8) %>%
    ggplot(aes(x=reorder(state, -vicPerInc), y=vicPerInc)) + 
    geom_bar(stat='identity', fill='red') +
    labs(x='State', y='Victims per incidents') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```


#### Victims relative to the state population sizes

In the graph below, which again uses the color codes to show danger levels, you can see that Louisiana is the most dangerous state when it comes down to the number of victims relative to the population size. This is no surprise, as the state has a high relative number of incidents and also ranks high regarding the victims per incident ratio. Alaska drops to 9th place, which is due to the relatively low number of victims per incident.


```{r}
VictimsByState <- left_join(VictimsByState, statesPop, by = "state")
VictimsByState$Per100000 <-
    round((VictimsByState$sumVic / VictimsByState$POPESTIMATE2017) * 100000)


plotly::ggplotly(
    VictimsByState %>% 
        filter(state != "District of Columbia") %>%
        ggplot(aes(
            x = reorder(state, Per100000),
            y = Per100000,
            fill = Per100000,
            text = state
        )) +
        geom_bar(stat = 'identity') + 
        coord_flip() +
        labs(x = '', y = 'Victims per 100,000 inhabitants') + 
        scale_fill_gradient(low ="yellow", high = "red") +
        theme(legend.position = "none"),tooltip = c("text", "y")
)
```
**Hovering over the bars shows a label with the relative Number of Victims for a State**


#### An interactive map of victims by state with Leaflet

Making a leaflet of these figures is now easy, as I can reuse most code. All I need to do is replace the incident figures by the victim numbers.

```{r}
addPer100k <- addPer100k %>% select(id, name)
addPer100k <-
    left_join(addPer100k,
              VictimsByState %>% select(state, Per100000) %>% rename(name = state),
              by = "name")
addPer100k$Per100000[is.na(addPer100k$Per100000)] <- 0
states$per100k <- addPer100k$Per100000

bins <- c(0, 25, 50, 75, 100, Inf)
pal <- colorBin("YlOrRd", domain = states$per100k, bins = bins)

state_popup <- paste0(
    "<strong>State: </strong>",
    states$NAME,
    "<br><strong>Victims per 100,000 inhabitants </strong>",
    states$per100k
) %>% lapply(htmltools::HTML)

leaflet(data = states) %>%
    setView(lng = -96, lat = 37.8, zoom = 4) %>%
    addProviderTiles("MapBox",
                     options = providerTileOptions(
                         id = "mapbox.light",
                         accessToken = Sys.getenv('MAPBOX_ACCESS_TOKEN')
                     )) %>%
    addPolygons(
        fillColor = ~ pal(per100k),
        weight = 2,
        opacity = 1,
        color = "white",
        dashArray = "3",
        fillOpacity = 0.7,
        highlight = highlightOptions(
            weight = 5,
            color = "#666",
            dashArray = "",
            fillOpacity = 0.7,
            bringToFront = TRUE
        ),
        label = state_popup,
        labelOptions = labelOptions(
            style = list("font-weight" = "normal", padding = "3px 8px"),
            textsize = "15px",
            direction = "auto"
        )
    ) %>%
    addLegend(
        pal = pal,
        values = ~ per100k,
        opacity = 0.7,
        title = "Victims",
        position = "bottomright"
    )
```
**Please hover over a state to show the details in a popup**


### Incidents by city

It seems likely that a lot of violence occurs in the big cities of the states. For instance, I expected to see a lot incidents in the big cities of Illinois (Chicago) and Louisiana (New Orleans). The biggest city in Alaska (Anchorage) is unlikely to appear in the list of cities with the highest absolute numbers of incidents as the city is relatively small.

```{r}
incidentsByCity <- 
    gun %>% 
    select(city_or_county, state) %>% 
    rename(city=city_or_county) %>% 
    group_by(city, state) %>% 
    summarize(cityIncidents=n())
```

New York City did not seem to be recorded as a whole in the dataset. It turns out that the boroughs are recorded separately.

```{r}
incidentsByCity[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York',]
```

Before I continue, I am adding New York as a city.

```{r}
sumNewYork <- sum(incidentsByCity$cityIncidents[(incidentsByCity$city %in% c('Brooklyn', 'Bronx', 'Queens', 'Staten Island','New York (Manhattan)')) & incidentsByCity$state=='New York'])

NewYork <- data.frame(city='New York', state='New York', cityIncidents=sumNewYork)
incidentsByCity <- as.tibble(rbind(as.data.frame(incidentsByCity), NewYork))
```

```{r}
incidentsByCity %>% 
    top_n(20, wt=cityIncidents) %>%
    ggplot(aes(x=reorder(city, cityIncidents), y=cityIncidents)) + 
    geom_bar(stat='identity', fill='red') +        
    labs(x='City', y='Number of incidents') + 
    coord_flip()
```

Although I expected Chicago to have a lot of incidents, I am still surprised to see that it holds the ‘number one spot’ by a huge margin! In the next section, I will relate those absolute numbers to the population by city.


#### Incidents relative to the City population size

```{r}
fileUrl3 <-
    "https://raw.githubusercontent.com/Haykall/superhero-API/master/uscitiesv1.4.csv"
download.file(fileUrl3,
              destfile = "C:/Users/onur-/Downloads/Homepage/Portfolio/Gun violence in the US/uscitys.csv")

citiesPop <- read_csv("uscitys.csv", col_types = cols())
citiesPop <-
    citiesPop %>% 
    select(city, state_name, population_proper) %>% 
    rename(state =state_name, population = population_proper) %>% 
    filter(population > 600000)
```

As this is a large file, I decided to only focus on the bigger cities. Selecting all cities with at least 600,000 inhabitant just includes Baltimore; the number 2 in absolute numbers of incidents.

```{r}
citiesPop <-left_join(citiesPop, incidentsByCity, by=c("city", "state"))
citiesPop$Per100000 <- round((citiesPop$cityIncidents/citiesPop$population)*100000)
citiesPop$citystate <- str_c(citiesPop$city, " - " ,citiesPop$state)

incidentsByState <- incidentsByState %>% rename(state_avg=Per100000)
citiesPop <- left_join(citiesPop, incidentsByState %>% select(state, state_avg), by="state")

citiesPop1 <- 
    citiesPop %>% 
    select(citystate, Per100000, state, state_avg) %>% 
    rename(city_avg=Per100000)

gathercols <- c("city_avg", "state_avg")
CitiesStatesLong <- tidyr::gather(citiesPop1, city_or_state, per100k, gathercols)
citiesTop20 <- 
    CitiesStatesLong %>% 
    filter(city_or_state=='city_avg') %>% 
    arrange(per100k) %>% 
    top_n(20, wt=per100k)

Top20names <- citiesTop20$citystate
CitiesStatesLong <- CitiesStatesLong[CitiesStatesLong$citystate %in% Top20names,]
```

As you can see, the city averages of the cities with high incident numbers are much higher than their associated state averages. At the bottom of this list (not displayed) a few cities have lower numbers than the state , but generally cities are more dangerous than the country side.

```{r}
ggplot(
    CitiesStatesLong, aes(x=factor(citystate), y=per100k, fill=city_or_state)) +
    geom_bar(stat="identity", position = position_dodge2(reverse=TRUE, padding=0.1)) + 
    coord_flip() +
    scale_fill_manual(values = c("state_avg"="yellow", "city_avg"="red")) +
    scale_x_discrete(limits=Top20names) + 
    labs(y='Incidents per 100,000 inhabitants', x="")
```


## Incidents with highest numbers of victims

Below, I am displaying the 13 incidents with the highest numbers of victims. Although any single victim is one too many, the shooting in Las Vegas was by far the worst incident with over 470 victims.

```{r}
Top10 <- 
    gun %>% 
    select(incident_id, date, n_killed, n_injured, victims, location_description, city_or_county, state, latitude, longitude) %>% 
    rename(Incident_Id=incident_id, Date=date, Killed=n_killed, Injured=n_injured, Victims=victims, Location=location_description, City=city_or_county) %>%
    arrange(desc(Victims)) %>% 
    top_n(n=13, wt=Victims)
```

```{r}
kable(Top10 %>% select(-longitude, -latitude))
```


### An interactive map of the incidents with highest numbers of victims

```{r}
TopMap <- Top10 %>% select(latitude, longitude, Victims, City, Location)

labels <- paste0("<strong>City: </strong>", TopMap$City, 
                 "<br><strong>Location: </strong>", TopMap$Location,
                 "<br><strong>Victims </strong>", TopMap$Victims) %>% lapply(htmltools::HTML)

leaflet(TopMap) %>%
        setView(lng=-96, lat=37.8, zoom=4) %>%
        addTiles() %>%
        addProviderTiles("CartoDB.Positron") %>%
        addCircleMarkers(~longitude, ~latitude, color = "red", radius=~sqrt(Victims), label = labels)
```
**Please hover over an incident to show a label with the City, Location and the Number of Victims**


## Analyzing the Incident Characteristics

As you can see below, the ‘incident_characteristics` field actually holds a lot in information. It seems as if the person who entered the info had the choice to select multiple prescribed categories. These are separated by ’||’ in the incident_characteristics column. The first two incidents are both describes by 4 categories, but this varies among the data.

```{r}
head(gun$incident_characteristics,2)
```

The function ```cSplit``` of package ```splitstackshape``` works perfectly to split this column and store all characteristics as separate observations in a new dataframe. Besides the incident_id, I am also keeping the state and city, as I also want to see the specifics for for instance Alaska and Baltimore.


```{r}
#replacing "||" with "|" as both separators are used
gun$incident_characteristics <- gsub("\\|\\|", "|", gun$incident_characteristics)

IncCharac <- splitstackshape::cSplit(
    gun %>% 
        select(incident_id, state, city_or_county, incident_characteristics), 'incident_characteristics', sep =  '|', direction="long")

numCat <- round(nrow(IncCharac)/nrow(gun),1)
cat('On average, there are', numCat, 'incident categories specified per incident')
```

Below, you can see that the first two incidents are now split nicely into 8 observations.


```{r}
kable(head(IncCharac,8))
```


### Incident categories in the US

As you can see, there are lots of different categories. These do not necessarily all involve shots being fired. I for instance assume that ‘Non-Shooting Incident’ means that people have just threatened to shoot, or possible have used their gun as a striking weapon.

```{r}
IncCharac %>% 
    count(incident_characteristics) %>% 
    top_n(30, wt=n) %>%
    ggplot(aes(x=reorder(incident_characteristics, n), y=n)) +
    geom_bar(stat='identity', fill='red') +
    coord_flip() + 
    labs(x='Incident Category', y='number of incidents')
```

As the first 4 categories seem overall categories, I will display numbers on these 4 categories separately in the remainder of this section.

```{r}
overallCats <- c("Shot - Wounded/Injured", "Shot - Dead (murder, accidental, suicide)", "Non-Shooting Incident", "Shots Fired - No Injuries")
```

```{r}
TableOverallCats <- 
    IncCharac %>% 
    filter(incident_characteristics %in% overallCats) %>% 
    count(incident_characteristics)

cat('For', round((sum(TableOverallCats$n)/nrow(gun))*100), 'percent of incidents, an overall category is specified')
```

In my opinion, this is close enough to 100%. I am going to assume that the distribution over those 4 categories for the remaining 5% of incidents is roughly the same.


#### Comparing the main incident categories by state

First of all, I am going to see if there are any significant differences when comparing those categories for the three states with the highest incident rates with the distribution within the US overall.

```{r}
coloursShot <- c("Shot - Wounded/Injured"="orange", "Shot - Dead (murder, accidental, suicide)"="red", "Non-Shooting Incident"="green", "Shots Fired - No Injuries"="yellow")

#creating a function to vary the x-axis scale (next plot uses same graph with different scale)
usCats <- function(fixedX=0.5){
   IncCharac %>% filter(incident_characteristics %in% overallCats) %>%
   count(incident_characteristics) %>%
   ggplot(aes(x=reorder(incident_characteristics, n), y=n/sum(n), fill=factor(incident_characteristics))) +
   geom_bar(stat='identity', width = 0.5) + scale_fill_manual(values = coloursShot) +
   theme(legend.position="none") + coord_flip(ylim = c(0, fixedX)) + labs(x="", y='US overall') +
   scale_y_continuous(labels=percent)
}

#creating a function to create plots by state
stateCats <- function(stateName){
   IncCharac %>% filter(state==stateName & incident_characteristics %in% overallCats) %>%
   count(incident_characteristics) %>%
   ggplot(aes(x=reorder(incident_characteristics, n), y=n/sum(n), fill=factor(incident_characteristics))) +
   geom_bar(stat='identity', width = 0.5)+ scale_fill_manual(values = coloursShot) +
   theme(legend.position="none") + coord_flip(ylim = c(0, 0.5)) + labs(x="", y=stateName) +
   scale_y_continuous(labels=percent)
}

usOverallCats <- usCats()
alaskaCats <- stateCats('Alaska')
delawareCats <- stateCats('Delaware')
louisianaCats <- stateCats('Louisiana')

grid.arrange(usOverallCats, alaskaCats, delawareCats, louisianaCats, ncol=1)
```

Regarding the states with relatively most incidents, we can say that compared to the US averages:
- In Louisiana, incidents are more severe with both higher percentages of incidents with people wounded and dead
- In Delaware, there were more incidents with people wounded, but also less incidents with people killed
- In Alaska, incidents were less severe with less incidents with wounded people and more incidents with no-shooting


#### Comparing the main incident categories by city

```{r}
#creating a function to create plots by city
cityCats <- function(cityName){
   IncCharac %>% 
        filter(city_or_county==cityName & incident_characteristics %in% overallCats) %>%
        count(incident_characteristics) %>%
        ggplot(aes(x=reorder(incident_characteristics, n), y=n/sum(n), fill=factor(incident_characteristics))) +
        geom_bar(stat='identity', width = 0.5) + 
        scale_fill_manual(values = coloursShot) +
        theme(legend.position="none") + 
        coord_flip(ylim = c(0, 0.8)) + 
        labs(x="", y=cityName) +
        scale_y_continuous(labels=percent)
}

usOverallCats <- usCats(0.8)
baltimoreCats <- cityCats('Baltimore')
washingtonCats <- cityCats('Washington')
chicagoCats <- cityCats('Chicago')

grid.arrange(usOverallCats, baltimoreCats, washingtonCats, chicagoCats, ncol=1)
```

Regarding the cities with relatively most incidents, we can say that:
- In Baltimore the percentage of incidents with people wounded is high (20% higher than US average). In addition, there are very few incidents with shots fired and people not injured.
- In Washington DC, despite having a high average of incidents per 100,000 incidents, incidents are on average less severe than the US average. Relatively, there are more Non-Shooting incidents, and less incidents with deadly victims.
- In Chicago, the percentage of incidents with wounded people is very high. On the other hand, the percentage of incidents with deadly victims is a bit lower than the US average.


### Opportunities offered by the smaller incident categories

The incident categories offer a wealth of information, and actually 110 categories are recorded in the database. Below, I am displaying the first 10 categories alphabetically.

```{r}
CatTable <- IncCharac %>% count(incident_characteristics)
kable(CatTable[1:10,])
```


#### Sub categories of the deadly incidents

The overall category “Shot - Dead (murder, accidental, suicide)” for instance consists of multiple sub categories.

```{r}
Dead <- c("Shot - Dead (murder, accidental, suicide)", "Accidental Shooting - Death", "Murder/Suicide", "Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location)", "Suicide^", "Attempted Murder/Suicide (one variable unsuccessful)")
CatTable %>% filter(incident_characteristics %in% Dead)
```

I assume that murder/suicide means that the person who committed suicide first killed other people. If I have the time, I will check if this list of sub categories is complete, and include a visualization. Another sub category that is worth investigating is “Suicide”, as for instance the article on Alaska suggests a high suicide rate in this state.


#### Terrorism, gang involvement, and drug involvement

Other sub categories that can be visualized well with leaflet maps are terrorism, gang, and drugs involvement.

```{r}
Involvement <- c("Terrorism Involvement", "Drug involvement", "Gang involvement")
kable(CatTable %>% filter(incident_characteristics %in% Involvement))
```


**Drug involvement:**

I chose to first make a map of drug involvement, as this forces me to use clusting of incidents to make the map readable. After having experimented with zoom levels, I decided to zoom in on the Baltimore/Washington DC area, as especially Baltimore had a high number drug related incidents. A quick internet search learned me that this is correct, see also Baltimore Is the U.S. <a href="https://abcnews.go.com/US/story?id=92699&page=1/" target="_blank">Heroin Capital</a>.

Of course, you can zoom out but this has the disadvantage that areas become so large that multiple cities are aggregated. Swiping through the map while keeping the zoom level the same gives the best overview on the “city” level in my opinion. Clicking on the clusters automatically zooms in on that particular area, and breaks down a cluster into smaller clusters. You can continue to do so until you see all the individual incidents.

```{r message=FALSE, warning=FALSE}
Drugs <- IncCharac %>% filter(incident_characteristics=="Drug involvement")
Drugs <- left_join(Drugs, gun %>% select(incident_id, longitude, latitude, location_description, victims), by="incident_id")

labels <- paste0("<strong>City: </strong>", Drugs$city_or_county, 
                 "<br><strong>Location: </strong>", Drugs$location_description,
                 "<br><strong>Victims </strong>", Drugs$victims) %>% lapply(htmltools::HTML)

leaflet(Drugs) %>%
        setView(lng=-76.6, lat=39.3, zoom=8) %>%
        addTiles() %>%
        addProviderTiles("CartoDB.Positron", group="Light map") %>%
        addProviderTiles("Esri.NatGeoWorldMap", group= "Green map") %>%
        addScaleBar %>%
        addMarkers(~longitude, ~latitude,
                label = labels,  
                clusterOptions = markerClusterOptions()) %>%
    addLayersControl(baseGroups = c("Green map", "Light map"), options = layersControlOptions(collapsed = FALSE))
```


**Gang involvement:**

The map in this section is zoomed in on Chicago, as this city has by far the most incidents with gang involvement. Please also have a look at this article: <a href="https://www.therichest.com/rich-list/most-shocking/the-most-gang-infested-cities-in-america/" target="_blank">The 6 Most Gang Infested Cities in America</a>

> In recent years Chicago has made a dramatic leap to overcome Los Angeles as America’s gang capital with a staggering 150,000 gang members and become one of the most violent cities in the country.

Clicking on the clusters automatically zooms in on that particular area, and breaks down a cluster into smaller clusters. You can continue to do so until you see all the individual incidents.


```{r message=FALSE, warning=FALSE}
Gangs <- IncCharac %>% filter(incident_characteristics=="Gang involvement")
Gangs <- left_join(Gangs, gun %>% select(incident_id, longitude, latitude, location_description, victims), by="incident_id")

labels <- paste0("<strong>City: </strong>", Gangs$city_or_county, 
                 "<br><strong>Location: </strong>", Gangs$location_description,
                 "<br><strong>Victims </strong>", Gangs$victims) %>% lapply(htmltools::HTML)

leaflet(Gangs) %>%
        setView(lng=-87.6, lat=41.9, zoom=6) %>%
        addTiles() %>%
        addProviderTiles("CartoDB.Positron", group="Light map") %>%
        addProviderTiles("Esri.NatGeoWorldMap", group= "Green map") %>%
        addScaleBar %>%
        addMarkers(~longitude, ~latitude,
                label = labels,  
                clusterOptions = markerClusterOptions()) %>%
    addLayersControl(baseGroups = c("Green map", "Light map"), options = layersControlOptions(collapsed = FALSE))
```


**Terrorism**

The 30 terrorism incidents are shown on the map below. Hovering over an incident shows a label with the city, location, and the number of victims.

```{r message=FALSE, warning=FALSE}
Terror <- IncCharac %>% filter(incident_characteristics=="Terrorism Involvement")
Terror <- left_join(Terror, gun %>% select(incident_id, longitude, latitude, location_description, victims), by="incident_id")

labels <- paste0("<strong>City: </strong>", Terror$city_or_county, 
                 "<br><strong>Location: </strong>", Terror$location_description,
                 "<br><strong>Victims </strong>", Terror$victims) %>% lapply(htmltools::HTML)

leaflet(Terror) %>%
        setView(lng=-96, lat=37.8, zoom=4) %>%
        addTiles() %>%
        addProviderTiles("CartoDB.Positron") %>%
        addCircleMarkers(~longitude, ~latitude, color = "red", radius=~sqrt(victims), label = labels)
```