This project looks at police stops in Cincinnati, Columbus, and Cleveland, Ohio, as part of Stanford University's Open Policing Project. Stanford developed the base R script for this, including the veil of darkness test, and gathered some of the records. A team of Ohio journalists gathered additional records and worked for eight months to add additional analysis, such as the point-in-polygon part, and to conduct interviews, make graphics, etc.
This is NOT an official study, nor does it claim to be. But everything in here is accurate to the best of our knowledge, after checking with multiple sources at the time of publication.
A few caveats: First, we are comparing the population of each city, not the drivers in each city, which will be slightly different. The Ohio Dept. of Public Safety, which houses the BMV, does not break down drivers by race, so Census data was the best proxy we could find.
Second, this doesn’t account for people driving into and out of the city. Getting a ticket there doesn’t necessarily mean you live there.
A note on race: Race is counted two different ways. For a full discussion of this, see The Curious Journalist's Guide to Data.
Essentially, "Hispanic" is considered an "ethnicity" and not a "race." So you can be white and Hispanic or black and Hispanic. (This also points to the stupidity of treating someone differently because of their race, which of course is the whole story, as we can't even define it really.)
For Cincinnati and Columbus, the actual number of Hispanics was quite low, so we went with the definition "non-Hispanic white." In Cleveland, where the population is 11 percent Hispanic, police didn't mark the race of anyone with a ticket as "Hispanic." (Though curiously they did mark 200 as "Arab.") So to more accurately portray Cleveland, we went with "White" including Hispanics.
This doesn’t really affect our analysis though because:
there were very few Hispanic Blacks in the census
it's not possible to tell from the data who is Hispanic anyway.
So this data probably does not really tell us anything about Hispanics, and it's not possible to draw any conclusions about them from these results.
Other caveats:
We are talking about census block groups here, so the smaller the number counted, the larger the margin of error.
Also, geocoding has a block margin of error, since the Cincinnati police code stops as 12XX Main Street. This isn’t a huge error, but it gets larger when we’re talking about a small area like a census block group.
Every large dataset will have a few errors, and ours was no exception. We had to exclude a small subset of stops that were not properly geocoded, and another small subset with missing information.
The veil of darkness test assumes that drivers behave the same way at night and during the day. There is some evidence that this is not always the case. See, for example, here. See the Veil of Darkness full paper for a more in-depth analysis.
Thank you to D. Kahle and H. Wickham for ggmap: Kahle, D. and Wickham, H. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161.
Loading libraries and data
## Libraries to include
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(dplyr)
library(lutz)
library(suncalc)
library(splines)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tabulizer)
library(pdftables)
library(hms)
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
library(DT)
## Load the data
# Replace the path below with the path to where the data lives on the computer you are running this script on
#data_path <- "~/Code/SunlightStops/SunlightStopsStory/oh_cincinnati_2019_02_25.csv"
#data_path <- "~/Code/SunlightStops/SunlightStopsStory/CinStopsByBlock2017.csv"
#CINstops <- read_csv(data_path)
#table(CINstops$subject_race)
CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")
#CINstops <- rio::import("CinStopsWithRaceByBlock.xlsx")
# Additional data and fixed values we'll be using
population_2016 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
CINnum_people = c(5418, 127615, 9554, 8489, 145512) #This is where I got this data: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
) %>%
mutate(subject_race = as.factor(subject_race))
#Have to add in the latitude and longitude of wherever you are (this is slightly different for Columbus)
CINcenter_lat <- 39.1031
CINcenter_lng <- -84.5120
What this dataset offers:
colnames(CINstops)
## [1] "raw_row_number" "date"
## [3] "time" "location"
## [5] "lat" "lng"
## [7] "neighborhood" "beat"
## [9] "subject_race" "subject_sex"
## [11] "officer_assignment" "type"
## [13] "disposition" "arrest_made"
## [15] "citation_issued" "warning_issued"
## [17] "outcome" "reason_for_stop"
## [19] "vehicle_make" "vehicle_model"
## [21] "vehicle_registration_state" "vehicle_year"
## [23] "raw_race" "raw_action_taken_cid"
## [25] "raw_field_subject_cid"
How many stops do we have in our dataset?
nrow(CINstops)
## [1] 315281
summary(CINstops)
## raw_row_number date time
## Length:315281 Length:315281 Length:315281
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## location lat lng neighborhood
## Length:315281 Min. :-84.69 Min. :-97.60 Length:315281
## Class :character 1st Qu.: 39.11 1st Qu.:-84.54 Class :character
## Mode :character Median : 39.13 Median :-84.52 Mode :character
## Mean : 38.23 Mean :-83.62
## 3rd Qu.: 39.15 3rd Qu.:-84.50
## Max. : 45.53 Max. : 39.27
## NA's :261 NA's :261
## beat subject_race subject_sex
## Length:315281 Length:315281 Length:315281
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## officer_assignment type disposition arrest_made
## Length:315281 Length:315281 Length:315281 Mode :logical
## Class :character Class :character Class :character FALSE:269176
## Mode :character Mode :character Mode :character TRUE :45879
## NA's :226
##
##
##
## citation_issued warning_issued outcome reason_for_stop
## Mode :logical Mode :logical Length:315281 Length:315281
## FALSE:132897 FALSE:258896 Class :character Class :character
## TRUE :182158 TRUE :56159 Mode :character Mode :character
## NA's :226 NA's :226
##
##
##
## vehicle_make vehicle_model vehicle_registration_state
## Length:315281 Length:315281 Length:315281
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## vehicle_year raw_race raw_action_taken_cid
## Min. : 1932 Length:315281 Length:315281
## 1st Qu.: 1998 Class :character Class :character
## Median : 2002 Mode :character Mode :character
## Mean : 2002
## 3rd Qu.: 2006
## Max. :16998
## NA's :2922
## raw_field_subject_cid
## Length:315281
## Class :character
## Mode :character
##
##
##
##
The date is stored as a character here, not a date, so we're changing it to a date format.
CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")
max(CINstops$date)
## [1] "2018-05-28"
Putting times in a time vector.
library(hms)
CINstops$time<-parse_hms(CINstops$time)
Looking at stops per year
CINstops %>%
group_by(year(date)) %>%
summarize(Total=n())
## # A tibble: 10 x 2
## `year(date)` Total
## <dbl> <int>
## 1 2009 54630
## 2 2010 52664
## 3 2011 43446
## 4 2012 37876
## 5 2013 25646
## 6 2014 25440
## 7 2015 25372
## 8 2016 21733
## 9 2017 20415
## 10 2018 8059
CINstops <- CINstops %>%
filter(year(date) < 2018, year(date) > 2008) #Cincy only has part of 2018 and the data from before 2009 is spotty so we filtered that out
If you just want to look at minor stops, first look at the reasons given:
CINstops %>%
group_by(reason_for_stop) %>%
summarize(Total=n())
## # A tibble: 146 x 2
## reason_for_stop Total
## <chr> <int>
## 1 <NA> 276001
## 2 --- DISCON CALL 2
## 3 ABDUCTION 1
## 4 ACCIDENT NO INJURIES 53
## 5 ACCIDENT WITH INJURIES 1
## 6 ACCIDENT-VEH INTO BLDG 1
## 7 ADVISED COMPLAINT 22
## 8 ADVISED INCIDENT 9
## 9 ANIMAL COMPLAINT 7
## 10 ASSAULT IN PROGRESS 13
## # … with 136 more rows
#unique(CINstops$reason_for_stop)
Taking out anything that doesn’t sound like a traffic stop:
“911 DISCON CALL” “ACCIDENT NO INJURIES” “ACCIDENT WITH INJURIES” “ACCIDENT-VEH INTO BLDG” “ADVISED COMPLAINT”
“ADVISED INCIDENT” “ANIMAL COMPLAINT” “ASSAULT IN PROGRESS” “ASSAULT J/O-NO INJS” “ASSAULT REPORT”
“ATTEMPT TO LOCATE” “AUTO ACC INJ-POL ONL” “AUTO ACCIDENT - NO I” “AUTO ACCIDENT INJURI” “AUTO CRASH INTO BUIL”
“AUTO THEFT IN PROGRE” “AUTO THEFT J/O” “AUTO THEFT OR RECOVE” “AUTO THEFT REPORT” “B&E REPORT”
“BACKUP REQUIRED” “BE IN PROG/JO” “BREAKING & ENTERING IN PROGRESS” “CAR IN VIOLATION” “CELL DISCON OR SICAL”
“CHILD-VICTIM OF CRIM” “COMPLAINT OF PANHAND” “COMPLAINT OF PROSTIT” “CRIMINAL DAMAGING” “CRIMINAL DAMAGING IN”
“CRIMINAL DAMAGING RE” “CURFEW VIOLATION” “CUTTING IN PROG/JO” “DEFAULT INCIDENT TYP” “DIRECT ALARM”
“DOMESTIC VIOLENCE” “DOMESTIC VIOLENCE J/O” “DRUG ACTIVITY”
“DRUG COMPLAINT NOT I” “FAMILY TROUBLE” “FAMILY TROUBLE - NON VIOLENT” “FIGHT IN PROGRESS” “FIRE POLICE REQUEST”
“FIREWORKS” “GENERAL INFO BROADCAST” “GENERAL INFORMATION-” “HEROIN OD” “KEYS LOCKED IN AUTO”
“LARGE GROUP ASSAULT” “MAINTENANCE RUN” “MEDIA REQUEST FOR IN” “MEET OTHER LAW ENFORCEMENT/AGENCY” “MEET POLICE OFFICER-”
“MENACING IN PROG/JO” “MENACING J/O” “MENTALLY IMPAIRED NON VIOL” “MENTALLY IMPAIRED VIOL” “MENTALLY IMPAIRED-NO”
“MISSING PERSON REPOR” “NEIGHBOR TROUBLE” “NEIGHBOR TROUBLE - NON VIOL” “NOISE COMPLAINT” “NON-RESIDENTIAL BURG”
“NONRESIDENTIAL ALARM DROP” “OFF DUTY POLICE DETAILS” “OFFICER BACKUP/NON ASSISTANCE” “PANHANDLER” “PERS DOWN POL ONLY” “PERSON ARMED W/ GUN” “PERSON ARMED W/WEAPON” “PERSON DOWN - UNK REASON” “PERSON DOWN AND OUT”
“PERSON W/ WEAPON (OT” “PRISONER” “PROSTITUTE COMPLAINT” “PROWLER” “RAPE J/O OR IN PROGRESS” “REPO” “RESIDENTIAL ALARM DROP” “RESIDENTIAL BURGLAR” “ROBBERY IN PROGRESS-”
“ROBBERY J/O OR IN PROGRESS” “ROBBERY REPORT” “SERVICE” “SEX OFFENSE OTHER THAN RAPE” “SEX OFFENSES OTHER T”
“SHOOTING ALREADY OCC” “SHOOTING JO POL ONLY” “SHOT SPOTTER ACTIVITY” “SHOTS FIRED” “SPECIAL DETAIL”
“STALKING J/O OR IN PROGRESS” “STATION RUN” “STRUCTURE FIRE” “TEST INCIDENT” “THEFT IN PROG/JO” “THEFT J/O OR IN PROGRESS” “THEFT REPORT” “TRANSFERRED CALL” “TRESPASSER” “TRESPASSERS”
“TOWED VEH - FOR RECORDS USE ONLY” “WARRANT SERVICE”
“HOLDUP/PANIC ALARM”
The majority of stops have no reason listed at all. We kept stops whose reason matched the patterns below, or whose reason was missing:
CINstops <- CINstops %>%
filter(
str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))
The reasons kept: DIRECTED PATROL, DIRECTED PATROL - VEHICLE, DIRECTED PATROL - WALKING, DIRECTED PATROL-WALK, DISORD GROUP, DISORDERLY CROWD, DISORDERLY PERSON, HAZARD TO TRAFFIC/PEDESTRIAN, INV DRUG COMPLAINT, INV POSSIBLE WANTED, INV POSSIBLE WANTED PERSON, INV UNKNOWN TROUBLE, INVESTIGATE, INVESTIGATION, JUVENILE COMPLAINT, PARKING VIOLATION, QUERY, SUSPECT STOP, SUSPICIOUS PERSONS OR SITUATION, SUSPICIOUS SITUATION, TRAFFIC HAZARD, TRAFFIC PURSUIT, TRAFFIC STOP, TRAFFIC TIE, UNKNOWN TROUBLE, plus any stop with no reason listed.
Looking at arrest rates within each race.
ArrestsByRace <- CINstops %>%
group_by(subject_race, arrest_made) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(ArrestsByRace)
A plot of that.
ArrestsRateByRace <- na.omit(ArrestsByRace)
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=arrest_made))+
geom_col(position="dodge")
#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())
Looking at the arrest percentage overall
Arrest_Percentage_ByRace <- CINstops %>%
filter(arrest_made=="TRUE") %>%
group_by(subject_race) %>%
summarize(Total=n()) %>%
mutate(pct=Total/sum(Total))
datatable(Arrest_Percentage_ByRace)
Note: Cincy doesn’t have search data.
Filter out NAs in lat/lng
CINstops <- filter(CINstops, lng!="NA")
Counting Race
CINstops %>%
count(subject_race)
## # A tibble: 6 x 2
## subject_race n
## <chr> <int>
## 1 asian/pacific islander 1574
## 2 black 173446
## 3 hispanic 497
## 4 other 21
## 5 unknown 5866
## 6 white 124171
Plot of year and race
CINstops %>%
count(year = year(date), subject_race) %>%
ggplot(aes(x = year, y = n, color = subject_race)) +
geom_point() +
geom_line()
This creates each race's proportion of the population.
population_2016 <- population_2016 %>%
mutate(Proportion_Of_Population = CINnum_people / sum(CINnum_people)) #I added this to the population data frame
Stops percentage versus population percentage
StopRateDataTable <- CINstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>%
left_join(
population_2016,
by = "subject_race"
) %>%
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)
So it looks like blacks are 43 percent of the population though they are 57 percent of the stops.
The source for this disparity-index method is the Missouri Attorney General.
Using these numbers (population: 49% white, 43% black, 8% other; stops: 57% black, 41% white, 2% other):
White disparity index value (stop % divided by pop %): 41/49 = 0.8367347
Black disparity index value: 57/43 = 1.325581
The black stop rate is 1.58 times the white stop rate (1.325581/0.8367347).
Thus, accounting for the racial proportion of Cincinnati’s 2016 population, blacks were stopped at a rate 58% higher than whites.
41/49
## [1] 0.8367347
57/43
## [1] 1.325581
1.325581/0.8367347
## [1] 1.584231
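To reuse that arithmetic for Columbus and Cleveland below, here is a small helper function. (This is our addition, not part of the original Stanford script; the function name is ours.)
# Disparity ratio: a group's share of stops divided by its share of the
# population, compared to the same quantity for white drivers.
disparity_ratio <- function(group_stop_pct, group_pop_pct, white_stop_pct, white_pop_pct) {
  (group_stop_pct / group_pop_pct) / (white_stop_pct / white_pop_pct)
}
disparity_ratio(57, 43, 41, 49) # 1.584231, the same figure computed above for Cincinnati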
Filtering for vehicular stops.
CINstops <- CINstops %>% filter(type == "vehicular") #Note this is pretty much all stop data
Calculating the sunset and dusk times for Cincinnati for every date in our data.
(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark. We’ll explain why this is relevant in a moment.)
# Get timezone for Cincinnati
CINtz <- lutz::tz_lookup_coords(CINcenter_lat, CINcenter_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <-
CINstops %>%
mutate(
lat = CINcenter_lat,
lon = CINcenter_lng
) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = .,
keep = c("dawn","sunrise", "sunset", "dusk"),
tz = CINtz
) %>%
mutate_at(vars("dawn","sunrise","sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
dawn_minute = time_to_minute(dawn),
sunrise_minute = time_to_minute(sunrise),
date = ymd(str_sub(date, 1, 10))
) %>%
select(date, dawn, sunrise, sunset, dusk, ends_with("minute"))
sunset_times
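As a quick sanity check (our addition, not in the Stanford script), dusk time over the calendar should trace the familiar seasonal curve, with jumps at the daylight saving time changes:
ggplot(sunset_times, aes(x = date, y = dusk_minute)) +
  geom_point(size = 0.3)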
So, what is Cincy's inter-twilight period?
sunset_times %>%
filter(dusk == min(dusk) | dusk == max(dusk) | dawn == min(dawn) | dawn == max(dawn))
## date dawn sunrise sunset dusk sunset_minute
## 1 2010-06-15 05:40:41 06:12:51 21:06:57 21:39:07 1266
## 2 2017-06-15 05:40:41 06:12:51 21:07:01 21:39:11 1267
## 3 2010-11-06 07:44:28 08:12:29 18:33:30 19:01:30 1113
## 4 2012-06-15 05:40:41 06:12:52 21:07:07 21:39:18 1267
## 5 2016-06-26 05:42:55 06:15:05 21:09:19 21:41:30 1269
## 6 2009-06-14 05:40:41 06:12:50 21:06:40 21:38:49 1266
## 7 2012-12-06 07:15:05 07:44:53 17:16:45 17:46:32 1036
## 8 2016-06-15 05:40:41 06:12:52 21:07:06 21:39:17 1267
## 9 2009-12-06 07:14:54 07:44:41 17:16:45 17:46:32 1036
## 10 2011-06-15 05:40:41 06:12:51 21:06:51 21:39:00 1266
## 11 2010-12-07 07:15:31 07:45:20 17:16:44 17:46:32 1036
## 12 2015-12-07 07:15:16 07:45:04 17:16:44 17:46:32 1036
## 13 2017-06-14 05:40:41 06:12:50 21:06:38 21:38:47 1266
## 14 2011-12-07 07:15:18 07:45:06 17:16:44 17:46:32 1036
## 15 2009-06-15 05:40:41 06:12:52 21:07:02 21:39:13 1267
## 16 2012-06-14 05:40:41 06:12:50 21:06:45 21:38:54 1266
## 17 2010-12-06 07:14:41 07:44:27 17:16:46 17:46:32 1036
## 18 2009-06-27 05:43:14 06:15:23 21:09:20 21:41:30 1269
## 19 2014-12-06 07:14:39 07:44:25 17:16:46 17:46:32 1036
## 20 2016-12-06 07:15:03 07:44:51 17:16:45 17:46:32 1036
## 21 2012-06-26 05:42:55 06:15:06 21:09:19 21:41:30 1269
## 22 2012-06-27 05:43:19 06:15:28 21:09:20 21:41:30 1269
## 23 2017-12-06 07:14:50 07:44:37 17:16:45 17:46:32 1036
## 24 2009-06-26 05:42:51 06:15:01 21:09:19 21:41:30 1269
## 25 2011-12-06 07:14:28 07:44:13 17:16:47 17:46:32 1036
## 26 2010-06-27 05:43:07 06:15:17 21:09:20 21:41:30 1269
## 27 2014-06-15 05:40:41 06:12:51 21:06:56 21:39:06 1266
## 28 2013-06-14 05:40:41 06:12:50 21:06:39 21:38:48 1266
## 29 2015-06-27 05:43:01 06:15:11 21:09:20 21:41:30 1269
## 30 2016-06-14 05:40:41 06:12:50 21:06:44 21:38:53 1266
## 31 2013-06-27 05:43:13 06:15:22 21:09:20 21:41:30 1269
## 32 2017-06-27 05:43:12 06:15:21 21:09:20 21:41:30 1269
## 33 2014-12-07 07:15:29 07:45:18 17:16:44 17:46:32 1036
## 34 2016-06-27 05:43:18 06:15:27 21:09:20 21:41:30 1269
## 35 2013-06-15 05:40:41 06:12:51 21:07:01 21:39:12 1267
## 36 2011-06-27 05:43:01 06:15:12 21:09:20 21:41:30 1269
## 37 2015-06-15 05:40:41 06:12:51 21:06:50 21:39:00 1266
## 38 2014-06-27 05:43:07 06:15:16 21:09:20 21:41:30 1269
## 39 2013-12-06 07:14:52 07:44:39 17:16:45 17:46:32 1036
## dusk_minute dawn_minute sunrise_minute
## 1 1299 340 372
## 2 1299 340 372
## 3 1141 464 492
## 4 1299 340 372
## 5 1301 342 375
## 6 1298 340 372
## 7 1066 435 464
## 8 1299 340 372
## 9 1066 434 464
## 10 1299 340 372
## 11 1066 435 465
## 12 1066 435 465
## 13 1298 340 372
## 14 1066 435 465
## 15 1299 340 372
## 16 1298 340 372
## 17 1066 434 464
## 18 1301 343 375
## 19 1066 434 464
## 20 1066 435 464
## 21 1301 342 375
## 22 1301 343 375
## 23 1066 434 464
## 24 1301 342 375
## 25 1066 434 464
## 26 1301 343 375
## 27 1299 340 372
## 28 1298 340 372
## 29 1301 343 375
## 30 1298 340 372
## 31 1301 343 375
## 32 1301 343 375
## 33 1066 435 465
## 34 1301 343 375
## 35 1299 340 372
## 36 1301 343 375
## 37 1299 340 372
## 38 1301 343 375
## 39 1066 434 464
So we are looking at times between 5:16 p.m. and 9:41 p.m. We need to restrict to those times and take out the twilight period.
CIN_VODstops <-
CINstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute | minute < dawn_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
###Skipping this part
#minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove the ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute)
)
Filter to just that time
CIN_VODstops <- CIN_VODstops %>%
filter(time > hm("17:16"), time < hm("21:41")) #Earliest winter sunset and latest summer dusk in Cincy
If you want to ONLY look at highway stops
CIN_VODstops <- CIN_VODstops %>% filter(str_detect(location, 'I75|I71|I74|I 75|I 71|I 74'))
StopsByYearInDarkness <- CIN_VODstops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CIN_VODstops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
#View(TwilightTable)
These are the raw numbers.
TwilightTable <- TwilightTable %>%
mutate(others = `unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`unknown`+`asian/pacific islander`) %>%
group_by(year) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=others/total_stopped)
datatable(TwilightTable)
This looks at the percentage of black and white drivers stopped in daylight, and during the same clock-time window but in darkness because of the time change.
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
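The splines package loaded at the top is what Stanford's veil-of-darkness tutorial uses for a logistic regression version of this test: is a stopped driver more likely to be black after dark, holding clock time constant? A minimal sketch on the highway frame built above (the 6 spline degrees of freedom are an assumption on our part, not a tuned choice):
# Veil-of-darkness regression: darkness indicator plus a natural spline on
# minute of day, so we compare stops made at the same clock time.
vod_model_cin <- glm(
  is_black ~ is_dark + splines::ns(minute, df = 6),
  family = binomial,
  data = CIN_VODstops
)
summary(vod_model_cin)$coefficients["is_darkTRUE", ]
In this framing, a clearly negative is_darkTRUE coefficient (a smaller share of black drivers stopped once race is harder to see) is the pattern consistent with daylight stop decisions being influenced by race.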
Now, the same analysis, but with non-highway vehicular stops. (If you're stopping someone on foot, you can see that person in daylight or darkness.)
Bringing in Cincinnati stops again and getting the data to the format we want.
CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")
CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")
CINstops <- CINstops %>%
filter(year(date) < 2017, year(date) > 2008) %>%
filter(type == "vehicular")
CINstops$time<-parse_hms(CINstops$time)
CINstops <- CINstops %>%
filter(
str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))
CIN_VODstops <-
CINstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute | minute < dawn_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
###Skipping this part
#minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove the ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute)
)
Filter to just that time
CIN_VODstops <- CIN_VODstops %>%
filter(time > hm("17:16"), time < hm("21:41")) #Earliest winter sunset and latest summer dusk in Cincy
If you want to ONLY look at street stops
CIN_VODstops <- CIN_VODstops %>% filter(!str_detect(location, 'I75|I71|I74|I 75|I 71|I 74'))
StopsByYearInDarkness <- CIN_VODstops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CIN_VODstops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
#View(TwilightTable)
These are the raw numbers.
TwilightTable <- TwilightTable %>%
mutate(other = `unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`unknown`+`asian/pacific islander`) %>%
group_by(year) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
datatable(TwilightTable)
This looks at the percentage of black and white drivers stopped in daylight, and during the same clock-time window but in darkness because of the time change.
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
Download data from CensusReporter.org. From this URL: https://censusreporter.org/data/table/?table=B02001&geo_ids=16000US3915000,150|16000US3915000&primary_geo_id=16000US3915000 download the shapefile for all block groups. I saved it in my working directory.
Getting race info for each neighborhood from the census
#You'll have to set your file path here
CincyRaceInfoByLocation2 <- sf::st_read("acs2017_5yr_B02001_15000US390610047011.shp") %>%
mutate(
Total=B02001001,
White_Alone=B02001002,
Black_Alone=B02001003,
PctWhite = round((White_Alone/Total)*100, 1),
PctBlack = round((Black_Alone/Total)*100, 1)) %>%
slice(-n())
## Reading layer `acs2017_5yr_B02001_15000US390610047011' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390610047011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 298 features and 22 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -84.71216 ymin: 39.05188 xmax: -84.3454 ymax: 39.22502
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
Now we're taking a look at our Cincinnati neighborhood stops.
# We only need the columns with the latitude and longitude
CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")
CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")
CINstops <- CINstops %>%
filter(year(date) < 2017, year(date) > 2008)
CINstops$time<-parse_hms(CINstops$time)
# Check and eliminate the rows that don't have location information
CINstops <- CINstops %>%
filter(lat!="NA")
CINstops <- CINstops %>%
filter(
str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))
And filter to only neighborhoods and take out Antarctica. (A small number of stops may have had their lat and lng mixed up, which gave us points in Antarctica.)
#We need the street stops only here. Stopping someone on the highway is not really "in" a neighborhood
CINstops <- CINstops %>%
filter(!lat<1,!lng>1) %>%
filter(!str_detect(location,'I75|I71|I74|I 75|I 71|I 74'))
#filter(type=="vehicular", lat<46, lng>1)
Now, we can summarize the data by count and merge it back to the shape file and visualize it.
#Or, Getting Hamilton block groups (Cincinnati is entirely within Hamilton County)
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
##
## plot
options(tigris_class = "sf")
HamiltonBlockGroups <- block_groups("OH",county = "Hamilton")
CINstops_spatial <- CINstops %>%
st_as_sf(coords=c("lng", "lat"), crs="+proj=longlat") %>%
st_transform(crs=st_crs(HamiltonBlockGroups))
#Putting them altogether
CINStopsWithTracts <- st_join(HamiltonBlockGroups, CINstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#Vlookup
CINStopsWithTracts$GEOID <- paste0("15000US",CINStopsWithTracts$GEOID)
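For a quick visual check of the join, mirroring the Franklin County plot in the Columbus section below (our addition; the cosmetic choices are arbitrary):
ggplot(HamiltonBlockGroups) +
  geom_sf(color = "black") +
  geom_point(data = CINstops, aes(x = lng, y = lat), color = "red", alpha = 0.05) +
  theme_void() +
  labs(title = "Traffic stops in Cincinnati")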
This puts the stops and race percentage of each block together.
CINstops_With_Race_and_Tracts <- CINStopsWithTracts
CINstops_With_Race_and_Tracts$PctWhite = CincyRaceInfoByLocation2$PctWhite[match(CINStopsWithTracts$GEOID, CincyRaceInfoByLocation2$geoid)]
CINstops_With_Race_and_Tracts$PctBlack = CincyRaceInfoByLocation2$PctBlack[match(CINStopsWithTracts$GEOID, CincyRaceInfoByLocation2$geoid)]
CINstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CincyRaceInfoByLocation2$Total[match(CINStopsWithTracts$GEOID, CincyRaceInfoByLocation2$geoid)]
And finding the total number of stops per resident in neighborhoods that are at least 75 percent white or at least 75 percent black.
CINstops_With_Race_and_Tracts <- CINstops_With_Race_and_Tracts %>%
filter(PctWhite!="NA", PctBlack!="NA")
Mostly_White_Cincy <- CINstops_With_Race_and_Tracts %>%
filter(PctWhite>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
Mostly_Black_Cincy <- CINstops_With_Race_and_Tracts %>%
filter(PctBlack>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
Mixed_Cincy <- CINstops_With_Race_and_Tracts %>%
filter(!PctBlack>75|!PctWhite>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
mean(Mostly_White_Cincy$Per_Capita)
## [1] 0.5271393
mean(Mostly_Black_Cincy$Per_Capita)
## [1] 1.311483
This looks at the rate in ALL predominantly black areas and all predominantly white areas together, and figures out how much higher the rate is in black areas.
Total_People_In_Mostly_White_Cincy_Blocks <- sum(Mostly_White_Cincy$Total_In_Block)
Total_People_In_Mostly_Black_Cincy_Blocks <- sum(Mostly_Black_Cincy$Total_In_Block)
Total_People_In_Mixed_Blocks <-sum(Mixed_Cincy$Total_In_Block)
Total_Stops_In_Mostly_White_Cincy_Blocks <-sum(Mostly_White_Cincy$Total)
Total_Stops_In_Mostly_Black_Cincy_Blocks <-sum(Mostly_Black_Cincy$Total)
Total_Stops_In_Mixed_Blocks <-sum(Mixed_Cincy$Total)
Total_Stops_In_Mostly_White_Cincy_Blocks/Total_People_In_Mostly_White_Cincy_Blocks
## [1] 0.3992041
Total_Stops_In_Mostly_Black_Cincy_Blocks/Total_People_In_Mostly_Black_Cincy_Blocks
## [1] 0.8788373
CIN_white_rate <- Total_Stops_In_Mostly_White_Cincy_Blocks/Total_People_In_Mostly_White_Cincy_Blocks
CIN_black_rate <- Total_Stops_In_Mostly_Black_Cincy_Blocks/Total_People_In_Mostly_Black_Cincy_Blocks
(CIN_black_rate-CIN_white_rate)/CIN_white_rate
## [1] 1.201474
For the map, we want to look at each tract, and see the percentage stopped for black, white, and other. (And then Tableau has built-in race by block shading.)
CIN_Percent_Pulled_over_By_Block <- CINstops_With_Race_and_Tracts %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block),Total_In_Block=mean(Total_People_Per_Census_Block))
CIN_Percent_Pulled_over_By_Block <- st_set_geometry(CIN_Percent_Pulled_over_By_Block, NULL)
#Uncomment this to export it
#rio::export(CIN_Percent_Pulled_over_By_Block, "Cincinnati number of stops per census block.csv")
## Load the data
# Replace the path below with the path to where your data lives
data_path <- "oh_columbus_without_duplicates.csv" #Note: the Stanford data had a lot of duplicates, so we had to work with this data.
COLstops <- read_csv(data_path)
## Parsed with column specification:
## cols(
## raw_row_number = col_character(),
## date = col_date(format = ""),
## time = col_time(format = ""),
## location = col_character(),
## lat = col_double(),
## lng = col_double(),
## precinct = col_double(),
## zone = col_double(),
## subject_race = col_character(),
## subject_sex = col_character(),
## type = col_character(),
## arrest_made = col_logical(),
## citation_issued = col_logical(),
## warning_issued = col_logical(),
## outcome = col_character(),
## search_conducted = col_logical(),
## reason_for_stop = col_character()
## )
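(The Cleveland section below drops exact duplicate rows with dplyr::distinct(); one plausible way a deduplicated Columbus file like this could be rebuilt from a raw export is sketched here. The input file name is a placeholder, not an actual path.)
# Sketch only: point read_csv at your raw Columbus export, keep one copy of
# each identical row, and write out the deduplicated file used above.
# raw_columbus <- read_csv("your_raw_columbus_export.csv")
# raw_columbus %>%
#   distinct() %>%
#   write_csv("oh_columbus_without_duplicates.csv")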
table(COLstops$subject_race)
##
## asian/pacific islander black hispanic
## 2633 56001 3288
## other/unknown white
## 4481 61754
# Additional data and fixed values we'll be using
population_2016 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
COLnum_people = c(42308, 230963, 48331, 33997, 481439) #https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
) %>%
mutate(subject_race = as.factor(subject_race))
COLcenter_lat <- 39.9612
COLcenter_lng <- -82.9988
The columns we have:
colnames(COLstops)
## [1] "raw_row_number" "date" "time"
## [4] "location" "lat" "lng"
## [7] "precinct" "zone" "subject_race"
## [10] "subject_sex" "type" "arrest_made"
## [13] "citation_issued" "warning_issued" "outcome"
## [16] "search_conducted" "reason_for_stop"
How many stops do we have in our dataset?
nrow(COLstops)
## [1] 128157
What date range does our data cover?
min(COLstops$date)
## [1] "2012-01-01"
max(COLstops$date)
## [1] "2016-12-30"
In Cincinnati we filtered by date to get whole years. That is not necessary for the Columbus data.
unique(COLstops$type) # this shows the distinct stop types
## [1] "vehicular"
NOTE: COLstops are ONLY vehicular, unlike Cincinnati.
This looks at the stop rate versus the population
StopRateDataTable <- COLstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>%
left_join(
population_2016,
by = "subject_race"
) %>%
mutate(stop_rate = n / sum(n), pop_rate = COLnum_people/sum(COLnum_people))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)
So here, we see blacks are stopped at a rate much higher than their share of the population.
How about counting how many stops by year and race?
COLstops %>%
group_by(year(date)) %>%
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
## # A tibble: 25 x 4
## # Groups: year(date) [5]
## `year(date)` year subject_race n
## <dbl> <dbl> <chr> <int>
## 1 2012 2012 asian/pacific islander 509
## 2 2012 2012 black 11796
## 3 2012 2012 hispanic 737
## 4 2012 2012 other/unknown 857
## 5 2012 2012 white 14976
## 6 2013 2013 asian/pacific islander 528
## 7 2013 2013 black 9822
## 8 2013 2013 hispanic 602
## 9 2013 2013 other/unknown 756
## 10 2013 2013 white 12160
## # … with 15 more rows
A simple visualization of that.
COLstops %>%
count(year = year(date), subject_race) %>%
ggplot(aes(x = year, y = n, color = subject_race)) +
geom_point() +
geom_line()
Now looking at Arrests
This looks at the number arrested within each race group.
ArrestsByRaceCOL <- COLstops %>%
group_by(subject_race, arrest_made) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(ArrestsByRaceCOL)
ArrestsRateByRaceCOL <- na.omit(ArrestsByRaceCOL)
ggplot(ArrestsRateByRaceCOL, aes(x=subject_race, y=Total, fill=arrest_made))+
geom_col(position="dodge")
#ggsave("Columbus2012to2016ArrestRateByRace.jpg", plot = last_plot())
Arrests altogether
Arrest_Percentage_ByRace <- COLstops %>%
filter(arrest_made=="TRUE") %>%
group_by(subject_race) %>%
summarize(Total=n()) %>%
mutate(pct=Total/sum(Total))
datatable(Arrest_Percentage_ByRace)
SearchesByRaceCOL <- COLstops %>%
group_by(subject_race, search_conducted) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(SearchesByRaceCOL)
SearchesByRaceCOL <- na.omit(SearchesByRaceCOL)
ggplot(SearchesByRaceCOL, aes(x=subject_race, y=Total, fill=search_conducted))+
geom_col(position="dodge")
How many of those searched were arrested?
ArrestedAfterSearch <- COLstops %>%
filter(search_conducted=="TRUE") %>%
group_by(subject_race, arrest_made) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(ArrestedAfterSearch)
Search_Percentage_ByRace <- COLstops %>%
filter(search_conducted=="TRUE") %>%
group_by(subject_race) %>%
summarize(Total=n()) %>%
mutate(pct=Total/sum(Total))
datatable(Search_Percentage_ByRace)
Columbus Disparity Index (each group's share of stops divided by its share of the population, as we did for Cincinnati):
white <- .48/.58
black <- .44/.28
black/white
## [1] 1.89881
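Those inputs are 48% of stops vs. 58% of the population for white drivers, and 44% vs. 28% for black drivers. The same result via the disparity_ratio helper sketched in the Cincinnati section:
disparity_ratio(44, 28, 48, 58) # ~1.90, matching black/white above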
library(dplyr)
COLstops <- COLstops %>% filter(!str_detect(location,'70|71'))
Note: The filter above drops interstate (70 and 71) stops; we run the same analysis on highway stops separately below. Columbus also had 10,495 stops that could not be geocoded; these are removed because the rest of this analysis requires latitude and longitude. This could affect the outcome, though the number is comparatively small.
COLstops <- COLstops %>%
filter(lat!="NA")
Let's calculate the sunset and dusk times for Columbus for every date in our data.
(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark. We’ll explain why this is relevant in a moment.)
# Get timezone for Columbus
COLtz <- lutz::tz_lookup_coords(COLcenter_lat, COLcenter_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <-
COLstops %>%
mutate(
lat = COLcenter_lat,
lon = COLcenter_lng
) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = .,
keep = c("sunset", "dusk"),
tz = COLtz
) %>%
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
date = ymd(str_sub(date, 1, 10))
) %>%
select(date, sunset, dusk, ends_with("minute"))
sunset_times
So, what is Columbus' inter-twilight period?
sunset_times %>%
filter(dusk == min(dusk) | dusk == max(dusk))
## date sunset dusk sunset_minute dusk_minute
## 1 2012-06-26 21:06:01 21:38:49 1266 1298
## 2 2014-06-27 21:06:02 21:38:49 1266 1298
## 3 2014-12-07 17:08:12 17:38:27 1028 1058
## 4 2015-06-27 21:06:01 21:38:49 1266 1298
## 5 2015-12-07 17:08:12 17:38:27 1028 1058
## 6 2016-06-26 21:06:01 21:38:49 1266 1298
For a time in this range, like 6:30pm, part of the year it’s still light, and part of the year, it’s still dark. So we can compare the proportion of black drivers stopped at 6:30pm when it’s light (usually in the summer) to the proportion of black drivers stopped at 6:30pm when it’s dark (usually in the winter). This way, we’re really seeing the effect of darkness, since driving and deployment patterns tend to change only with clock time. For example, the demographics of commuters who are on the road at 6:30pm likely stays pretty steady throughout the year, compared to demographics of commuters at 5pm versus at 9pm.
One quick note: We want to make sure to filter out stops that occurred in the approximately 30-minute period between sunset and dusk (end of civil twilight), since it is ambiguous as to whether that period is light or dark, which would muddle up our analysis.
Let's join the sunset times to our stops data and flag which stops happened after dark.
CBUS_VOD_stops <-
COLstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
#minute >= min_dusk_minute,
#minute <= max_dusk_minute,
# Remove the ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute))
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
How many stops are in this new, filtered data frame?
CBUS_VOD_stops %>% nrow()
## [1] 92260
Filtering for just the timeframe needed.
CBUS_VOD_stops <- CBUS_VOD_stops %>%
filter(time > hm("17:08"), time < hm("21:06"))
StopsByYearInDarkness <- CBUS_VOD_stops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CBUS_VOD_stops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
#View(TwilightTable)
TwilightTable <- TwilightTable %>%
mutate(other = `other/unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`other/unknown`+`asian/pacific islander`) %>%
group_by(year) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
datatable(TwilightTable)
Now just daylight and darkness
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
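The same spline-controlled logistic regression sketched for Cincinnati can be fit to the Columbus street-stop frame (again just a sketch; df = 6 is our assumption):
vod_model_cbus <- glm(is_black ~ is_dark + splines::ns(minute, df = 6),
                      family = binomial, data = CBUS_VOD_stops)
summary(vod_model_cbus)$coefficients["is_darkTRUE", ]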
Now we load COLstops again, and do the same analysis, but for highway stops.
## Load the data
# Replace the path below with the path to where your data lives
data_path <- "oh_columbus_without_duplicates.csv"
COLstops <- read_csv(data_path)
## Parsed with column specification:
## cols(
## raw_row_number = col_character(),
## date = col_date(format = ""),
## time = col_time(format = ""),
## location = col_character(),
## lat = col_double(),
## lng = col_double(),
## precinct = col_double(),
## zone = col_double(),
## subject_race = col_character(),
## subject_sex = col_character(),
## type = col_character(),
## arrest_made = col_logical(),
## citation_issued = col_logical(),
## warning_issued = col_logical(),
## outcome = col_character(),
## search_conducted = col_logical(),
## reason_for_stop = col_character()
## )
library(dplyr)
COLstops <- COLstops %>% filter(str_detect(location,'70|71'))
Note: The filter above keeps only the interstate (70 and 71) stops. Columbus had 10,495 stops that could not be geocoded; these are removed because the rest of this analysis requires latitude and longitude. This could affect the outcome, though the number is comparatively small.
COLstops <- COLstops %>%
filter(lat!="NA")
(Note that we distinguish here between sunset and dusk.)
# Get timezone for Columbus
COLtz <- lutz::tz_lookup_coords(COLcenter_lat, COLcenter_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <- COLstops %>%
mutate(lat = COLcenter_lat,
lon = COLcenter_lng) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(data = .,
keep = c("sunset", "dusk"), tz = COLtz) %>%
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
date = ymd(str_sub(date, 1, 10))) %>%
select(date, sunset, dusk, ends_with("minute"))
sunset_times
So, what is Columbus' inter-twilight period?
sunset_times %>%
filter(dusk == min(dusk) | dusk == max(dusk))
## date sunset dusk sunset_minute dusk_minute
## 1 2012-06-26 21:06:01 21:38:49 1266 1298
## 2 2014-06-27 21:06:02 21:38:49 1266 1298
## 3 2014-12-07 17:08:12 17:38:27 1028 1058
## 4 2015-06-27 21:06:01 21:38:49 1266 1298
## 5 2015-12-07 17:08:12 17:38:27 1028 1058
## 6 2016-06-26 21:06:01 21:38:49 1266 1298
Merging veil of darkness stop times
CBUS_VOD_stops <-
COLstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
#minute >= min_dusk_minute,
#minute <= max_dusk_minute,
# Remove the ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute))
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
How many stops are in this new, filtered data frame?
CBUS_VOD_stops %>% nrow()
## [1] 21869
Filtering for just the timeframe needed.
CBUS_VOD_stops <- CBUS_VOD_stops %>%
filter(time > hm("17:08"), time < hm("21:06"))
Comparing stops in darkness and in daylight (the variation comes from the daylight saving time change).
StopsByYearInDarkness <- CBUS_VOD_stops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CBUS_VOD_stops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
#View(TwilightTable)
TwilightTable <- TwilightTable %>%
mutate(other = `other/unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`other/unknown`+`asian/pacific islander`) %>%
group_by(year) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
datatable(TwilightTable)
Now just daylight and darkness
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
Note: the B02001 variable codes used below come from the Census table's data dictionary.
#You'll have to set your file path here
CBUS_RaceInfoByLocation <- sf::st_read("Columbus Block Groups by Race- acs2017_5yr_B02001_15000US390490046104/acs2017_5yr_B02001_15000US390490046104.shp") %>%
mutate(
Total=B02001001,
White_Alone=B02001002,
Black_Alone=B02001003,
PctWhite = round((White_Alone/Total)*100, 1),
PctBlack = round((Black_Alone/Total)*100, 1)) %>%
slice(-n())
## Reading layer `acs2017_5yr_B02001_15000US390490046104' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/Columbus Block Groups by Race- acs2017_5yr_B02001_15000US390490046104/acs2017_5yr_B02001_15000US390490046104.shp' using driver `ESRI Shapefile'
## Simple feature collection with 700 features and 22 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -83.23604 ymin: 39.79894 xmax: -82.76946 ymax: 40.15735
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
#We need the street stops only here. Stopping someone on the highway is not really in a neighborhood
data_path <- "oh_columbus_without_duplicates.csv"
COLstops <- read_csv(data_path)
## Parsed with column specification:
## cols(
## raw_row_number = col_character(),
## date = col_date(format = ""),
## time = col_time(format = ""),
## location = col_character(),
## lat = col_double(),
## lng = col_double(),
## precinct = col_double(),
## zone = col_double(),
## subject_race = col_character(),
## subject_sex = col_character(),
## type = col_character(),
## arrest_made = col_logical(),
## citation_issued = col_logical(),
## warning_issued = col_logical(),
## outcome = col_character(),
## search_conducted = col_logical(),
## reason_for_stop = col_character()
## )
COLstops <- COLstops %>% filter(!str_detect(location,'70|71'))
# Check and eliminate the rows that don't have location information
COLstops <- COLstops %>%
filter(lat!="NA")
#Getting Franklin County block groups (most of Columbus is within Franklin County)
library(tigris)
options(tigris_class = "sf")
FranklinBlockGroups <- block_groups("OH",county = "Franklin")
COLstops_spatial <- COLstops %>%
st_as_sf(coords=c("lng", "lat"), crs="+proj=longlat") %>%
st_transform(crs=st_crs(FranklinBlockGroups))
#Putting them altogether
COLStopsWithTracts <- st_join(FranklinBlockGroups, COLstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#plotting them
ggplot(FranklinBlockGroups) +
geom_sf(color="black") +
geom_point(data=COLstops, aes(x=lng, y=lat), color="red", alpha=0.05) +
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent')) +
labs(title="Traffic stops in Franklin")
#Vlookup
COLStopsWithTracts$GEOID <- paste0("15000US",COLStopsWithTracts$GEOID)
This puts the stops and race percentage of each block together.
COLstops_With_Race_and_Tracts <- COLStopsWithTracts
COLstops_With_Race_and_Tracts$PctWhite = CBUS_RaceInfoByLocation$PctWhite[match(COLStopsWithTracts$GEOID, CBUS_RaceInfoByLocation$geoid)]
COLstops_With_Race_and_Tracts$PctBlack = CBUS_RaceInfoByLocation$PctBlack[match(COLStopsWithTracts$GEOID, CBUS_RaceInfoByLocation$geoid)]
COLstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CBUS_RaceInfoByLocation$Total[match(COLStopsWithTracts$GEOID, CBUS_RaceInfoByLocation$geoid)]
COLstops_With_Race_and_Tracts <- COLstops_With_Race_and_Tracts %>%
filter(PctWhite!="NA", PctBlack!="NA")
Mostly_White_CBUS <- COLstops_With_Race_and_Tracts %>%
filter(PctWhite>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
Mostly_Black_CBUS <- COLstops_With_Race_and_Tracts %>%
filter(PctBlack>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
mean(Mostly_White_CBUS$Per_Capita)
## [1] 0.1105331
mean(Mostly_Black_CBUS$Per_Capita)
## [1] 0.1733853
This looks at the rate in ALL predominantly black areas and all predominantly white areas together, and figures out how much higher the rate is in black areas.
Total_People_In_Mostly_White_CBUS_Blocks <- sum(Mostly_White_CBUS$Total_In_Block)
Total_People_In_Mostly_Black_CBUS_Blocks <- sum(Mostly_Black_CBUS$Total_In_Block)
Total_Stops_In_Mostly_White_CBUS_Blocks <-sum(Mostly_White_CBUS$Total)
Total_Stops_In_Mostly_Black_CBUS_Blocks <-sum(Mostly_Black_CBUS$Total)
Total_Stops_In_Mostly_White_CBUS_Blocks/Total_People_In_Mostly_White_CBUS_Blocks
## [1] 0.07372752
Total_Stops_In_Mostly_Black_CBUS_Blocks/Total_People_In_Mostly_Black_CBUS_Blocks
## [1] 0.1356254
CBUS_white_rate <- Total_Stops_In_Mostly_White_CBUS_Blocks/Total_People_In_Mostly_White_CBUS_Blocks
CBUS_black_rate <- Total_Stops_In_Mostly_Black_CBUS_Blocks/Total_People_In_Mostly_Black_CBUS_Blocks
(CBUS_black_rate-CBUS_white_rate)/CBUS_white_rate
## [1] 0.8395498
For the map, we want to look at each tract, and see the percentage stopped for black, white, and other. (And then Tableau has built-in race by block shading.)
COL_Percent_Pulled_over_By_Block <- COLstops_With_Race_and_Tracts %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
COL_Percent_Pulled_over_By_Block <- st_set_geometry(COL_Percent_Pulled_over_By_Block, NULL)
#Uncomment this to export it
#rio::export(COL_Percent_Pulled_over_By_Block, "Columbus number of stops per census block.csv")
Note: We filed four separate open records requests with the Cleveland Police from March through September 2019, for 2015, 2016, 2017, and 2018 data. They kept giving us part of 2015 (mostly January), part of 2018 (mostly December), and two versions of 2016 (though one was labeled 2017). After driving from Columbus to Cleveland and asking again, they finally gave us 2017 as well, in PDF form. So this is not part of the Stanford data; this is data that was cleaned and geocoded by Eye on Ohio thanks to Google's Maps API credits. You can check out the original PDFs here: 2017 stops (https://eyeonohio.com/wp-content/uploads/2019/11/Cleveland-Police-Stops.pdf) and 2016 stops (https://eyeonohio.com/wp-content/uploads/2019/11/2016_UTT.pdf).
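The tabulizer package loaded at the top is one way to get those PDF tables into R before cleaning and geocoding. A rough sketch (the local file name and the assumption that each page parses into a similar table are ours; the extracted columns still need cleanup):
# Pull each page of the 2017 stops PDF as a data frame, then stack the pages.
pdf_pages <- tabulizer::extract_tables("Cleveland-Police-Stops.pdf", output = "data.frame")
cle_2017_raw <- dplyr::bind_rows(pdf_pages)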
## Load the data
CLEstops <- rio::import("CLE_stops_cleaned_and_geocoded.csv")
# Additional data and fixed values we'll be using
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
num_people = c(7930, 193817, 43466, 11508, 131338) #This is where I got this data:https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
#note "white" means 'non-hispanic white'
) %>%
mutate(subject_race = as.factor(subject_race))
#Have to add in the latitude and longitude of wherever you are (this is slightly different for Cleveland)
center_lat <- 41.4993
center_lng <- -81.6944
Note: the original data set has more rows because each offense is a separate row, so someone getting two tickets would show up twice. We didn't want that.
colnames(CLEstops)
## [1] "Ticket_Date" "Ticket_Time" "TicketAddress" "DOBTicketed"
## [5] "Arrest" "Race" "Sex" "lat"
## [9] "lon"
Looking for duplicates
CLE_duplicates <- CLEstops %>%
group_by(Ticket_Date, Ticket_Time, DOBTicketed) %>%
summarize (Total=n())
head(CLE_duplicates)
## # A tibble: 6 x 4
## # Groups: Ticket_Date, Ticket_Time [5]
## Ticket_Date Ticket_Time DOBTicketed Total
## <chr> <chr> <chr> <int>
## 1 1/1/16 1:20 Aug 30 1978 2
## 2 1/1/16 7:46 Jan 4 1994 2
## 3 1/1/16 7:46 Jul 13 1989 2
## 4 1/1/16 9:32 Jan 1 1997 2
## 5 1/1/17 0:10 Apr 9 1993 2
## 6 1/1/17 0:34 Jun 20 1990 2
Chances are, two people with the same birth date did not get a ticket at the same time on the same date. It looks like there are a lot of duplicates here, so we will keep just the unique stops.
CLEstops <- dplyr::distinct(CLEstops)
Converting the date and time columns to date and time formats.
#CLEstops <- filter(CLEstops, !is.na(Ticket_Date))
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)
CLEstops <- CLEstops %>%
mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)
max(CLEstops$date)
## [1] "2017-12-30"
min(CLEstops$date)
## [1] "2016-01-01"
How does Cleveland count race?
CLEstops %>%
count(subject_race)
## # A tibble: 9 x 2
## subject_race n
## <chr> <int>
## 1 America! 3
## 2 Americar 66
## 3 Arab 200
## 4 Asian 171
## 5 Black 27860
## 6 Clevelam 1
## 7 Native H 26
## 8 Unknown 1809
## 9 White 16943
There’s some ambiguity in this data. Like, are Hispanics tagged as white? Does “Native H” mean Native American? “America!”… what does that mean? We have sent a request to the police about this.
You might want to do options(scipen=999) so that you don’t get scientific notation here.
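That is:
options(scipen = 999) # turn off scientific notation for the proportions below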
# This method builds off of using `count` as above
CLEstops %>%
count(subject_race) %>%
mutate(prop = n / sum(n))
## # A tibble: 9 x 3
## subject_race n prop
## <chr> <int> <dbl>
## 1 America! 3 0.0000637
## 2 Americar 66 0.00140
## 3 Arab 200 0.00425
## 4 Asian 171 0.00363
## 5 Black 27860 0.592
## 6 Clevelam 1 0.0000212
## 7 Native H 26 0.000552
## 8 Unknown 1809 0.0384
## 9 White 16943 0.360
# This method uses the group_by/summarize paradigm
CLEstops %>%
group_by(subject_race) %>%
summarize(
n = n(),
prop = n / nrow(.)
)
## # A tibble: 9 x 3
## subject_race n prop
## <chr> <int> <dbl>
## 1 America! 3 0.0000637
## 2 Americar 66 0.00140
## 3 Arab 200 0.00425
## 4 Asian 171 0.00363
## 5 Black 27860 0.592
## 6 Clevelam 1 0.0000212
## 7 Native H 26 0.000552
## 8 Unknown 1809 0.0384
## 9 White 16943 0.360
At first glance, we see there are more stops of black drivers than any other race.
How about counting how many stops by year and race?
CLEstops %>%
group_by(year(date)) %>%
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
## # A tibble: 16 x 4
## # Groups: year(date) [2]
## `year(date)` year subject_race n
## <dbl> <dbl> <chr> <int>
## 1 2016 2016 America! 3
## 2 2016 2016 Americar 43
## 3 2016 2016 Arab 105
## 4 2016 2016 Asian 97
## 5 2016 2016 Black 14066
## 6 2016 2016 Native H 9
## 7 2016 2016 Unknown 903
## 8 2016 2016 White 8822
## 9 2017 2017 Americar 23
## 10 2017 2017 Arab 95
## 11 2017 2017 Asian 74
## 12 2017 2017 Black 13794
## 13 2017 2017 Clevelam 1
## 14 2017 2017 Native H 17
## 15 2017 2017 Unknown 906
## 16 2017 2017 White 8121
A simple visualization of this.
CLEstops %>%
count(month = month(date), subject_race) %>%
ggplot(aes(x = month, y = n, color = subject_race)) +
geom_point() +
geom_line()
Looks pretty consistent month to month.
Note: For the purposes of our analysis, we are going to group "Native H", "America!", "Americar", "Arab", and "Unknown" together as other/unknown, and recode the single "Clevelam" entry as hispanic so that category shows up in the table at all.
CLEstops$subject_race <-recode(CLEstops$subject_race, "Native H"="other/unknown",
"Clevelam"="hispanic",
"America!"="other/unknown",
"Americar"="other/unknown",
"Arab"="other/unknown",
"Unknown"="other/unknown",
"Asian"="asian/pacific islander",
"Black"="black",
"White"="white")
This adds each race's proportion of the population.
population_2017 <- population_2017 %>%
mutate(Proportion_Of_Population = num_people / sum(num_people)) #I added this to the population data frame
This creates our stop rate data table.
StopRateDataTable <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>%
left_join(
population_2017,
by = "subject_race"
) %>%
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)
So it looks like blacks are about 50 percent of the population though they are 60 percent of the stops. There's a big discrepancy here though: Hispanics are 11 percent of the population, but no Cleveland stop is labeled "Hispanic." (We recoded one just so that the category actually showed up in this table.) This isn't necessarily inaccurate: legally, Hispanic is an "ethnicity" and not a "race." Though 200 drivers are listed as "Arab," which isn't a race either.
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black","other/unknown","white"
),
num_people = c(8026, 196006, 30062, 154718) #This is from the 2017 American Community Survey of Cleveland. "Other/unknown" is: 12,479 for 'Some other Race', 60 for Native Hawaiian, 1,931 for American Indian/Alaskan Native, 15,592 for 'two or more races.' Food for thought: should we include the 6,919 people who marked both black and white as part of African American?
) %>%
mutate(subject_race = as.factor(subject_race))
population_2017 <- population_2017 %>%
mutate(Proportion_Of_Population = num_people / sum(num_people))
CLEstops$subject_race <-recode(CLEstops$subject_race, "hispanic"="other/unknown")
StopRateDataTable_No_Hispanic <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>%
left_join(
population_2017,
by = "subject_race"
) %>%
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable_No_Hispanic)
Cleveland Disparity Index
white <- .36/.4
black <- .59/.5
black/white
## [1] 1.311111
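The same index can be computed straight from StopRateDataTable_No_Hispanic instead of typing in the rounded shares. This is a minimal sketch, assuming the stop_rate and Proportion_Of_Population columns created above (stop_vs_population is a new name, not in the original script):
disparity_tbl <- StopRateDataTable_No_Hispanic %>%
  mutate(stop_vs_population = stop_rate / Proportion_Of_Population)
# black disparity relative to white disparity, without the rounding above
disparity_tbl$stop_vs_population[disparity_tbl$subject_race == "black"] /
  disparity_tbl$stop_vs_population[disparity_tbl$subject_race == "white"]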
Sunset and dusk times for Cleveland.
(Note that we distinguish here between sunset and dusk. Sunset is the point in time when the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark.)
# Get timezone for Cleveland
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <- CLEstops %>%
mutate(
lat = center_lat,
lon = center_lng
) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = .,
keep = c("sunset", "dusk"),
tz = tz
) %>%
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
date = ymd(str_sub(date, 1, 10))
) %>%
select(date, sunset, dusk, ends_with("minute"))
sunset_times
So, what is Cleveland’s inter-twilight period?
sunset_times %>%
  filter(dusk == min(dusk) | dusk == max(dusk))
## date sunset dusk sunset_minute dusk_minute
## 1 2016-12-08 16:58:18 17:29:29 1018 1049
## 2 2017-12-08 16:58:18 17:29:29 1018 1049
## 3 2017-12-07 16:58:20 17:29:29 1018 1049
## 4 2016-06-26 21:05:57 21:39:57 1265 1299
## 5 2016-12-07 16:58:19 17:29:29 1018 1049
## 6 2017-06-26 21:05:56 21:39:57 1265 1299
datatable(.Last.value)
Joining sunset times to our data.
CLE_VODstops <- CLEstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
###Skipping this part
minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk (twilight)
!(minute > sunset_minute & minute < dusk_minute)
)
Filtering for just the timeframe needed.
CLE_VODstops <- CLE_VODstops %>%
filter(time > hm("16:58"), time < hm("21:39"))
library(dplyr)
CLE_VODstops <- CLE_VODstops %>%
filter(str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
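The same highway pattern is reused several times below. As a small sketch (highway_patterns and highway_regex are new names, not in the original script), it can be built once from a vector so every copy stays in sync:
highway_patterns <- c("90 E", "90 W", "71 N", "71 S", "76 N", "76 S",
                      "77 N", "77 S", "480 E", "480 W", "I-71", "I-480",
                      "I-77", "I-76", "I-90", "I71", "I76", "I90", "I77", "I480")
highway_regex <- paste(highway_patterns, collapse = "|")
# filter(str_detect(TicketAddress, highway_regex)) keeps highway stops;
# filter(!str_detect(TicketAddress, highway_regex)) keeps street stops.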
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
datatable(TwilightTable)
TwilightTable <- TwilightTable %>%
mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>%
group_by(sunshine) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
#select(black, white, other, year) %>%
datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
CLE_VODstops <- CLEstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
###Skipping this part
minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk (twilight)
!(minute > sunset_minute & minute < dusk_minute)
)
Taking out the highways, keeping just the street stops.
CLE_VODstops <- CLE_VODstops %>%
filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
Filtering for just the timeframe needed.
CLE_VODstops <- CLE_VODstops %>%
filter(time > hm("16:58"), time < hm("21:39"))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
datatable(TwilightTable)
TwilightTable <- TwilightTable %>%
mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>%
group_by(sunshine) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
#select(black, white, other, year) %>%
datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
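The tables above compare raw shares of stops in darkness versus daylight. A common way to formalize the veil-of-darkness test is a logistic regression that asks whether a stopped driver is less likely to be black after dark, controlling for clock time. This is a hedged sketch of that form, not the original script; it uses the is_black, is_dark, and minute columns built above, with splines::ns (base R) as the time control.
library(splines)
vod_model <- glm(
  is_black ~ is_dark + ns(minute, df = 6),  # control for time of day with a natural spline
  family = binomial,
  data = CLE_VODstops
)
summary(vod_model)$coefficients["is_darkTRUE", ]
A negative is_dark coefficient would mean black drivers make up a smaller share of stops after dark, holding clock time constant, which is the pattern the veil-of-darkness logic associates with race being visible to officers in daylight.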
Looking at Arrest Rates
ArrestsByRace <- CLEstops %>%
group_by(subject_race, Arrest) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(ArrestsByRace)
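To put those rates side by side, here is a hedged sketch that pulls the arrest rate for black and white drivers out of ArrestsByRace and takes their ratio (it assumes the Arrest column is coded "Yes" for arrests, as in the filter further down):
black_arrest_rate <- ArrestsByRace %>%
  filter(subject_race == "black", Arrest == "Yes") %>%
  pull(pct)
white_arrest_rate <- ArrestsByRace %>%
  filter(subject_race == "white", Arrest == "Yes") %>%
  pull(pct)
black_arrest_rate / white_arrest_rate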
ArrestsRateByRace <- na.omit(ArrestsByRace)
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=Arrest))+
geom_col(position="dodge")
#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())
Arrest_Percentage_ByRace <- CLEstops %>%
filter(Arrest=="Yes") %>%
group_by(subject_race) %>%
summarize(Total=n()) %>%
mutate(pct=Total/sum(Total))
datatable(Arrest_Percentage_ByRace)
Note: Cleveland doesn’t have search data.
Note these codes are from the data dictionary.
#You'll have to set your file pane location here
setwd("~/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011")
CLE_RaceInfoByLocation <- sf::st_read("acs2017_5yr_B02001_15000US390351093011.shp")
## Reading layer `acs2017_5yr_B02001_15000US390351093011' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011/acs2017_5yr_B02001_15000US390351093011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 467 features and 22 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.2302 ymin: -14.60134 xmax: 179.8597 ymax: 71.43979
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
CLE_RaceInfoByLocation <- CLE_RaceInfoByLocation %>%
mutate(
Total=B02001001,
White_Alone=B02001002,
Black_Alone=B02001003,
PctWhite = round((White_Alone/Total)*100, 1),
PctBlack = round((Black_Alone/Total)*100, 1)) %>%
slice(-n())
# We only need the columns with the latitude and longitude
#We need the street stops only here. Stopping someone on the highway is not really in a neighborhood
CLEstops <- rio::import("CLE_stops_cleaned_and_geocoded.csv")
CLEstops <- dplyr::distinct(CLEstops)
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)
CLEstops <- CLEstops %>%
mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)
# Check and eliminate the rows that are on a highway, which really isn't "in" a neighborhood
CLEstops <- CLEstops %>%
filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
#Getting Cleveland block groups
library(tigris)
options(tigris_class = "sf")
CLEBlockGroups <- block_groups("OH",county = c("Cuyahoga","Lorain","Lake","Medina"))
CLEstops_spatial <- CLEstops %>%
st_as_sf(coords=c("lon", "lat"), crs="+proj=longlat") %>%
st_transform(crs=st_crs(CLEBlockGroups))
#Putting them all together
CLEStopsWithTracts <- st_join(CLEBlockGroups, CLEstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#plotting them
ggplot(CLEBlockGroups) +
geom_sf(color="black") +
geom_point(data=CLEstops, aes(x=lon, y=lat), color="red", alpha=0.05) +
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent')) +
labs(title="Traffic stops in Cleveland")
#Vlookup-style match: prefix GEOID with "15000US" so it matches the ACS table's geoid
CLEStopsWithTracts$GEOID <- paste0("15000US",CLEStopsWithTracts$GEOID)
This puts the stops and race percentage of each block together.
CLEstops_With_Race_and_Tracts <- CLEStopsWithTracts
CLEstops_With_Race_and_Tracts$PctWhite = CLE_RaceInfoByLocation$PctWhite[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
CLEstops_With_Race_and_Tracts$PctBlack = CLE_RaceInfoByLocation$PctBlack[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
CLEstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CLE_RaceInfoByLocation$Total[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
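The three match() lines above work like a spreadsheet VLOOKUP. The same lookup can also be written as a dplyr join; this is a hedged sketch under the column names above, and race_lookup / CLEstops_With_Race_and_Tracts_alt are new names that are not part of the original script.
race_lookup <- CLE_RaceInfoByLocation %>%
  sf::st_set_geometry(NULL) %>%   # drop geometry so the join only adds columns
  mutate(geoid = as.character(geoid)) %>%
  select(geoid, PctWhite, PctBlack, Total_People_Per_Census_Block = Total)
CLEstops_With_Race_and_Tracts_alt <- CLEStopsWithTracts %>%
  left_join(race_lookup, by = c("GEOID" = "geoid"))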
This looks at the average per-capita stop rate for mostly white and mostly black block groups.
CLEstops_With_Race_and_Tracts <- CLEstops_With_Race_and_Tracts %>%
filter(PctWhite!="NA", PctBlack!="NA")
Mostly_White_CLE <- CLEstops_With_Race_and_Tracts %>%
filter(PctWhite>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
Mostly_Black_CLE <- CLEstops_With_Race_and_Tracts %>%
filter(PctBlack>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
mean(Mostly_White_CLE$Per_Capita)
## [1] 0.08560138
mean(Mostly_Black_CLE$Per_Capita)
## [1] 0.120398
This looks at the rate in ALL predominantly black areas and all predominantly white areas together, and figures out how much higher the rate is in black areas.
Total_People_In_Mostly_White_CLE_Blocks <- sum(Mostly_White_CLE$Total_In_Block)
Total_People_In_Mostly_Black_CLE_Blocks <- sum(Mostly_Black_CLE$Total_In_Block)
Total_Stops_In_Mostly_White_CLE_Blocks <-sum(Mostly_White_CLE$Total)
Total_Stops_In_Mostly_Black_CLE_Blocks <-sum(Mostly_Black_CLE$Total)
Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
## [1] 0.07747141
Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks
## [1] 0.09729283
CLE_white_rate <- Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
CLE_black_rate <- Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks
(CLE_black_rate-CLE_white_rate)/CLE_white_rate
## [1] 0.2558545
For the map, we want to look at each tract, and see the percentage stopped for black, white, and other. (And then Tableau has built-in race by block shading.)
CLE_Percent_Pulled_over_By_Block <- CLEstops_With_Race_and_Tracts %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
CLE_Percent_Pulled_over_By_Block <- st_set_geometry(CLE_Percent_Pulled_over_By_Block, NULL)
#Uncomment this to export it
#rio::export(CLE_Percent_Pulled_over_By_Block, "Cleveland number of stops per census block.csv")
## Load the data
CLE_speeders <- rio::import("Cleveland just speeders geocoded.xlsx")
CLEstops <-CLE_speeders
# Additional data and fixed values we'll be using
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
num_people = c(7930, 193817, 43466, 11508, 131338) #This is where I got this data:https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
#note "white" means 'non-hispanic white'
) %>%
mutate(subject_race = as.factor(subject_race))
#Have to add in the latitude and longitude of wherever you are (this is slightly different for Columbus)
center_lat <- 41.4993
center_lng <- -81.6944
Note the original data set has more rows, because each offense is a separate row. So someone getting two tickets would show up twice. And we didn’t want that.
colnames(CLEstops)
## [1] "TicketType" "Ticket_Date" "TicketDatetime" "Ticket_Time"
## [5] "TicketAddress" "Full_address" "DOBTicketed" "Arrest"
## [9] "Race" "Sex" "Statute" "Speeding?"
## [13] "lon" "lat"
Looking for duplicates
CLE_duplicates <- CLEstops %>%
group_by(Ticket_Date, Ticket_Time, DOBTicketed) %>%
summarize (Total=n())
head(CLE_duplicates)
## # A tibble: 6 x 4
## # Groups: Ticket_Date, Ticket_Time [6]
## Ticket_Date Ticket_Time DOBTicketed Total
## <chr> <chr> <chr> <int>
## 1 1/1/16 7:46 Jul 13 1989 1
## 2 1/1/17 0:43 Aug 20 1955 12:00 AM 1
## 3 1/1/17 10:25 Sep 26 1981 1
## 4 1/1/17 10:35 Oct 17 1967 12:00 AM 1
## 5 1/1/17 15:25 Feb 14 1966 12:00 AM 1
## 6 1/1/17 15:34 Dec 22 1981 12:00 AM 1
18 people got two speeding tickets each, or they’re twins. Or this is miscoded. Either way, it’s not going to affect ~15,000 stops.
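To double-check that figure, here is a minimal sketch that counts how many date/time/birth-date combinations in CLE_duplicates appear more than once:
CLE_duplicates %>%
  filter(Total > 1) %>%
  nrow()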
CLEstops <- dplyr::distinct(CLEstops)
Changing the date and time columns to date and time formats
#CLEstops <- filter(CLEstops, !is.na(Ticket_Date))
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)
CLEstops <- CLEstops %>%
mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)
max(CLEstops$date)
## [1] "2017-12-30"
min(CLEstops$date)
## [1] "2016-01-01"
How does Cleveland count race?
CLEstops %>%
count(subject_race)
## # A tibble: 9 x 2
## subject_race n
## <chr> <int>
## 1 American 23
## 2 Americar 11
## 3 Arab 108
## 4 Asian 76
## 5 Black 7854
## 6 Native H 3
## 7 Native Ha 3
## 8 Unknown 624
## 9 White 6486
There’s some ambiguity in this data. Like, are Hispanics tagged as white? Does “Native H” mean Native American? “America!”… what does that mean? We have sent a request to the police about this.
You might want to do options(scipen=999) so that you don’t get scientific notation here.
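For example, running this once near the top of the script keeps the proportions below readable:
options(scipen = 999)  # show plain decimals instead of scientific notation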
# This method builds off of using `count` as above
CLEstops %>%
count(subject_race) %>%
mutate(prop = n / sum(n))
## # A tibble: 9 x 3
## subject_race n prop
## <chr> <int> <dbl>
## 1 American 23 0.00151
## 2 Americar 11 0.000724
## 3 Arab 108 0.00711
## 4 Asian 76 0.00500
## 5 Black 7854 0.517
## 6 Native H 3 0.000198
## 7 Native Ha 3 0.000198
## 8 Unknown 624 0.0411
## 9 White 6486 0.427
# This method uses the group_by/summarize paradigm
CLEstops %>%
group_by(subject_race) %>%
summarize(
n = n(),
prop = n / nrow(.)
)
## # A tibble: 9 x 3
## subject_race n prop
## <chr> <int> <dbl>
## 1 American 23 0.00151
## 2 Americar 11 0.000724
## 3 Arab 108 0.00711
## 4 Asian 76 0.00500
## 5 Black 7854 0.517
## 6 Native H 3 0.000198
## 7 Native Ha 3 0.000198
## 8 Unknown 624 0.0411
## 9 White 6486 0.427
At first glance, we see there are more stops of black drivers than any other race.
How about counting how many stops by year and race?
CLEstops %>%
group_by(year(date)) %>%
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
## # A tibble: 18 x 4
## # Groups: year(date) [2]
## `year(date)` year subject_race n
## <dbl> <dbl> <chr> <int>
## 1 2016 2016 American 13
## 2 2016 2016 Americar 6
## 3 2016 2016 Arab 61
## 4 2016 2016 Asian 40
## 5 2016 2016 Black 4269
## 6 2016 2016 Native H 1
## 7 2016 2016 Native Ha 2
## 8 2016 2016 Unknown 350
## 9 2016 2016 White 3426
## 10 2017 2017 American 10
## 11 2017 2017 Americar 5
## 12 2017 2017 Arab 47
## 13 2017 2017 Asian 36
## 14 2017 2017 Black 3585
## 15 2017 2017 Native H 2
## 16 2017 2017 Native Ha 1
## 17 2017 2017 Unknown 274
## 18 2017 2017 White 3060
A simple visualization of this.
CLEstops %>%
count(month = month(date), subject_race) %>%
ggplot(aes(x = month, y = n, color = subject_race)) +
geom_point() +
geom_line()
Looks pretty consistent month to month.
Note: For the purposes of our analysis, we are going to combine “Native H”, “Native Ha”, “American”, “Americar”, “Arab”, and “Unknown” into “other/unknown.”
CLEstops$subject_race <-recode(CLEstops$subject_race, "Native H"="other/unknown",
"Clevelam"="hispanic",
"America!"="other/unknown",
"Americar"="other/unknown",
"American"="other/unknown",
"Native Ha"="other/unknown",
"Arab"="other/unknown",
"Unknown"="other/unknown",
"Asian"="asian/pacific islander",
"Black"="black",
"White"="white")
This computes each group’s share of the city’s population.
population_2017 <- population_2017 %>%
mutate(Proportion_Of_Population = num_people / sum(num_people)) #I added this to the population data frame
This creates our stop rate data table.
StopRateDataTable <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>%
left_join(
population_2017,
by = "subject_race"
) %>%
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)
For the speeding stops, black residents are about 50 percent of the population and roughly 52 percent of the stops, a smaller gap than for stops overall. The Hispanic discrepancy noted above applies here too: Hispanics are 11 percent of the population, but no Cleveland speeding stop is labeled with a “race” of “Hispanic.” This isn’t inaccurate; legally, Hispanic is an “ethnicity” and not a “race.” Though 108 of these stops are listed as “Arab,” which isn’t a race either. Again, we will have to check with the police department on that.
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black","other/unknown","white"
),
num_people = c(8026, 196006, 30062, 154718) #This is from the 2017 American Community Survey of Cleveland. "Other/unknown" is: 12,479 for 'Some other race,' 60 for Native Hawaiian, 1,931 for American Indian/Alaskan Native, and 15,592 for 'two or more races.' Food for thought: should the 6,919 people counted as both black and white be included as part of African American?
) %>%
mutate(subject_race = as.factor(subject_race))
population_2017 <- population_2017 %>%
mutate(Proportion_Of_Population = num_people / sum(num_people))
CLEstops$subject_race <-recode(CLEstops$subject_race, "hispanic"="other/unknown")
StopRateDataTable_No_Hispanic <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>%
left_join(
population_2017,
by = "subject_race"
) %>%
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable_No_Hispanic)
Sunset and dusk times for Cleveland.
(Note that we distinguish here between sunset and dusk. Sunset is the point in time when the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark.)
# Get timezone for Cleveland
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <- CLEstops %>%
mutate(
lat = center_lat,
lon = center_lng
) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = .,
keep = c("sunset", "dusk"),
tz = tz
) %>%
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
date = ymd(str_sub(date, 1, 10))
) %>%
select(date, sunset, dusk, ends_with("minute"))
sunset_times
So, what is Cleveland’s inter-twilight period?
sunset_times %>%
  filter(dusk == min(dusk) | dusk == max(dusk))
## date sunset dusk sunset_minute dusk_minute
## 1 2016-06-26 21:05:57 21:39:57 1265 1299
## 2 2016-12-07 16:58:19 17:29:29 1018 1049
## 3 2017-12-08 16:58:18 17:29:29 1018 1049
## 4 2017-06-26 21:05:56 21:39:57 1265 1299
## 5 2016-12-08 16:58:18 17:29:29 1018 1049
## 6 2017-12-07 16:58:20 17:29:29 1018 1049
datatable(.Last.value)
Joining sunset times to our data.
CLE_VODstops <- CLEstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
###Skipping this part
minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk (twilight)
!(minute > sunset_minute & minute < dusk_minute)
)
Filtering for just the timeframe needed.
CLE_VODstops <- CLE_VODstops %>%
filter(time > hm("16:58"), time < hm("21:39"))
library(dplyr)
CLE_VODstops <- CLE_VODstops %>%
filter(str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
datatable(TwilightTable)
TwilightTable <- TwilightTable %>%
mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>%
group_by(sunshine) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
#select(black, white, other, year) %>%
datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
CLE_VODstops <- CLEstops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
###Skipping this part
minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk (twilight)
!(minute > sunset_minute & minute < dusk_minute)
)
Taking out the highways, keeping just the street stops.
CLE_VODstops <- CLE_VODstops %>%
filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
Filtering for just the timeframe needed.
CLE_VODstops <- CLE_VODstops %>%
filter(time > hm("16:58"), time < hm("21:39"))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total) %>%
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>%
mutate(year = year(date)) %>%
group_by(year,subject_race) %>%
summarize(Total =n()) %>%
spread(subject_race, Total)%>%
mutate(sunshine = "Daylight")
TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>%
group_by(year)
datatable(TwilightTable)
TwilightTable <- TwilightTable %>%
mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>%
group_by(sunshine) %>%
mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
#select(black, white, other, year) %>%
datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>%
group_by(sunshine) %>%
summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
Percent_of_Whites_Stopped=mean(white_percentage),
other=mean(other_percentage))
datatable(TwilightTableTogether)
Looking at Arrest Rates
ArrestsByRace <- CLEstops %>%
group_by(subject_race, Arrest) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(ArrestsByRace)
ArrestsRateByRace <- na.omit(ArrestsByRace)
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=Arrest))+
geom_col(position="dodge")
#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())
Arrest_Percentage_ByRace <- CLEstops %>%
filter(Arrest=="Yes") %>%
group_by(subject_race) %>%
summarize(Total=n()) %>%
mutate(pct=Total/sum(Total))
datatable(Arrest_Percentage_ByRace)
Note: Cleveland doesn’t have search data.
Note these codes are from the data dictionary: https://www.socialexplorer.com/data/ACS2016_5yr/metadata/?ds=ACS16_5yr&table=B02001
#You'll have to set your file pane location here
setwd("~/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011")
CLE_RaceInfoByLocation <- sf::st_read("acs2017_5yr_B02001_15000US390351093011.shp")
## Reading layer `acs2017_5yr_B02001_15000US390351093011' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011/acs2017_5yr_B02001_15000US390351093011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 467 features and 22 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.2302 ymin: -14.60134 xmax: 179.8597 ymax: 71.43979
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
CLE_RaceInfoByLocation <- CLE_RaceInfoByLocation %>%
mutate(
Total=B02001001,
White_Alone=B02001002,
Black_Alone=B02001003,
PctWhite = round((White_Alone/Total)*100, 1),
PctBlack = round((Black_Alone/Total)*100, 1)) %>%
slice(-n())
# We only need the columns with the latitude and longitude
#We need the street stops only here. Stopping someone on the highway is not really in a neighborhood
#CLEstops <- rio::import("CLE_stops_cleaned_and_geocoded.csv")
CLEstops <- dplyr::distinct(CLEstops)
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
## Warning: All formats failed to parse. No formats found.
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)
CLEstops <- CLEstops %>%
mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)
# Check and eliminate the rows that are on a highway, which really isn't "in" a neighborhood
CLEstops <- CLEstops %>%
filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
#Getting Cleveland block groups
library(tigris)
options(tigris_class = "sf")
CLEBlockGroups <- block_groups("OH",county = c("Cuyahoga","Lorain","Lake","Medina"))
CLEstops <- filter(CLEstops, !is.na(lat))
CLEstops_spatial <- CLEstops %>%
st_as_sf(coords=c("lon", "lat"), crs="+proj=longlat") %>%
st_transform(crs=st_crs(CLEBlockGroups))
#Putting them all together
CLEStopsWithTracts <- st_join(CLEBlockGroups, CLEstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#plotting them
ggplot(CLEBlockGroups) +
geom_sf(color="black") +
geom_point(data=CLEstops, aes(x=lon, y=lat), color="red", alpha=0.05) +
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent')) +
labs(title="Traffic stops in Cleveland")
#Vlookup-style match: prefix GEOID with "15000US" so it matches the ACS table's geoid
CLEStopsWithTracts$GEOID <- paste0("15000US",CLEStopsWithTracts$GEOID)
This puts the stops and race percentage of each block together.
CLEstops_With_Race_and_Tracts <- CLEStopsWithTracts
CLEstops_With_Race_and_Tracts$PctWhite = CLE_RaceInfoByLocation$PctWhite[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
CLEstops_With_Race_and_Tracts$PctBlack = CLE_RaceInfoByLocation$PctBlack[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
CLEstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CLE_RaceInfoByLocation$Total[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
CLEstops_With_Race_and_Tracts <- CLEstops_With_Race_and_Tracts %>%
filter(PctWhite!="NA", PctBlack!="NA")
Mostly_White_CLE <- CLEstops_With_Race_and_Tracts %>%
filter(PctWhite>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
Mostly_Black_CLE <- CLEstops_With_Race_and_Tracts %>%
filter(PctBlack>75) %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))
mean(Mostly_White_CLE$Per_Capita)
## [1] 0.02781577
mean(Mostly_Black_CLE$Per_Capita)
## [1] 0.0487613
This looks at the rate in ALL predominantly black areas and all predominantly white areas together, and figures out how much higher the rate is in black areas.
Total_People_In_Mostly_White_CLE_Blocks <- sum(Mostly_White_CLE$Total_In_Block)
Total_People_In_Mostly_Black_CLE_Blocks <- sum(Mostly_Black_CLE$Total_In_Block)
Total_Stops_In_Mostly_White_CLE_Blocks <-sum(Mostly_White_CLE$Total)
Total_Stops_In_Mostly_Black_CLE_Blocks <-sum(Mostly_Black_CLE$Total)
Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
## [1] 0.02698263
Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks
## [1] 0.03113181
CLE_white_rate <- Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
CLE_black_rate <- Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks
(CLE_black_rate-CLE_white_rate)/CLE_white_rate
## [1] 0.153772