This project looks at police stops in Cincinnati, Columbus, and Cleveland Ohio as part of Stanford University’s open policing project. Standford developed the base R script for this, including the veil of darkness test, and gathered some of the records. A team of Ohio journalists gathered additional records and worked for eights months to add additional analysis, such as the point in polygon part, and to conduct interviews, make graphics, etc.
This is NOT nor does it claim to be an official study. But everything is accurate in here to the best of our knowledge after checking with multiple sources at the time of publication.
First, we are comparing the population of each city, and not the drivers in each city, which will be slightly different. The The Ohio Dept. of Public Safety, which houses the BMV, does not break down drivers by race, which means Census data was the best proxy we could find.
Second, this doesn’t account for people driving into and out of the city. Getting a ticket there doesn’t necessarily mean you live there.
A note on race: Race is counted two different ways. For a full discussion on this, see The Curious Journalist’s guide to data.
Essentially, “Hispanic” is considered an “ethnicity” and not a “race.” So you can be white and Hispanic or black and hispanic. (This also points to the stupidity of treating someone differently because of their race, which of course is the whole story, as we can’t even define it really.)
For Cincinnati and Columbus, the actual number of hispanics was quite low, so we went with the definiton “non-hispanic white.” In Cleveland, where the population is 11 percent Hispanic, police didn’t mark the race of anyone with a ticket as “Hispanic.” (Though curiously they did mark 200 as “Arabic.”) So to more accurately portray Cleveland, we went with “White” including hispanics.
This doesn’t really affect our analysis though because:
there were very few Hispanic Blacks in the census
it’s not possible to tell from the data who is hispanic anyway.
So this data probably does not really tell us anything about Hispanics, and it’s not possible to draw any conculsions from their results.
Other caveats:
We are talking about census blocks here, so the smaller the number counted, the larger the margin of error.
Also, geocoding has a block margin of error, since the Cincinnati police code stops as 12XX Main Street. This isn’t a huge error, but it gets larger when we’re talking about a small area like a census block group.
Every large datatset will have a few errors, and ours was no exception. We had to exclude a small subset of stops that were not propertly geocoded, and we had to exclude a small subset of stops with missing information.
The veil of darkess test assumes that drivers behave the same way at night and during the day. There is some evidence that this is not always the case. See, for example, here. See Veil of Darkness full paper for a more in-depth analysis.
Thank you to the following for providing ggmap: D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1),144-161. link
loading libraries and data
## Libraries to include
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(dplyr)
library(lutz)
library(suncalc)
library(splines)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tabulizer)
library(pdftables)
library(hms)
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
library(DT)
## Load the data
# Replace the path below with the path to where your data lives on your computer that you are using this script in
#data_path <- "~/Code/SunlightStops/SunlightStopsStory/oh_cincinnati_2019_02_25.csv"
#data_path <- "~/Code/SunlightStops/SunlightStopsStory/CinStopsByBlock2017.csv"
#CINstops <- read_csv(data_path)
#table(CINstops$subject_race)
CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")
#CINstops <- rio::import("CinStopsWithRaceByBlock.xlsx")
# Additional data and fixed values we'll be using
population_2016 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
CINnum_people = c(5418, 127615, 9554, 8489, 145512) #This is where I got this data: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
) %>%
mutate(subject_race = as.factor(subject_race))
#Have to add in the latitude and longitude of wherever you are (this is slightly different for Columbus)
CINcenter_lat <- 39.1031
CINcenter_lng <- -84.5120
What this dataset offers:
colnames(CINstops)
## [1] "raw_row_number" "date"
## [3] "time" "location"
## [5] "lat" "lng"
## [7] "neighborhood" "beat"
## [9] "subject_race" "subject_sex"
## [11] "officer_assignment" "type"
## [13] "disposition" "arrest_made"
## [15] "citation_issued" "warning_issued"
## [17] "outcome" "reason_for_stop"
## [19] "vehicle_make" "vehicle_model"
## [21] "vehicle_registration_state" "vehicle_year"
## [23] "raw_race" "raw_action_taken_cid"
## [25] "raw_field_subject_cid"
How many stops do we have in our dataset?
nrow(CINstops)
## [1] 315281
summary(CINstops)
## raw_row_number date time
## Length:315281 Length:315281 Length:315281
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## location lat lng neighborhood
## Length:315281 Min. :-84.69 Min. :-97.60 Length:315281
## Class :character 1st Qu.: 39.11 1st Qu.:-84.54 Class :character
## Mode :character Median : 39.13 Median :-84.52 Mode :character
## Mean : 38.23 Mean :-83.62
## 3rd Qu.: 39.15 3rd Qu.:-84.50
## Max. : 45.53 Max. : 39.27
## NA's :261 NA's :261
## beat subject_race subject_sex
## Length:315281 Length:315281 Length:315281
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## officer_assignment type disposition arrest_made
## Length:315281 Length:315281 Length:315281 Mode :logical
## Class :character Class :character Class :character FALSE:269176
## Mode :character Mode :character Mode :character TRUE :45879
## NA's :226
##
##
##
## citation_issued warning_issued outcome reason_for_stop
## Mode :logical Mode :logical Length:315281 Length:315281
## FALSE:132897 FALSE:258896 Class :character Class :character
## TRUE :182158 TRUE :56159 Mode :character Mode :character
## NA's :226 NA's :226
##
##
##
## vehicle_make vehicle_model vehicle_registration_state
## Length:315281 Length:315281 Length:315281
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## vehicle_year raw_race raw_action_taken_cid
## Min. : 1932 Length:315281 Length:315281
## 1st Qu.: 1998 Class :character Class :character
## Median : 2002 Mode :character Mode :character
## Mean : 2002
## 3rd Qu.: 2006
## Max. :16998
## NA's :2922
## raw_field_subject_cid
## Length:315281
## Class :character
## Mode :character
##
##
##
##
Because the date is a character here, and not a date, so we’re changing it to a date format.
CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")
max(CINstops$date)
## [1] "2018-05-28"
Putting times in a time vector.
library(hms)
CINstops$time<-parse_hms(CINstops$time)
Looking at stops per year
CINstops %>%
group_by(year(date)) %>%
summarize(Total=n())
## # A tibble: 10 x 2
## `year(date)` Total
## <dbl> <int>
## 1 2009 54630
## 2 2010 52664
## 3 2011 43446
## 4 2012 37876
## 5 2013 25646
## 6 2014 25440
## 7 2015 25372
## 8 2016 21733
## 9 2017 20415
## 10 2018 8059
CINstops <- CINstops %>%
filter(year(date) < 2018, year(date) > 2008) #Cincy only has part of 2018 and the data from before 2009 is spotty so we filtered that out
If you just want to look at minor stops.
CINstops %>%
group_by(reason_for_stop) %>%
summarize(Total=n())
## # A tibble: 146 x 2
## reason_for_stop Total
## <chr> <int>
## 1 <NA> 276001
## 2 --- DISCON CALL 2
## 3 ABDUCTION 1
## 4 ACCIDENT NO INJURIES 53
## 5 ACCIDENT WITH INJURIES 1
## 6 ACCIDENT-VEH INTO BLDG 1
## 7 ADVISED COMPLAINT 22
## 8 ADVISED INCIDENT 9
## 9 ANIMAL COMPLAINT 7
## 10 ASSAULT IN PROGRESS 13
## # … with 136 more rows
#unique(CINstops$reason_for_stop)
Taking out anything that doesn’t sound like a traffic stop:
“911 DISCON CALL” “ACCIDENT NO INJURIES” “ACCIDENT WITH INJURIES” “ACCIDENT-VEH INTO BLDG” “ADVISED COMPLAINT”
“ADVISED INCIDENT” “ANIMAL COMPLAINT” “ASSAULT IN PROGRESS” “ASSAULT J/O-NO INJS” “ASSAULT REPORT”
“ATTEMPT TO LOCATE” “AUTO ACC INJ-POL ONL” “AUTO ACCIDENT - NO I” “AUTO ACCIDENT INJURI” “AUTO CRASH INTO BUIL”
, “AUTO THEFT IN PROGRE” “AUTO THEFT J/O” “AUTO THEFT OR RECOVE” “AUTO THEFT REPORT” “B&E REPORT”
,“BACKUP REQUIRED” “BE IN PROG/JO” “BREAKING & ENTERING IN PROGRESS” “CAR IN VIOLATION” “CELL DISCON OR SICAL”
, “CHILD-VICTIM OF CRIM” “COMPLAINT OF PANHAND” “COMPLAINT OF PROSTIT” “CRIMINAL DAMAGING” “CRIMINAL DAMAGING IN”
, “CRIMINAL DAMAGING RE” “CURFEW VIOLATION” “CUTTING IN PROG/JO” “DEFAULT INCIDENT TYP” “DIRECT ALARM”
“DOMESTIC VIOLENCE” “DOMESTIC VIOLENCE J/O” “DRUG ACTIVITY”
, “DRUG COMPLAINT NOT I” “FAMILY TROUBLE” “FAMILY TROUBLE - NON VIOLENT” “FIGHT IN PROGRESS” “FIRE POLICE REQUEST”
, “FIREWORKS” “GENERAL INFO BROADCAST” “GENERAL INFORMATION-” “HEROIN OD” “KEYS LOCKED IN AUTO”
, “LARGE GROUP ASSAULT” “MAINTENANCE RUN” “MEDIA REQUEST FOR IN” “MEET OTHER LAW ENFORCEMENT/AGENCY” “MEET POLICE OFFICER-”
, “MENACING IN PROG/JO” “MENACING J/O” “MENTALLY IMPAIRED NON VIOL” “MENTALLY IMPAIRED VIOL” “MENTALLY IMPAIRED-NO”
, “MISSING PERSON REPOR” “NEIGHBOR TROUBLE” “NEIGHBOR TROUBLE - NON VIOL” “NOISE COMPLAINT” “NON-RESIDENTIAL BURG”
, “NONRESIDENTIAL ALARM DROP” “OFF DUTY POLICE DETAILS” “OFFICER BACKUP/NON ASSISTANCE” “PANHANDLER” “PERS DOWN POL ONLY” “PERSON ARMED W/ GUN” “PERSON ARMED W/WEAPON” “PERSON DOWN - UNK REASON” “PERSON DOWN AND OUT”
, “PERSON W/ WEAPON (OT” “PRISONER” “PROSTITUTE COMPLAINT” “PROWLER” “RAPE J/O OR IN PROGRESS” “REPO” “RESIDENTIAL ALARM DROP” “RESIDENTIAL BURGLAR” “ROBBERY IN PROGRESS-”
, “ROBBERY J/O OR IN PROGRESS” “ROBBERY REPORT” “SERVICE” “SEX OFFENSE OTHER THAN RAPE” “SEX OFFENSES OTHER T”
, “SHOOTING ALREADY OCC” “SHOOTING JO POL ONLY” “SHOT SPOTTER ACTIVITY” “SHOTS FIRED” “SPECIAL DETAIL”
, “STALKING J/O OR IN PROGRESS” “STATION RUN” “STRUCTURE FIRE” “TEST INCIDENT” “THEFT IN PROG/JO” “THEFT J/O OR IN PROGRESS” “THEFT REPORT” “TRANSFERRED CALL” “TRESPASSER” “TRESPASSERS”
,“TOWED VEH - FOR RECORDS USE ONLY” “WARRANT SERVICE”
, “HOLDUP/PANIC ALARM”
The majority are not listed. We also kept
CINstops <- CINstops %>%
filter(
str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))
‘DIRECTED PATROL|DIRECTED PATROL - VEHICLE|DIRECTED PATROL - WALKING|DIRECTED PATROL-WALK| DISORD GROUP|DISORDERLY CROWD|DISORDERLY PERSON|HAZARD TO TRAFFIC/PEDESTRIAN|INV DRUG COMPLAINT|INV POSSIBLE WANTED|INV POSSIBLE WANTED PERSON|INV UNKNOWN TROUBLE|INVESTIGATE|INVESTIGATION|JUVENILE COMPLAINT|PARKING VIOLATION|QUERY|SUSPECT STOP |SUSPICIOUS PERSONS OR SITUATION|SUSPICIOUS SITUATION|TRAFFIC HAZARD|TRAFFIC PURSUIT|TRAFFIC STOP|TRAFFIC TIE|UNKNOWN TROUBLE’), is.na(reason_for_stop))
Looking at Arrest Rates- within each race
ArrestsByRace <- CINstops %>%
group_by(subject_race, arrest_made) %>%
summarize(Total =n()) %>%
group_by(subject_race) %>%
mutate(pct=Total/sum(Total))
datatable(ArrestsByRace)
A plot of that.
ArrestsRateByRace <- na.omit(ArrestsByRace)
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=arrest_made))+
geom_col(position="dodge")