This project looks at police stops in Cincinnati, Columbus, and Cleveland, Ohio, as part of Stanford University’s Open Policing Project. Stanford developed the base R script for this, including the veil of darkness test, and gathered some of the records. A team of Ohio journalists gathered additional records and worked for eight months to add further analysis, such as the points in a polygon part, and to conduct interviews, make graphics, etc.

Methodology and Caveats:

This is NOT an official study, nor does it claim to be. But everything in here is accurate to the best of our knowledge, checked with multiple sources at the time of publication.

First, we are comparing the population of each city, not the drivers in each city, and the two will differ slightly. The Ohio Dept. of Public Safety, which houses the BMV, does not break down drivers by race, so Census data was the best proxy we could find.

Second, this doesn’t account for people driving into and out of the city. Getting a ticket there doesn’t necessarily mean you live there.

A note on race: Race is counted two different ways. For a full discussion of this, see The Curious Journalist’s Guide to Data.

Essentially, “Hispanic” is considered an “ethnicity” and not a “race.” So you can be white and Hispanic or black and Hispanic. (This also points to the stupidity of treating someone differently because of their race, which of course is the whole story, as we can’t even define it, really.)

For Cincinnati and Columbus, the actual number of Hispanics was quite low, so we went with the definition “non-Hispanic white.” In Cleveland, where the population is 11 percent Hispanic, police didn’t mark the race of anyone with a ticket as “Hispanic.” (Though curiously they did mark 200 as “Arabic.”) So to more accurately portray Cleveland, we went with “white,” including Hispanics.

This doesn’t greatly affect our analysis, though: the number of Hispanic drivers recorded is so small that this data probably does not tell us anything about Hispanics, and it’s not possible to draw any conclusions about their results.

Other caveats:

Thank you to D. Kahle and H. Wickham for providing ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. link

Cincinnati

loading libraries and data

## Libraries to include
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(dplyr)
library(lutz)
library(suncalc)
library(splines)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tabulizer)
library(pdftables)
library(hms)
## 
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
## 
##     hms
library(DT)

## Load the data
# Replace the path below with the path to where the data lives on the machine running this script
#data_path <- "~/Code/SunlightStops/SunlightStopsStory/oh_cincinnati_2019_02_25.csv"
#data_path <- "~/Code/SunlightStops/SunlightStopsStory/CinStopsByBlock2017.csv"
#CINstops <- read_csv(data_path)
#table(CINstops$subject_race)


CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")



#CINstops <- rio::import("CinStopsWithRaceByBlock.xlsx")

# Additional data and fixed values we'll be using
population_2016 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
CINnum_people = c(5418, 127615, 9554, 8489, 145512) #This is where I got this data: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
) %>% 
mutate(subject_race = as.factor(subject_race))

#Have to add in the latitude and longitude of wherever you are (this is slightly different for Columbus)
CINcenter_lat <- 39.1031
CINcenter_lng <- -84.5120

What this dataset offers:

colnames(CINstops)
##  [1] "raw_row_number"             "date"                      
##  [3] "time"                       "location"                  
##  [5] "lat"                        "lng"                       
##  [7] "neighborhood"               "beat"                      
##  [9] "subject_race"               "subject_sex"               
## [11] "officer_assignment"         "type"                      
## [13] "disposition"                "arrest_made"               
## [15] "citation_issued"            "warning_issued"            
## [17] "outcome"                    "reason_for_stop"           
## [19] "vehicle_make"               "vehicle_model"             
## [21] "vehicle_registration_state" "vehicle_year"              
## [23] "raw_race"                   "raw_action_taken_cid"      
## [25] "raw_field_subject_cid"

How many stops do we have in our dataset?

nrow(CINstops)
## [1] 315281
summary(CINstops)
##  raw_row_number         date               time          
##  Length:315281      Length:315281      Length:315281     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    location              lat              lng         neighborhood      
##  Length:315281      Min.   :-84.69   Min.   :-97.60   Length:315281     
##  Class :character   1st Qu.: 39.11   1st Qu.:-84.54   Class :character  
##  Mode  :character   Median : 39.13   Median :-84.52   Mode  :character  
##                     Mean   : 38.23   Mean   :-83.62                     
##                     3rd Qu.: 39.15   3rd Qu.:-84.50                     
##                     Max.   : 45.53   Max.   : 39.27                     
##                     NA's   :261      NA's   :261                        
##      beat           subject_race       subject_sex       
##  Length:315281      Length:315281      Length:315281     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  officer_assignment     type           disposition        arrest_made    
##  Length:315281      Length:315281      Length:315281      Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:269176   
##  Mode  :character   Mode  :character   Mode  :character   TRUE :45879    
##                                                           NA's :226      
##                                                                          
##                                                                          
##                                                                          
##  citation_issued warning_issued    outcome          reason_for_stop   
##  Mode :logical   Mode :logical   Length:315281      Length:315281     
##  FALSE:132897    FALSE:258896    Class :character   Class :character  
##  TRUE :182158    TRUE :56159     Mode  :character   Mode  :character  
##  NA's :226       NA's :226                                            
##                                                                       
##                                                                       
##                                                                       
##  vehicle_make       vehicle_model      vehicle_registration_state
##  Length:315281      Length:315281      Length:315281             
##  Class :character   Class :character   Class :character          
##  Mode  :character   Mode  :character   Mode  :character          
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##   vehicle_year     raw_race         raw_action_taken_cid
##  Min.   : 1932   Length:315281      Length:315281       
##  1st Qu.: 1998   Class :character   Class :character    
##  Median : 2002   Mode  :character   Mode  :character    
##  Mean   : 2002                                          
##  3rd Qu.: 2006                                          
##  Max.   :16998                                          
##  NA's   :2922                                           
##  raw_field_subject_cid
##  Length:315281        
##  Class :character     
##  Mode  :character     
##                       
##                       
##                       
## 

Because the date is stored as a character here, not a date, we’re changing it to a date format.

CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")
max(CINstops$date)
## [1] "2018-05-28"

Putting times in a time vector.

library(hms)
CINstops$time<-parse_hms(CINstops$time)

Looking at stops per year

CINstops %>% 
  group_by(year(date)) %>% 
  summarize(Total=n())
## # A tibble: 10 x 2
##    `year(date)` Total
##           <dbl> <int>
##  1         2009 54630
##  2         2010 52664
##  3         2011 43446
##  4         2012 37876
##  5         2013 25646
##  6         2014 25440
##  7         2015 25372
##  8         2016 21733
##  9         2017 20415
## 10         2018  8059
CINstops <- CINstops %>% 
  filter(year(date) < 2018, year(date) > 2008) #Cincy only has part of 2018 and the data from before 2009 is spotty so we filtered that out

If you just want to look at minor stops.

CINstops %>% 
  group_by(reason_for_stop) %>% 
  summarize(Total=n())
## # A tibble: 146 x 2
##    reason_for_stop         Total
##    <chr>                   <int>
##  1 <NA>                   276001
##  2 --- DISCON CALL             2
##  3 ABDUCTION                   1
##  4 ACCIDENT NO INJURIES       53
##  5 ACCIDENT WITH INJURIES      1
##  6 ACCIDENT-VEH INTO BLDG      1
##  7 ADVISED COMPLAINT          22
##  8 ADVISED INCIDENT            9
##  9 ANIMAL COMPLAINT            7
## 10 ASSAULT IN PROGRESS        13
## # … with 136 more rows
#unique(CINstops$reason_for_stop)

Taking out anything that doesn’t sound like a traffic stop:

“911 DISCON CALL”, “ACCIDENT NO INJURIES”, “ACCIDENT WITH INJURIES”, “ACCIDENT-VEH INTO BLDG”, “ADVISED COMPLAINT”,
“ADVISED INCIDENT”, “ANIMAL COMPLAINT”, “ASSAULT IN PROGRESS”, “ASSAULT J/O-NO INJS”, “ASSAULT REPORT”,
“ATTEMPT TO LOCATE”, “AUTO ACC INJ-POL ONL”, “AUTO ACCIDENT - NO I”, “AUTO ACCIDENT INJURI”, “AUTO CRASH INTO BUIL”,
“AUTO THEFT IN PROGRE”, “AUTO THEFT J/O”, “AUTO THEFT OR RECOVE”, “AUTO THEFT REPORT”, “B&E REPORT”,
“BACKUP REQUIRED”, “BE IN PROG/JO”, “BREAKING & ENTERING IN PROGRESS”, “CAR IN VIOLATION”, “CELL DISCON OR SICAL”,
“CHILD-VICTIM OF CRIM”, “COMPLAINT OF PANHAND”, “COMPLAINT OF PROSTIT”, “CRIMINAL DAMAGING”, “CRIMINAL DAMAGING IN”,
“CRIMINAL DAMAGING RE”, “CURFEW VIOLATION”, “CUTTING IN PROG/JO”, “DEFAULT INCIDENT TYP”, “DIRECT ALARM”,
“DOMESTIC VIOLENCE”, “DOMESTIC VIOLENCE J/O”, “DRUG ACTIVITY”, “DRUG COMPLAINT NOT I”, “FAMILY TROUBLE”,
“FAMILY TROUBLE - NON VIOLENT”, “FIGHT IN PROGRESS”, “FIRE POLICE REQUEST”, “FIREWORKS”, “GENERAL INFO BROADCAST”,
“GENERAL INFORMATION-”, “HEROIN OD”, “KEYS LOCKED IN AUTO”, “LARGE GROUP ASSAULT”, “MAINTENANCE RUN”,
“MEDIA REQUEST FOR IN”, “MEET OTHER LAW ENFORCEMENT/AGENCY”, “MEET POLICE OFFICER-”, “MENACING IN PROG/JO”, “MENACING J/O”,
“MENTALLY IMPAIRED NON VIOL”, “MENTALLY IMPAIRED VIOL”, “MENTALLY IMPAIRED-NO”, “MISSING PERSON REPOR”, “NEIGHBOR TROUBLE”,
“NEIGHBOR TROUBLE - NON VIOL”, “NOISE COMPLAINT”, “NON-RESIDENTIAL BURG”, “NONRESIDENTIAL ALARM DROP”, “OFF DUTY POLICE DETAILS”,
“OFFICER BACKUP/NON ASSISTANCE”, “PANHANDLER”, “PERS DOWN POL ONLY”, “PERSON ARMED W/ GUN”, “PERSON ARMED W/WEAPON”,
“PERSON DOWN - UNK REASON”, “PERSON DOWN AND OUT”, “PERSON W/ WEAPON (OT”, “PRISONER”, “PROSTITUTE COMPLAINT”,
“PROWLER”, “RAPE J/O OR IN PROGRESS”, “REPO”, “RESIDENTIAL ALARM DROP”, “RESIDENTIAL BURGLAR”,
“ROBBERY IN PROGRESS-”, “ROBBERY J/O OR IN PROGRESS”, “ROBBERY REPORT”, “SERVICE”, “SEX OFFENSE OTHER THAN RAPE”,
“SEX OFFENSES OTHER T”, “SHOOTING ALREADY OCC”, “SHOOTING JO POL ONLY”, “SHOT SPOTTER ACTIVITY”, “SHOTS FIRED”,
“SPECIAL DETAIL”, “STALKING J/O OR IN PROGRESS”, “STATION RUN”, “STRUCTURE FIRE”, “TEST INCIDENT”,
“THEFT IN PROG/JO”, “THEFT J/O OR IN PROGRESS”, “THEFT REPORT”, “TRANSFERRED CALL”, “TRESPASSER”, “TRESPASSERS”,
“TOWED VEH - FOR RECORDS USE ONLY”, “WARRANT SERVICE”, “HOLDUP/PANIC ALARM”

The majority of stops have no reason listed. We also kept these categories:

CINstops <- CINstops %>% 
  filter(
    str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))                

“DIRECTED PATROL”, “DIRECTED PATROL - VEHICLE”, “DIRECTED PATROL - WALKING”, “DIRECTED PATROL-WALK”, “DISORD GROUP”, “DISORDERLY CROWD”, “DISORDERLY PERSON”, “HAZARD TO TRAFFIC/PEDESTRIAN”, “INV DRUG COMPLAINT”, “INV POSSIBLE WANTED”, “INV POSSIBLE WANTED PERSON”, “INV UNKNOWN TROUBLE”, “INVESTIGATE”, “INVESTIGATION”, “JUVENILE COMPLAINT”, “PARKING VIOLATION”, “QUERY”, “SUSPECT STOP”, “SUSPICIOUS PERSONS OR SITUATION”, “SUSPICIOUS SITUATION”, “TRAFFIC HAZARD”, “TRAFFIC PURSUIT”, “TRAFFIC STOP”, “TRAFFIC TIE”, “UNKNOWN TROUBLE”, plus stops where reason_for_stop is NA.

Looking at Arrest Rates- within each race

ArrestsByRace <- CINstops %>% 
  group_by(subject_race, arrest_made) %>% 
  summarize(Total =n()) %>% 
  group_by(subject_race) %>% 
  mutate(pct=Total/sum(Total))
datatable(ArrestsByRace)

A plot of that.

ArrestsRateByRace <- na.omit(ArrestsByRace) 
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=arrest_made))+
geom_col(position="dodge")

#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())

Looking at the arrest percentage overall

Arrest_Percentage_ByRace <- CINstops %>%
  filter(arrest_made=="TRUE") %>%
  group_by(subject_race) %>% 
  summarize(Total=n()) %>% 
  mutate(pct=Total/sum(Total))
datatable(Arrest_Percentage_ByRace)

Note: Cincy doesn’t have search data.

Filter out NAs in lat/lng

CINstops <- filter(CINstops, !is.na(lng)) # drop rows with missing coordinates

Counting Race

CINstops %>% 
count(subject_race)
## # A tibble: 6 x 2
##   subject_race                n
##   <chr>                   <int>
## 1 asian/pacific islander   1574
## 2 black                  173446
## 3 hispanic                  497
## 4 other                      21
## 5 unknown                  5866
## 6 white                  124171

Plot of year and race

CINstops %>% 
count(year = year(date), subject_race) %>% 
ggplot(aes(x = year, y = n, color = subject_race)) +
geom_point() +
geom_line() 

This creates a proportion

population_2016 <- population_2016 %>% 
mutate(Proportion_Of_Population = CINnum_people / sum(CINnum_people)) #I added this to the population data frame

Stops percentage versus population percentage

StopRateDataTable <- CINstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>% 
left_join(
population_2016,
by = "subject_race"
) %>% 
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)

So it looks like black residents are 43 percent of the population, though they are 57 percent of the stops.

Calculating the rate of stops (for all cities)

The source for this method is the Missouri Attorney General.

Using these numbers (population: 49% white, 43% black, 8% other; stops: 57% black, 41% white, 2% other):

White disparity index value (stop % divided by pop %): 41/49 = 0.8367347

Black disparity index value: 57/43 = 1.325581

The black stop rate is 1.58 times the white stop rate (1.325581 / 0.8367347).

Thus, accounting for the racial proportion of Cincinnati’s 2016 population, blacks were stopped at a rate 58% higher than whites.

41/49
## [1] 0.8367347
57/43
## [1] 1.325581
1.325581/0.8367347
## [1] 1.584231
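The same arithmetic can be wrapped in a small helper so it can be reused for Columbus and Cleveland later. A minimal sketch; the function name `disparity_index` is ours, not part of the original script:

```r
# Disparity index (per the Missouri Attorney General's method):
# a group's share of stops divided by its share of the population.
# A value above 1 means the group is stopped more often than its
# population share alone would predict.
disparity_index <- function(stop_pct, pop_pct) stop_pct / pop_pct

white_div <- disparity_index(41, 49)  # 0.8367347
black_div <- disparity_index(57, 43)  # 1.325581
black_div / white_div                 # 1.584231
```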

Veil of Darkness Test (Cincinnati)

Filtering for vehicular data

CINstops <- CINstops %>% filter(type == "vehicular") #Note this is pretty much all stop data

Calculating the sunset and dusk times for Cincinnati for each date in the dataset.

(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark. We’ll explain why this is relevant in a moment.)

# Get timezone for Cincinnati
CINtz <- lutz::tz_lookup_coords(CINcenter_lat, CINcenter_lng, warn = F)

# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}

# Compute sunset time for each date in our dataset
sunset_times <- 
CINstops %>%
mutate(
lat = CINcenter_lat,
lon = CINcenter_lng
) %>% 
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = ., 
keep = c("dawn","sunrise", "sunset", "dusk"), 
tz = CINtz
) %>% 
mutate_at(vars("dawn","sunrise","sunset", "dusk"), ~format(., "%H:%M:%S")) %>% 
mutate(
  sunset_minute = time_to_minute(sunset),
  dusk_minute = time_to_minute(dusk),
  dawn_minute = time_to_minute(dawn),
  sunrise_minute = time_to_minute(sunrise),
  date = ymd(str_sub(date, 1, 10))
) %>% 
  select(date, dawn, sunrise, sunset, dusk, ends_with("minute"))

Great. Now, using sunset_times, what is Cincy’s inter-twilight period?

sunset_times %>% 
filter(dusk == min(dusk) | dusk == max(dusk) | dawn == min(dawn) | dawn == max(dawn))
##          date     dawn  sunrise   sunset     dusk sunset_minute
## 1  2010-06-15 05:40:41 06:12:51 21:06:57 21:39:07          1266
## 2  2017-06-15 05:40:41 06:12:51 21:07:01 21:39:11          1267
## 3  2010-11-06 07:44:28 08:12:29 18:33:30 19:01:30          1113
## 4  2012-06-15 05:40:41 06:12:52 21:07:07 21:39:18          1267
## 5  2016-06-26 05:42:55 06:15:05 21:09:19 21:41:30          1269
## 6  2009-06-14 05:40:41 06:12:50 21:06:40 21:38:49          1266
## 7  2012-12-06 07:15:05 07:44:53 17:16:45 17:46:32          1036
## 8  2016-06-15 05:40:41 06:12:52 21:07:06 21:39:17          1267
## 9  2009-12-06 07:14:54 07:44:41 17:16:45 17:46:32          1036
## 10 2011-06-15 05:40:41 06:12:51 21:06:51 21:39:00          1266
## 11 2010-12-07 07:15:31 07:45:20 17:16:44 17:46:32          1036
## 12 2015-12-07 07:15:16 07:45:04 17:16:44 17:46:32          1036
## 13 2017-06-14 05:40:41 06:12:50 21:06:38 21:38:47          1266
## 14 2011-12-07 07:15:18 07:45:06 17:16:44 17:46:32          1036
## 15 2009-06-15 05:40:41 06:12:52 21:07:02 21:39:13          1267
## 16 2012-06-14 05:40:41 06:12:50 21:06:45 21:38:54          1266
## 17 2010-12-06 07:14:41 07:44:27 17:16:46 17:46:32          1036
## 18 2009-06-27 05:43:14 06:15:23 21:09:20 21:41:30          1269
## 19 2014-12-06 07:14:39 07:44:25 17:16:46 17:46:32          1036
## 20 2016-12-06 07:15:03 07:44:51 17:16:45 17:46:32          1036
## 21 2012-06-26 05:42:55 06:15:06 21:09:19 21:41:30          1269
## 22 2012-06-27 05:43:19 06:15:28 21:09:20 21:41:30          1269
## 23 2017-12-06 07:14:50 07:44:37 17:16:45 17:46:32          1036
## 24 2009-06-26 05:42:51 06:15:01 21:09:19 21:41:30          1269
## 25 2011-12-06 07:14:28 07:44:13 17:16:47 17:46:32          1036
## 26 2010-06-27 05:43:07 06:15:17 21:09:20 21:41:30          1269
## 27 2014-06-15 05:40:41 06:12:51 21:06:56 21:39:06          1266
## 28 2013-06-14 05:40:41 06:12:50 21:06:39 21:38:48          1266
## 29 2015-06-27 05:43:01 06:15:11 21:09:20 21:41:30          1269
## 30 2016-06-14 05:40:41 06:12:50 21:06:44 21:38:53          1266
## 31 2013-06-27 05:43:13 06:15:22 21:09:20 21:41:30          1269
## 32 2017-06-27 05:43:12 06:15:21 21:09:20 21:41:30          1269
## 33 2014-12-07 07:15:29 07:45:18 17:16:44 17:46:32          1036
## 34 2016-06-27 05:43:18 06:15:27 21:09:20 21:41:30          1269
## 35 2013-06-15 05:40:41 06:12:51 21:07:01 21:39:12          1267
## 36 2011-06-27 05:43:01 06:15:12 21:09:20 21:41:30          1269
## 37 2015-06-15 05:40:41 06:12:51 21:06:50 21:39:00          1266
## 38 2014-06-27 05:43:07 06:15:16 21:09:20 21:41:30          1269
## 39 2013-12-06 07:14:52 07:44:39 17:16:45 17:46:32          1036
##    dusk_minute dawn_minute sunrise_minute
## 1         1299         340            372
## 2         1299         340            372
## 3         1141         464            492
## 4         1299         340            372
## 5         1301         342            375
## 6         1298         340            372
## 7         1066         435            464
## 8         1299         340            372
## 9         1066         434            464
## 10        1299         340            372
## 11        1066         435            465
## 12        1066         435            465
## 13        1298         340            372
## 14        1066         435            465
## 15        1299         340            372
## 16        1298         340            372
## 17        1066         434            464
## 18        1301         343            375
## 19        1066         434            464
## 20        1066         435            464
## 21        1301         342            375
## 22        1301         343            375
## 23        1066         434            464
## 24        1301         342            375
## 25        1066         434            464
## 26        1301         343            375
## 27        1299         340            372
## 28        1298         340            372
## 29        1301         343            375
## 30        1298         340            372
## 31        1301         343            375
## 32        1301         343            375
## 33        1066         435            465
## 34        1301         343            375
## 35        1299         340            372
## 36        1301         343            375
## 37        1299         340            372
## 38        1301         343            375
## 39        1066         434            464

So we are looking at times between 5:16 p.m. and 9:41 p.m. We need to keep only those times and take out twilight.

CIN_VODstops <- 
CINstops %>% 
left_join(
sunset_times,
by = "date"
) %>% 
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute | minute < dawn_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>% 
filter(
# Filter to get only the intertwilight period
  ###Skipping this part
#minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute)
)

Filter to just that time

CIN_VODstops <- CIN_VODstops %>% 
  filter(time > hm("17:16"), time < hm("21:41"))  # Earliest winter sunset and latest summer dusk in Cincy

If you want to ONLY look at highway stops

CIN_VODstops <- CIN_VODstops %>% filter(str_detect(location, 'I75|I71|I74|I 75|I 71|I 74'))
StopsByYearInDarkness <- CIN_VODstops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CIN_VODstops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This adds the two above together

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  #View(TwilightTable)

With Hispanics, Asians, etc. grouped together

These are the raw numbers.

TwilightTable <- TwilightTable %>% 
  mutate(others = `unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`unknown`+`asian/pacific islander`) %>% 
  group_by(year) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=others/total_stopped)
datatable(TwilightTable)

This compares the percentages of black and white drivers stopped in daylight with those stopped during the same clock-time window in darkness (thanks to the seasonal time change).

TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
datatable(TwilightTableTogether)
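The tables above are descriptive. The Stanford script’s actual veil-of-darkness test is a logistic regression of driver race on a darkness indicator, controlling for clock time with a natural spline (which is why `splines` is loaded at the top); that model step isn’t shown in this excerpt. Below is a minimal sketch of the model on simulated data, since the real version expects the `is_dark`, `is_black`, and `minute` columns built earlier. The simulated numbers are illustrative assumptions, not real stop data:

```r
library(splines)

set.seed(1)
n <- 5000
# Simulated stand-in for CIN_VODstops: clock minute in the
# intertwilight window, dusk minute varying by season, race flag
minute  <- sample(1036:1301, n, replace = TRUE)
dusk    <- sample(1066:1301, n, replace = TRUE)
is_dark <- minute > dusk
# Illustrative assumption: black drivers are a larger share of
# daylight stops (58%) than darkness stops (50%)
is_black <- rbinom(n, 1, ifelse(is_dark, 0.50, 0.58))

# Race regressed on darkness, flexibly controlling for clock time.
# A negative is_dark coefficient means black drivers make up a
# smaller share of stops after dark, consistent with bias when
# officers can see drivers' race before stopping them.
mod <- glm(is_black ~ is_dark + ns(minute, df = 6),
           family = binomial)
coef(summary(mod))["is_darkTRUE", ]
```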

Now, the same analysis, but with non-highway vehicular stops. (If you’re stopping someone on foot, you can see that person in daylight or darkness.)

Bringing in Cincinnati stops again and getting the data to the format we want.

CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")

CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")

CINstops <- CINstops %>% 
  filter(year(date) < 2017, year(date) > 2008) %>% 
  filter(type == "vehicular")

CINstops$time<-parse_hms(CINstops$time)

CINstops <- CINstops %>% 
  filter(
    str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))  
CIN_VODstops <- 
CINstops %>% 
left_join(
sunset_times,
by = "date"
) %>% 
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute | minute < dawn_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>% 
filter(
# Filter to get only the intertwilight period
  ###Skipping this part
#minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute)
)

Filter to just that time

CIN_VODstops <- CIN_VODstops %>% 
  filter(time > hm("17:16"), time < hm("21:41"))  # Earliest winter sunset and latest summer dusk in Cincy

If you want to ONLY look at street stops

CIN_VODstops <- CIN_VODstops %>% filter(!str_detect(location, 'I75|I71|I74|I 75|I 71|I 74'))
StopsByYearInDarkness <- CIN_VODstops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CIN_VODstops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This adds the two above together

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  #View(TwilightTable)

With Hispanics, Asians, etc. grouped together

These are the raw numbers.

TwilightTable <- TwilightTable %>% 
  mutate(other = `unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`unknown`+`asian/pacific islander`) %>% 
  group_by(year) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
datatable(TwilightTable)

This compares the percentages of black and white drivers stopped in daylight with those stopped during the same clock-time window in darkness (thanks to the seasonal time change).

TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
datatable(TwilightTableTogether)

Points in a Polygon Analysis

Download the shapefile (for all block groups) from CensusReporter.org: https://censusreporter.org/data/table/?table=B02001&geo_ids=16000US3915000,150|16000US3915000&primary_geo_id=16000US3915000

I saved it in my working directory.

Getting race info for each neighborhood from the census

#You'll have to set the file path here to wherever you saved the shapefile
CincyRaceInfoByLocation2 <- sf::st_read("acs2017_5yr_B02001_15000US390610047011.shp") %>% 
mutate(
  Total=B02001001, 
  White_Alone=B02001002,
  Black_Alone=B02001003,
  PctWhite = round((White_Alone/Total)*100, 1),
  PctBlack = round((Black_Alone/Total)*100, 1)) %>% 
slice(-n())
## Reading layer `acs2017_5yr_B02001_15000US390610047011' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390610047011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 298 features and 22 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -84.71216 ymin: 39.05188 xmax: -84.3454 ymax: 39.22502
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

Now we’re taking a look at our Cincinnati neighborhood stops

# We only need the columns with the latitude and longitude

CINstops <- rio::import("https://stacks.stanford.edu/file/druid:hp256wp2687/hp256wp2687_oh_cincinnati_2019_08_13.csv.zip")

CINstops <- filter(CINstops, !is.na(date))
CINstops$date <- as.Date(CINstops$date, format = "%Y-%m-%d")

CINstops <- CINstops %>% 
  filter(year(date) < 2017, year(date) > 2008) 

CINstops$time<-parse_hms(CINstops$time)

# Check and eliminate the rows that don't have location information
CINstops <- CINstops %>% 
  filter(!is.na(lat))

CINstops <- CINstops %>% 
  filter(
    str_detect(reason_for_stop, 'PATROL|DISORD|HAZARD|INV|PARKING|QUERY|SUSPECT|SUSPICIOUS|TRAFFIC|UNKNOWN')| is.na(reason_for_stop))  

And filter to only neighborhoods, taking out Antarctica. (A small number of stops may have had their lat and lng swapped, which put points in Antarctica.)

#We need the street stops only here. Stopping someone on the highway is not really "in" a neighborhood

CINstops <- CINstops %>% 
  filter(!lat<1,!lng>1) %>% 
  filter(!str_detect(location,'I75|I71|I74|I 75|I 71|I 74'))
  #filter(type=="vehicular", lat<46, lng>1)

Now, we can summarize the data by count and merge it back to the shape file and visualize it.

#Or, Getting Hamilton block groups (Cincinnati is entirely within Hamilton County)
library(tigris)
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
## 
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
## 
##     plot
options(tigris_class = "sf")
HamiltonBlockGroups <- block_groups("OH",county = "Hamilton")

CINstops_spatial <- CINstops %>% 
  st_as_sf(coords=c("lng", "lat"), crs="+proj=longlat") %>% 
  st_transform(crs=st_crs(HamiltonBlockGroups))



#Putting them all together
CINStopsWithTracts <- st_join(HamiltonBlockGroups, CINstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#Vlookup
CINStopsWithTracts$GEOID <- paste0("15000US",CINStopsWithTracts$GEOID)

This puts the stops and race percentage of each block together.

CINstops_With_Race_and_Tracts <-  CINStopsWithTracts

CINstops_With_Race_and_Tracts$PctWhite = CincyRaceInfoByLocation2$PctWhite[match(CINStopsWithTracts$GEOID, CincyRaceInfoByLocation2$geoid)]


CINstops_With_Race_and_Tracts$PctBlack = CincyRaceInfoByLocation2$PctBlack[match(CINStopsWithTracts$GEOID, CincyRaceInfoByLocation2$geoid)]

CINstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CincyRaceInfoByLocation2$Total[match(CINStopsWithTracts$GEOID, CincyRaceInfoByLocation2$geoid)]
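The three `match()` lookups above act as a vlookup. An equivalent, arguably more idiomatic approach is a single `left_join`. A sketch on toy data (the real frames are `CINStopsWithTracts` and `CincyRaceInfoByLocation2`; the GEOIDs and values below are made up for illustration):

```r
library(dplyr)

# Toy stand-ins: stops keyed by GEOID, census blocks keyed by geoid
stops  <- tibble(GEOID = c("15000US39061A", "15000US39061B",
                           "15000US39061A"))
blocks <- tibble(geoid    = c("15000US39061A", "15000US39061B"),
                 PctWhite = c(80.1, 12.3),
                 PctBlack = c(15.2, 85.0),
                 Total    = c(1200, 950))

# One join brings over all three columns at once
joined <- stops %>%
  left_join(blocks, by = c("GEOID" = "geoid")) %>%
  rename(Total_People_Per_Census_Block = Total)
```

For `sf` objects like `CINStopsWithTracts`, the same `left_join` call works because sf data frames support dplyr verbs.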

And finding the total number of stops per resident in neighborhoods with at least 75 percent white or black residents

CINstops_With_Race_and_Tracts <- CINstops_With_Race_and_Tracts %>% 
  filter(!is.na(PctWhite), !is.na(PctBlack))

Mostly_White_Cincy <- CINstops_With_Race_and_Tracts %>% 
filter(PctWhite>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))



Mostly_Black_Cincy <- CINstops_With_Race_and_Tracts %>% 
filter(PctBlack>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))


Mixed_Cincy <- CINstops_With_Race_and_Tracts %>% 
filter(PctBlack<=75, PctWhite<=75) %>% # neither group is more than 75 percent
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))


mean(Mostly_White_Cincy$Per_Capita)
## [1] 0.5271393
mean(Mostly_Black_Cincy$Per_Capita)
## [1] 1.311483

This looks at the rate in ALL predominantly black areas and all predominantly white areas together.

And figures out the percentage more in black areas.

Total_People_In_Mostly_White_Cincy_Blocks <- sum(Mostly_White_Cincy$Total_In_Block)
Total_People_In_Mostly_Black_Cincy_Blocks <- sum(Mostly_Black_Cincy$Total_In_Block)
Total_People_In_Mixed_Blocks <-sum(Mixed_Cincy$Total_In_Block)

Total_Stops_In_Mostly_White_Cincy_Blocks <-sum(Mostly_White_Cincy$Total)
Total_Stops_In_Mostly_Black_Cincy_Blocks <-sum(Mostly_Black_Cincy$Total)
Total_Stops_In_Mixed_Blocks <-sum(Mixed_Cincy$Total)


Total_Stops_In_Mostly_White_Cincy_Blocks/Total_People_In_Mostly_White_Cincy_Blocks
## [1] 0.3992041
Total_Stops_In_Mostly_Black_Cincy_Blocks/Total_People_In_Mostly_Black_Cincy_Blocks
## [1] 0.8788373
CIN_white_rate <- Total_Stops_In_Mostly_White_Cincy_Blocks/Total_People_In_Mostly_White_Cincy_Blocks
CIN_black_rate <- Total_Stops_In_Mostly_Black_Cincy_Blocks/Total_People_In_Mostly_Black_Cincy_Blocks

(CIN_black_rate-CIN_white_rate)/CIN_white_rate
## [1] 1.201474

For the map, we want to look at each tract, and see the percentage stopped for black, white, and other. (And then Tableau has built-in race by block shading.)

CIN_Percent_Pulled_over_By_Block <- CINstops_With_Race_and_Tracts %>%
  group_by(GEOID) %>%
  summarize(
    Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block),Total_In_Block=mean(Total_People_Per_Census_Block))

CIN_Percent_Pulled_over_By_Block <- st_set_geometry(CIN_Percent_Pulled_over_By_Block, NULL)
#Uncomment this to export it
#rio::export(CIN_Percent_Pulled_over_By_Block, "Cincinnati number of stops per census block.csv")

Columbus

First, Loading the data and necessary libraries

## Load the data
# Replace the path below with the path to where your data lives
data_path <- "oh_columbus_without_duplicates.csv" #Note: the Stanford data had a lot of duplicates, so we had to work with this data. 
COLstops <- read_csv(data_path)
## Parsed with column specification:
## cols(
##   raw_row_number = col_character(),
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   location = col_character(),
##   lat = col_double(),
##   lng = col_double(),
##   precinct = col_double(),
##   zone = col_double(),
##   subject_race = col_character(),
##   subject_sex = col_character(),
##   type = col_character(),
##   arrest_made = col_logical(),
##   citation_issued = col_logical(),
##   warning_issued = col_logical(),
##   outcome = col_character(),
##   search_conducted = col_logical(),
##   reason_for_stop = col_character()
## )
table(COLstops$subject_race)
## 
## asian/pacific islander                  black               hispanic 
##                   2633                  56001                   3288 
##          other/unknown                  white 
##                   4481                  61754
# Additional data and fixed values we'll be using
population_2016 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
COLnum_people = c(42308, 230963, 48331, 33997, 481439) #https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
) %>% 
mutate(subject_race = as.factor(subject_race))

COLcenter_lat <- 39.9612
COLcenter_lng <- -82.9988

Covering the basics

The columns we have

colnames(COLstops)
##  [1] "raw_row_number"   "date"             "time"            
##  [4] "location"         "lat"              "lng"             
##  [7] "precinct"         "zone"             "subject_race"    
## [10] "subject_sex"      "type"             "arrest_made"     
## [13] "citation_issued"  "warning_issued"   "outcome"         
## [16] "search_conducted" "reason_for_stop"

How many stops do we have in our dataset?

nrow(COLstops)
## [1] 128157

What date range does our data cover?

min(COLstops$date)
## [1] "2012-01-01"
max(COLstops$date)
## [1] "2016-12-30"

Dates default to Jan. 1 when the actual date is missing.
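One way to gauge how many records carry that Jan. 1 placeholder is to compare that day's count to its neighbors. A self-contained sketch on toy data (with the real `COLstops`, the same filter applies to `date`):

```r
library(dplyr)
library(lubridate)

# Toy stand-in: three records piled on Jan. 1, two on ordinary days
toy <- tibble(date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01",
                               "2016-01-02", "2016-06-15")))

# Count how many records land on Jan. 1 of any year
toy %>%
  filter(month(date) == 1, day(date) == 1) %>%
  nrow()
## [1] 3
```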

In Cincinnati we filter by date to get whole years. This is not necessary for Columbus Data

unique(COLstops$type)  # this shows every distinct stop type in the data
## [1] "vehicular"

NOTE: COLstops are ONLY vehicular, unlike Cincinnati.

This looks at the stop rate versus the population

StopRateDataTable <- COLstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>% 
left_join(
population_2016,
by = "subject_race"
) %>% 
mutate(stop_rate = n / sum(n), pop_rate = COLnum_people/sum(COLnum_people))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)

So here, we see black drivers are stopped at a rate much higher than their share of the population.

How about counting how many stops by year and race?

COLstops %>% 
  group_by(year(date)) %>% 
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
## # A tibble: 25 x 4
## # Groups:   year(date) [5]
##    `year(date)`  year subject_race               n
##           <dbl> <dbl> <chr>                  <int>
##  1         2012  2012 asian/pacific islander   509
##  2         2012  2012 black                  11796
##  3         2012  2012 hispanic                 737
##  4         2012  2012 other/unknown            857
##  5         2012  2012 white                  14976
##  6         2013  2013 asian/pacific islander   528
##  7         2013  2013 black                   9822
##  8         2013  2013 hispanic                 602
##  9         2013  2013 other/unknown            756
## 10         2013  2013 white                  12160
## # … with 15 more rows

A simple visualization of that.

COLstops %>% 
count(year = year(date), subject_race) %>% 
ggplot(aes(x = year, y = n, color = subject_race)) +
geom_point() +
geom_line() 

Now looking at Arrests

This looks at the number arrested within each race group.

ArrestsByRaceCOL <- COLstops %>% 
  group_by(subject_race, arrest_made) %>% 
  summarize(Total =n()) %>% 
  group_by(subject_race) %>% 
  mutate(pct=Total/sum(Total))
  datatable(ArrestsByRaceCOL)
ArrestsRateByRaceCOL <- na.omit(ArrestsByRaceCOL) 
ggplot(ArrestsRateByRaceCOL, aes(x=subject_race, y=Total, fill=arrest_made))+
geom_col(position="dodge")

#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())

Arrests altogether

Arrest_Percentage_ByRace <- COLstops %>%
  filter(arrest_made=="TRUE") %>%
  group_by(subject_race) %>% 
  summarize(Total=n()) %>% 
  mutate(pct=Total/sum(Total))
  datatable(Arrest_Percentage_ByRace)

Now looking at searches. (Percent searched by race)

SearchesByRaceCOL <- COLstops %>% 
  group_by(subject_race, search_conducted) %>%
  summarize(Total =n()) %>% 
  group_by(subject_race) %>% 
  mutate(pct=Total/sum(Total))
  datatable(SearchesByRaceCOL)
SearchesByRaceCOL <- na.omit(SearchesByRaceCOL)
ggplot(SearchesByRaceCOL, aes(x=subject_race, y=Total, fill=search_conducted))+
geom_col(position="dodge")

How many of those searched were arrested?

ArrestedAfterSearch <- COLstops %>% 
  filter(search_conducted=="TRUE") %>% 
  group_by(subject_race, arrest_made) %>%
  summarize(Total =n()) %>% 
  group_by(subject_race) %>% 
  mutate(pct=Total/sum(Total))
  datatable(ArrestedAfterSearch)
Search_Percentage_ByRace <- COLstops %>%
  filter(search_conducted=="TRUE") %>%
  group_by(subject_race) %>% 
  summarize(Total=n()) %>% 
  mutate(pct=Total/sum(Total))
  datatable(Search_Percentage_ByRace)

Columbus Disparity Index

white <- .48/.58
black <- .44/.28

black/white
## [1] 1.89881
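The hand-entered shares above can be wrapped in a small helper to make the calculation explicit (a sketch; the numbers are the same ones used above): each group's disparity index is its share of stops divided by its share of the population, and the headline figure is the black index over the white index.

```r
# Disparity index: share of stops relative to share of population
disparity_index <- function(stop_share, pop_share) stop_share / pop_share

white_idx <- disparity_index(0.48, 0.58)  # white share of stops / of population
black_idx <- disparity_index(0.44, 0.28)  # black share of stops / of population

round(black_idx / white_idx, 5)
## [1] 1.89881
```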

Veil of Darkness Columbus, Looking JUST at street stops

library(dplyr)
COLstops <- COLstops %>% filter(!str_detect(location,'70|71'))

Note: Columbus had 10,495 stops that could not be geocoded. These are removed because the rest of this analysis requires latitude and longitude. This could affect the outcome, though the number is comparatively small.

COLstops <- COLstops %>% 
  filter(!is.na(lat))

Let’s calculate the sunset and dusk times for Columbus in 2016.

(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark. We’ll explain why this is relevant in a moment.)

# Get timezone for Columbus
COLtz <- lutz::tz_lookup_coords(COLcenter_lat, COLcenter_lng, warn = F)

# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}

# Compute sunset time for each date in our dataset
sunset_times <- 
COLstops %>%
mutate(
lat = COLcenter_lat,
lon = COLcenter_lng
) %>% 
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = ., 
keep = c("sunset", "dusk"), 
tz = COLtz
) %>% 
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>% 
mutate(
  sunset_minute = time_to_minute(sunset),
  dusk_minute = time_to_minute(dusk),
  date = ymd(str_sub(date, 1, 10))
) %>% 
  select(date, sunset, dusk, ends_with("minute"))
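The `time_to_minute()` helper converts an "HH:MM:SS" string into minutes after midnight, which is what makes clock times comparable across dates. For example, a dusk of 17:38:27 becomes minute 1058 of the day, matching the `dusk_minute` column in the output below:

```r
library(lubridate)

# Same helper as above: hours and minutes since midnight
time_to_minute <- function(time) {
  hour(hms(time)) * 60 + minute(hms(time))
}

time_to_minute("17:38:27")
## [1] 1058
```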

Great. Now, using sunset_times, what is Columbus’ inter-twilight period?

sunset_times %>% 
filter(dusk == min(dusk) | dusk == max(dusk))
##         date   sunset     dusk sunset_minute dusk_minute
## 1 2012-06-26 21:06:01 21:38:49          1266        1298
## 2 2014-06-27 21:06:02 21:38:49          1266        1298
## 3 2014-12-07 17:08:12 17:38:27          1028        1058
## 4 2015-06-27 21:06:01 21:38:49          1266        1298
## 5 2015-12-07 17:08:12 17:38:27          1028        1058
## 6 2016-06-26 21:06:01 21:38:49          1266        1298

For a time in this range, like 6:30pm, part of the year it’s still light, and part of the year, it’s still dark. So we can compare the proportion of black drivers stopped at 6:30pm when it’s light (usually in the summer) to the proportion of black drivers stopped at 6:30pm when it’s dark (usually in the winter). This way, we’re really seeing the effect of darkness, since driving and deployment patterns tend to change only with clock time. For example, the demographics of commuters who are on the road at 6:30pm likely stays pretty steady throughout the year, compared to demographics of commuters at 5pm versus at 9pm.

One quick note: We want to make sure to filter out stops that occurred in the approximately 30-minute period between sunset and dusk (end of civil twilight), since it is ambiguous as to whether that period is light or dark, which would muddle up our analysis.
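The tables below compare simple proportions; the more formal version of this test (used in Stanford's published tutorial) fits a logistic regression of driver race on a darkness indicator while controlling flexibly for clock time. A minimal, self-contained sketch on simulated data (toy numbers, not our stops):

```r
library(splines)

set.seed(1)
n <- 2000
# Simulated stops inside the inter-twilight window (minutes 1028-1298)
toy <- data.frame(
  minute  = sample(1028:1298, n, replace = TRUE),
  is_dark = rbinom(n, 1, 0.5)
)
# Simulate a small positive darkness effect on the probability of a black driver
toy$is_black <- rbinom(n, 1, plogis(-0.2 + 0.1 * toy$is_dark))

# Natural spline on clock time absorbs time-of-day composition effects
mod <- glm(is_black ~ is_dark + ns(minute, df = 6),
           family = binomial, data = toy)

coef(mod)["is_dark"]  # a positive estimate: more black drivers stopped after dark
```

With the real data, `CBUS_VOD_stops` already carries the `is_black`, `is_dark`, and `minute` columns this model would need.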

Let’s join the sunset times to our stops data and flag whether each stop happened after dark.

CBUS_VOD_stops <- 
COLstops %>% 
left_join(
sunset_times,
by = "date"
) %>% 
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>% 
filter(
# Filter to get only the intertwilight period
#minute >= min_dusk_minute,
#minute <= max_dusk_minute,
# Remove ambigous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute))
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs

## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs

How many stops are in this new, filtered data frame?

CBUS_VOD_stops %>% nrow()
## [1] 92260

Filtering for just the timeframe needed.

CBUS_VOD_stops <- CBUS_VOD_stops %>% 
  filter(time > hm("17:08"), time < hm("21:06"))
StopsByYearInDarkness <- CBUS_VOD_stops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CBUS_VOD_stops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This adds the two above together

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  #View(TwilightTable)

With Hispanics, Asians, etc. grouped together

TwilightTable <- TwilightTable %>% 
  mutate(other = `other/unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`other/unknown`+`asian/pacific islander`) %>% 
  group_by(year) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
  datatable(TwilightTable)

Now just daylight and darkness

TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
  datatable(TwilightTableTogether)

Now we load COLstops again, and do the same analysis, but for highway stops.

## Load the data
# Replace the path below with the path to where your data lives
data_path <- "oh_columbus_without_duplicates.csv"
COLstops <- read_csv(data_path)
## Parsed with column specification:
## cols(
##   raw_row_number = col_character(),
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   location = col_character(),
##   lat = col_double(),
##   lng = col_double(),
##   precinct = col_double(),
##   zone = col_double(),
##   subject_race = col_character(),
##   subject_sex = col_character(),
##   type = col_character(),
##   arrest_made = col_logical(),
##   citation_issued = col_logical(),
##   warning_issued = col_logical(),
##   outcome = col_character(),
##   search_conducted = col_logical(),
##   reason_for_stop = col_character()
## )

Looking JUST at highway stops

library(dplyr)
COLstops <- COLstops %>% filter(str_detect(location,'70|71'))

Note: Columbus had 10,495 stops that could not be geocoded. These are removed because the rest of this analysis requires latitude and longitude. This could affect the outcome, though the number is comparatively small.

COLstops <- COLstops %>% 
  filter(!is.na(lat))

(Note that we distinguish here between sunset and dusk.)

# Get timezone for Columbus
COLtz <- lutz::tz_lookup_coords(COLcenter_lat, COLcenter_lng, warn = F)

# Helper function
time_to_minute <- function(time) {
hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}

# Compute sunset time for each date in our dataset
sunset_times <- COLstops %>%
  mutate(lat = COLcenter_lat,
         lon = COLcenter_lng) %>% 
  select(date, lat, lon) %>%
  distinct() %>%
  getSunlightTimes(data = ., 
                   keep = c("sunset", "dusk"), tz = COLtz) %>% 
  mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>% 
  mutate(
    sunset_minute = time_to_minute(sunset),
    dusk_minute = time_to_minute(dusk),
    date = ymd(str_sub(date, 1, 10))) %>% 
  select(date, sunset, dusk, ends_with("minute"))

Great. Now, using sunset_times, what is Columbus’ inter-twilight period?

sunset_times %>% 
filter(dusk == min(dusk) | dusk == max(dusk))
##         date   sunset     dusk sunset_minute dusk_minute
## 1 2012-06-26 21:06:01 21:38:49          1266        1298
## 2 2014-06-27 21:06:02 21:38:49          1266        1298
## 3 2014-12-07 17:08:12 17:38:27          1028        1058
## 4 2015-06-27 21:06:01 21:38:49          1266        1298
## 5 2015-12-07 17:08:12 17:38:27          1028        1058
## 6 2016-06-26 21:06:01 21:38:49          1266        1298

Merging veil of darkness stop times

CBUS_VOD_stops <- 
COLstops %>% 
left_join(
sunset_times,
by = "date"
) %>% 
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>% 
filter(
# Filter to get only the intertwilight period
#minute >= min_dusk_minute,
#minute <= max_dusk_minute,
# Remove ambigous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute))
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs

## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings
## failed to parse, or all strings are NAs

How many stops are in this new, filtered data frame?

CBUS_VOD_stops %>% nrow()
## [1] 21869

Filtering for just the timeframe needed.

CBUS_VOD_stops <- CBUS_VOD_stops %>% 
  filter(time > hm("17:08"), time < hm("21:06"))

Comparing stops in darkness versus daylight (the shift created by daylight saving time)

StopsByYearInDarkness <- CBUS_VOD_stops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CBUS_VOD_stops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This adds the two above together

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  #View(TwilightTable)

With Hispanics, Asians, etc. grouped together

TwilightTable <- TwilightTable %>% 
  mutate(other = `other/unknown`+`asian/pacific islander`, total_stopped = `black` +`white`+`other/unknown`+`asian/pacific islander`) %>% 
  group_by(year) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
  datatable(TwilightTable)

Now just daylight and darkness

TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
  datatable(TwilightTableTogether)

Point in Polygon analysis

Note: these column codes (B02001001 and so on) come from the ACS data dictionary.

#You'll have to set your working directory / file path here

CBUS_RaceInfoByLocation <- sf::st_read("Columbus Block Groups by Race- acs2017_5yr_B02001_15000US390490046104/acs2017_5yr_B02001_15000US390490046104.shp") %>% 
mutate(
  Total=B02001001, 
  White_Alone=B02001002,
  Black_Alone=B02001003,
  PctWhite = round((White_Alone/Total)*100, 1),
  PctBlack = round((Black_Alone/Total)*100, 1)) %>% 
slice(-n())
## Reading layer `acs2017_5yr_B02001_15000US390490046104' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/Columbus Block Groups by Race- acs2017_5yr_B02001_15000US390490046104/acs2017_5yr_B02001_15000US390490046104.shp' using driver `ESRI Shapefile'
## Simple feature collection with 700 features and 22 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -83.23604 ymin: 39.79894 xmax: -82.76946 ymax: 40.15735
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
#We need the street stops only here. Stopping someone on the highway is not really in a neighborhood
data_path <- "oh_columbus_without_duplicates.csv"
COLstops <- read_csv(data_path)
## Parsed with column specification:
## cols(
##   raw_row_number = col_character(),
##   date = col_date(format = ""),
##   time = col_time(format = ""),
##   location = col_character(),
##   lat = col_double(),
##   lng = col_double(),
##   precinct = col_double(),
##   zone = col_double(),
##   subject_race = col_character(),
##   subject_sex = col_character(),
##   type = col_character(),
##   arrest_made = col_logical(),
##   citation_issued = col_logical(),
##   warning_issued = col_logical(),
##   outcome = col_character(),
##   search_conducted = col_logical(),
##   reason_for_stop = col_character()
## )
COLstops <- COLstops %>% filter(!str_detect(location,'70|71'))



# Check and eliminate the rows that don't have location information
COLstops <- COLstops %>% 
  filter(!is.na(lat))

Next, we give every point in the stops list a corresponding census block group.
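The point-in-polygon step that follows (an `st_join` of stops onto block-group polygons) can be sketched with sf on toy shapes: a point inside a polygon picks up that polygon's attributes, and a point outside gets NA. (The GEOID here is made up for illustration.)

```r
library(sf)

# One toy block-group polygon: the unit square
poly <- st_sf(
  GEOID = "390490001001",  # a made-up block-group id
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))), crs = 4326)
)

# Two toy stops: one inside the square, one outside
pts <- st_sf(
  id = 1:2,
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2)), crs = 4326)
)

st_join(pts, poly)  # first row gets the GEOID; the second gets NA
```

Joining polygons onto points (as here) keeps one row per stop; the code below joins the other direction, points onto polygons, which keeps every block group.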

Now, we can summarize the data by count and merge it back to the shape file and visualize it.

#Getting Franklin County block groups (Columbus is mostly within Franklin County)
library(tigris)
options(tigris_class = "sf")
FranklinBlockGroups <- block_groups("OH",county = "Franklin")

COLstops_spatial <- COLstops %>% 
  st_as_sf(coords=c("lng", "lat"), crs="+proj=longlat") %>% 
  st_transform(crs=st_crs(FranklinBlockGroups))



#Putting them all together
COLStopsWithTracts <- st_join(FranklinBlockGroups, COLstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#plotting them
ggplot(FranklinBlockGroups) + 
  geom_sf(color="black") +
  geom_point(data=COLstops, aes(x=lng, y=lat), color="red", alpha=0.05) +
  theme_void() +
  theme(panel.grid.major = element_line(colour = 'transparent')) +
  labs(title="Traffic stops in Franklin")

#Vlookup
COLStopsWithTracts$GEOID <- paste0("15000US",COLStopsWithTracts$GEOID)

This puts the stops and race percentage of each block together.

COLstops_With_Race_and_Tracts <-  COLStopsWithTracts

COLstops_With_Race_and_Tracts$PctWhite = CBUS_RaceInfoByLocation$PctWhite[match(COLStopsWithTracts$GEOID, CBUS_RaceInfoByLocation$geoid)]

COLstops_With_Race_and_Tracts$PctBlack = CBUS_RaceInfoByLocation$PctBlack[match(COLStopsWithTracts$GEOID, CBUS_RaceInfoByLocation$geoid)]

COLstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CBUS_RaceInfoByLocation$Total[match(COLStopsWithTracts$GEOID, CBUS_RaceInfoByLocation$geoid)]
COLstops_With_Race_and_Tracts <- COLstops_With_Race_and_Tracts %>% 
  filter(!is.na(PctWhite), !is.na(PctBlack))

Mostly_White_CBUS <- COLstops_With_Race_and_Tracts %>% 
filter(PctWhite>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))



Mostly_Black_CBUS <- COLstops_With_Race_and_Tracts %>% 
filter(PctBlack>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))


mean(Mostly_White_CBUS$Per_Capita)
## [1] 0.1105331
mean(Mostly_Black_CBUS$Per_Capita)
## [1] 0.1733853

This looks at the rate in ALL predominantly black areas and all predominantly white areas together.

And figures out the percentage more in black areas.

Total_People_In_Mostly_White_CBUS_Blocks <- sum(Mostly_White_CBUS$Total_In_Block)
Total_People_In_Mostly_Black_CBUS_Blocks <- sum(Mostly_Black_CBUS$Total_In_Block)

Total_Stops_In_Mostly_White_CBUS_Blocks <-sum(Mostly_White_CBUS$Total)
Total_Stops_In_Mostly_Black_CBUS_Blocks <-sum(Mostly_Black_CBUS$Total)


Total_Stops_In_Mostly_White_CBUS_Blocks/Total_People_In_Mostly_White_CBUS_Blocks
## [1] 0.07372752
Total_Stops_In_Mostly_Black_CBUS_Blocks/Total_People_In_Mostly_Black_CBUS_Blocks
## [1] 0.1356254
CBUS_white_rate <- Total_Stops_In_Mostly_White_CBUS_Blocks/Total_People_In_Mostly_White_CBUS_Blocks
CBUS_black_rate <- Total_Stops_In_Mostly_Black_CBUS_Blocks/Total_People_In_Mostly_Black_CBUS_Blocks

(CBUS_black_rate-CBUS_white_rate)/CBUS_white_rate
## [1] 0.8395498

For the map, we want to look at each tract, and see the percentage stopped for black, white, and other. (And then Tableau has built-in race by block shading.)

COL_Percent_Pulled_over_By_Block <- COLstops_With_Race_and_Tracts %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))

COL_Percent_Pulled_over_By_Block <- st_set_geometry(COL_Percent_Pulled_over_By_Block, NULL)
#Uncomment this to export it
#rio::export(COL_Percent_Pulled_over_By_Block, "Columbus number of stops per census block.csv")

Cleveland

Note: We filed four separate open records requests with the Cleveland Police from March through September 2019, for 2015, 2016, 2017, and 2018 data. They kept giving us part of 2015 (mostly January), part of 2018 (mostly December), and two versions of 2016 (though one was labeled 2017). After driving from Columbus to Cleveland and asking again, they finally gave us 2017 as well. In PDF form. So this is not part of the Stanford data; this is data that was cleaned and geocoded by Eye on Ohio thanks to Google’s Maps API credits. You can check out the original PDFs here: [2017 stops](https://eyeonohio.com/wp-content/uploads/2019/11/Cleveland-Police-Stops.pdf) and [2016 stops](https://eyeonohio.com/wp-content/uploads/2019/11/2016_UTT.pdf).

## Load the data

CLEstops <- rio::import("CLE_stops_cleaned_and_geocoded.csv")

# Additional data and fixed values we'll be using
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
num_people = c(7930, 193817, 43466, 11508, 131338) #This is where I got this data:https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF 
#note "white" means 'non-hispanic white'
) %>% 
mutate(subject_race = as.factor(subject_race))

#Have to add in the latitude and longitude of wherever you are (this is slightly different for Columbus)
center_lat <- 41.4993
center_lng <- -81.6944

Covering the basics

Note the original data set has more rows, because each offense is a separate row. So someone getting two tickets would show up twice. And we didn’t want that.

colnames(CLEstops)
## [1] "Ticket_Date"   "Ticket_Time"   "TicketAddress" "DOBTicketed"  
## [5] "Arrest"        "Race"          "Sex"           "lat"          
## [9] "lon"

Looking for duplicates

CLE_duplicates <- CLEstops %>% 
  group_by(Ticket_Date, Ticket_Time, DOBTicketed) %>% 
  summarize (Total=n())
  head(CLE_duplicates)
## # A tibble: 6 x 4
## # Groups:   Ticket_Date, Ticket_Time [5]
##   Ticket_Date Ticket_Time DOBTicketed Total
##   <chr>       <chr>       <chr>       <int>
## 1 1/1/16      1:20        Aug 30 1978     2
## 2 1/1/16      7:46        Jan 4 1994      2
## 3 1/1/16      7:46        Jul 13 1989     2
## 4 1/1/16      9:32        Jan 1 1997      2
## 5 1/1/17      0:10        Apr 9 1993      2
## 6 1/1/17      0:34        Jun 20 1990     2

Chances are, two people with the same birth date probably did not get a ticket at the same time on the same date. It looks like there are a lot of duplicates here, so we will keep only the unique stops.

CLEstops <- dplyr::distinct(CLEstops)
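A quick self-contained illustration of what `distinct()` drops (toy rows, not the real `CLEstops`; with the real data, compare `nrow()` before and after the call above):

```r
library(dplyr)

# Two identical rows plus one unique row
toy <- tibble(
  Ticket_Date = c("1/1/16", "1/1/16", "1/2/16"),
  Ticket_Time = c("1:20",  "1:20",  "9:00"),
  DOBTicketed = c("Aug 30 1978", "Aug 30 1978", "Jan 4 1994")
)

nrow(toy) - nrow(distinct(toy))  # number of exact-duplicate rows removed
## [1] 1
```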

Converting the date and time columns to date and time formats

#CLEstops <- filter(CLEstops, !is.na(Ticket_Date))
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)

CLEstops <- CLEstops %>% 
  mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)

max(CLEstops$date)
## [1] "2017-12-30"
min(CLEstops$date)
## [1] "2016-01-01"

How does Cleveland count race?

CLEstops %>% 
count(subject_race)
## # A tibble: 9 x 2
##   subject_race     n
##   <chr>        <int>
## 1 America!         3
## 2 Americar        66
## 3 Arab           200
## 4 Asian          171
## 5 Black        27860
## 6 Clevelam         1
## 7 Native H        26
## 8 Unknown       1809
## 9 White        16943

There’s some ambiguity in this data. Like, are Hispanics tagged as white? Does “Native H” mean Native American? “America!”… what does that mean? We have sent a request to the police about this.

You might want to do options(scipen=999) so that you don’t get scientific notation here.

# This method builds off of using `count` as above
CLEstops %>% 
count(subject_race) %>% 
mutate(prop = n / sum(n))
## # A tibble: 9 x 3
##   subject_race     n      prop
##   <chr>        <int>     <dbl>
## 1 America!         3 0.0000637
## 2 Americar        66 0.00140  
## 3 Arab           200 0.00425  
## 4 Asian          171 0.00363  
## 5 Black        27860 0.592    
## 6 Clevelam         1 0.0000212
## 7 Native H        26 0.000552 
## 8 Unknown       1809 0.0384   
## 9 White        16943 0.360
# This method uses the group_by/summarize paradigm
CLEstops %>% 
group_by(subject_race) %>% 
summarize(
n = n(),
prop = n / nrow(.)
)
## # A tibble: 9 x 3
##   subject_race     n      prop
##   <chr>        <int>     <dbl>
## 1 America!         3 0.0000637
## 2 Americar        66 0.00140  
## 3 Arab           200 0.00425  
## 4 Asian          171 0.00363  
## 5 Black        27860 0.592    
## 6 Clevelam         1 0.0000212
## 7 Native H        26 0.000552 
## 8 Unknown       1809 0.0384   
## 9 White        16943 0.360

At first glance, we see there are more stops of black drivers than any other race.

How about counting how many stops by year and race?

CLEstops %>% 
  group_by(year(date)) %>% 
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
## # A tibble: 16 x 4
## # Groups:   year(date) [2]
##    `year(date)`  year subject_race     n
##           <dbl> <dbl> <chr>        <int>
##  1         2016  2016 America!         3
##  2         2016  2016 Americar        43
##  3         2016  2016 Arab           105
##  4         2016  2016 Asian           97
##  5         2016  2016 Black        14066
##  6         2016  2016 Native H         9
##  7         2016  2016 Unknown        903
##  8         2016  2016 White         8822
##  9         2017  2017 Americar        23
## 10         2017  2017 Arab            95
## 11         2017  2017 Asian           74
## 12         2017  2017 Black        13794
## 13         2017  2017 Clevelam         1
## 14         2017  2017 Native H        17
## 15         2017  2017 Unknown        906
## 16         2017  2017 White         8121

A simple visualization of this.

CLEstops %>% 
count(month = month(date), subject_race) %>% 
ggplot(aes(x = month, y = n, color = subject_race)) +
geom_point() +
geom_line() 

Looks pretty consistent month to month.

Note: For the purposes of our analysis, we combine “Native H”, “America!”, “Americar”, “Arab”, and “Unknown” into “other/unknown”, and recode the lone “Clevelam” entry as “hispanic”.

  CLEstops$subject_race <-recode(CLEstops$subject_race, "Native H"="other/unknown", 
                                "Clevelam"="hispanic", 
                                "America!"="other/unknown",
                                "Americar"="other/unknown", 
                                "Arab"="other/unknown", 
                                "Unknown"="other/unknown",
                               "Asian"="asian/pacific islander",
                                "Black"="black",
                                "White"="white")

This adds each group’s proportion of the population.

population_2017 <- population_2017 %>% 
mutate(Proportion_Of_Population = num_people / sum(num_people)) #I added this to the population data frame

This creates our stop rate data table.

StopRateDataTable <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>% 
left_join(
population_2017,
by = "subject_race"
) %>% 
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)

So it looks like black residents are about 50 percent of the population, though they are 60 percent of the stops. There’s a big discrepancy here though: Hispanics are 11 percent of the population, but no Cleveland stop is labeled “Hispanic.” (We recoded one entry just so the category shows up in this table.) This isn’t inaccurate: legally, Hispanic is an “ethnicity” and not a “race.” Though 200 stops are listed as “Arab,” which isn’t a race either.

population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black","other/unknown","white"
),
num_people = c(8026, 196006, 30062, 154718) # From the 2017 American Community Survey for Cleveland. "Other/unknown" is: 12,479 for 'Some Other Race', 60 for Native Hawaiian, 1,931 for American Indian/Alaska Native, and 15,592 for 'Two or More Races.' Food for thought: should the 6,919 people who identified as both black and white be counted as African American?

) %>% 
mutate(subject_race = as.factor(subject_race))


population_2017 <- population_2017 %>% 
mutate(Proportion_Of_Population = num_people / sum(num_people))


CLEstops$subject_race <-recode(CLEstops$subject_race, "hispanic"="other/unknown")


StopRateDataTable_No_Hispanic <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>% 
left_join(
population_2017,
by = "subject_race"
) %>% 
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable_No_Hispanic)

Cleveland Disparity Index

white <- .36/.4 # white share of stops divided by white share of population
black <- .59/.5 # black share of stops divided by black share of population

black/white
## [1] 1.311111
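The disparity index divides each group’s share of stops by its share of the population, then compares the two ratios. A minimal sketch of that arithmetic, using the rounded shares hardcoded above:

```r
# Disparity index: share of stops divided by share of population
disparity <- function(stop_share, pop_share) stop_share / pop_share

white_index <- disparity(0.36, 0.40)  # 0.9
black_index <- disparity(0.59, 0.50)  # 1.18

black_index / white_index             # ~1.31
```

A ratio of about 1.31 means black drivers are stopped roughly 31 percent more often than white drivers, relative to each group’s share of the population.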

Veil of Darkness Cleveland

Sunset and dusk times for Cleveland.

(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark.)

# Get timezone for Cleveland
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)

# Helper function
time_to_minute <- function(time) {
  hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}

# Compute sunset time for each date in our dataset
sunset_times <- CLEstops %>%
  mutate(
    lat = center_lat,
    lon = center_lng
    ) %>% 
  select(date, lat, lon) %>%
  distinct() %>%
  getSunlightTimes(
    data = ., 
    keep = c("sunset", "dusk"), 
    tz = tz
  ) %>% 
  mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>% 
  mutate(
    sunset_minute = time_to_minute(sunset),
    dusk_minute = time_to_minute(dusk),
    date = ymd(str_sub(date, 1, 10))
  ) %>% 
  select(date, sunset, dusk, ends_with("minute"))
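A quick sanity check of the helper: dusk at 17:29:29 should map to minute 1,049 of the day. A sketch, assuming lubridate is loaded:

```r
library(lubridate)

# Same helper as above: converts an "HH:MM:SS" string to minutes past midnight
time_to_minute <- function(time) {
  hour(hms(time)) * 60 + minute(hms(time))
}

time_to_minute("17:29:29")  # 17 * 60 + 29 = 1049, matching dusk_minute in the table below
```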

Great. Now, using sunset_times, what is Cleveland’s inter-twilight period?

sunset_times %>% 
filter(dusk == min(dusk) | dusk == max(dusk))
##         date   sunset     dusk sunset_minute dusk_minute
## 1 2016-12-08 16:58:18 17:29:29          1018        1049
## 2 2017-12-08 16:58:18 17:29:29          1018        1049
## 3 2017-12-07 16:58:20 17:29:29          1018        1049
## 4 2016-06-26 21:05:57 21:39:57          1265        1299
## 5 2016-12-07 16:58:19 17:29:29          1018        1049
## 6 2017-06-26 21:05:56 21:39:57          1265        1299
datatable(.Last.value)

Joining sunset times to our data.

CLE_VODstops <- CLEstops %>% 
  left_join(
    sunset_times,
    by = "date"
    ) %>% 
  mutate(
    minute = time_to_minute(time),
    minutes_after_dark = minute - dusk_minute,
    is_dark = minute > dusk_minute,
    min_dusk_minute = min(dusk_minute),
    max_dusk_minute = max(dusk_minute),
    is_black = subject_race == "black"
    ) %>% 
  filter(
    # Filter to get only the inter-twilight period
    minute >= min_dusk_minute,
    minute <= max_dusk_minute,
    # Remove the ambiguous period between sunset and dusk (twilight)
    !(minute > sunset_minute & minute < dusk_minute)
  )
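The comparison that follows looks at daylight versus darkness percentages directly. For reference, Stanford’s published veil-of-darkness test instead fits a logistic regression of driver race on darkness, controlling flexibly for clock time with a spline. A sketch of that model (not part of our published numbers), assuming CLE_VODstops is built as above:

```r
library(splines)

# Veil-of-darkness test, Stanford Open Policing Project style:
# does darkness predict whether the stopped driver is black,
# after controlling for time of day?
mod <- glm(
  is_black ~ is_dark + ns(minute, df = 6),
  family = binomial,
  data = CLE_VODstops
)

# A positive, significant is_dark coefficient would suggest
# black drivers make up a larger share of stops after dark
summary(mod)$coefficients["is_darkTRUE", ]
```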

Filtering for just the timeframe needed.

CLE_VODstops <- CLE_VODstops %>% 
  filter(time > hm("16:58"), time < hm("21:39"))

Looking JUST at highways

library(dplyr)
CLE_VODstops <- CLE_VODstops %>% 
  filter(str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")
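The highway filter above uses str_detect over an alternation of plain route fragments. One caveat worth knowing: str_detect matches anywhere in the address string, so a fragment like '90 E' can also match inside an ordinary street number. A quick sketch, assuming stringr is loaded:

```r
library(stringr)

highway_pattern <- '90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'

str_detect("I-90 W AT DEAD MANS CURVE", highway_pattern)  # TRUE: a highway stop
str_detect("1090 EAST 55TH ST", highway_pattern)          # also TRUE: "90 E" matches inside the street number
```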

This combines the darkness and daylight tables.

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  datatable(TwilightTable)

Cleveland data has no stops labeled Hispanic.

TwilightTable <- TwilightTable %>% 
  mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>% 
  group_by(sunshine) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
  #select(black, white, other, year) %>% 
  datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
  datatable(TwilightTableTogether)

Looking JUST at street stops. A stop on a highway really isn’t “in” a neighborhood.

CLE_VODstops <- CLEstops %>% 
  left_join(
    sunset_times,
    by = "date"
    ) %>% 
  mutate(
    minute = time_to_minute(time),
    minutes_after_dark = minute - dusk_minute,
    is_dark = minute > dusk_minute,
    min_dusk_minute = min(dusk_minute),
    max_dusk_minute = max(dusk_minute),
    is_black = subject_race == "black"
    ) %>% 
  filter(
    # Filter to get only the inter-twilight period
    minute >= min_dusk_minute,
    minute <= max_dusk_minute,
    # Remove the ambiguous period between sunset and dusk (twilight)
    !(minute > sunset_minute & minute < dusk_minute)
  )

Taking out the highways to look at just street stops.

CLE_VODstops <- CLE_VODstops %>% 
  filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))

Filtering for just the timeframe needed.

CLE_VODstops <- CLE_VODstops %>% 
  filter(time > hm("16:58"), time < hm("21:39"))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This combines the darkness and daylight tables.

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  datatable(TwilightTable)

Cleveland data has no stops labeled Hispanic.

TwilightTable <- TwilightTable %>% 
  mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>% 
  group_by(sunshine) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
  #select(black, white, other, year) %>% 
  datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
  datatable(TwilightTableTogether)

Looking at Arrests

Looking at Arrest Rates

ArrestsByRace <- CLEstops %>% 
  group_by(subject_race, Arrest) %>% 
  summarize(Total =n()) %>% 
  group_by(subject_race) %>% 
  mutate(pct=Total/sum(Total))
  datatable(ArrestsByRace)
ArrestsRateByRace <- na.omit(ArrestsByRace) 
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=Arrest))+
geom_col(position="dodge")

#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())
Arrest_Percentage_ByRace <- CLEstops %>%
  filter(Arrest=="Yes") %>%
  group_by(subject_race) %>% 
  summarize(Total=n()) %>% 
  mutate(pct=Total/sum(Total))
  datatable(Arrest_Percentage_ByRace)

Note: Cleveland doesn’t have search data.

Point in Polygon Analysis

Shapefiles from https://censusreporter.org/data/table/?table=B02001&geo_ids=16000US3916000,05000US39035,31000US17460,04000US39,01000US,150|16000US3916000,140|16000US3916000,150|16000US3916000&primary_geo_id=16000US3916000

Note these codes are from the [data dictionary](https://www.socialexplorer.com/data/ACS2016_5yr/metadata/?ds=ACS16_5yr&table=B02001).

#You'll have to set your file pane location here
setwd("~/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011")
CLE_RaceInfoByLocation <- sf::st_read("acs2017_5yr_B02001_15000US390351093011.shp")
## Reading layer `acs2017_5yr_B02001_15000US390351093011' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011/acs2017_5yr_B02001_15000US390351093011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 467 features and 22 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.2302 ymin: -14.60134 xmax: 179.8597 ymax: 71.43979
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
CLE_RaceInfoByLocation <- CLE_RaceInfoByLocation %>% 
  mutate(
  Total=B02001001, 
  White_Alone=B02001002,
  Black_Alone=B02001003,
  PctWhite = round((White_Alone/Total)*100, 1),
  PctBlack = round((Black_Alone/Total)*100, 1)) %>% 
slice(-n())
# We only need the columns with the latitude and longitude
#We need the street stops only here. Stopping someone on the highway is not really in a neighborhood
CLEstops <- rio::import("CLE_stops_cleaned_and_geocoded.csv")
CLEstops <- dplyr::distinct(CLEstops)
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)
CLEstops <- CLEstops %>% 
  mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)

# Check and eliminate the rows that are on a highway, which really isn't "in" a neighborhood
CLEstops <- CLEstops %>% 
  filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
#Getting Cleveland block groups 
library(tigris)
options(tigris_class = "sf")
CLEBlockGroups <- block_groups("OH",county = c("Cuyahoga","Lorain","Lake","Medina"))

CLEstops_spatial <- CLEstops %>% 
  st_as_sf(coords=c("lon", "lat"), crs="+proj=longlat") %>% 
  st_transform(crs=st_crs(CLEBlockGroups))



#Putting them altogether
CLEStopsWithTracts <- st_join(CLEBlockGroups, CLEstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#plotting them
ggplot(CLEBlockGroups) + 
  geom_sf(color="black") +
  geom_point(data=CLEstops, aes(x=lon, y=lat), color="red", alpha=0.05) +
  theme_void() +
  theme(panel.grid.major = element_line(colour = 'transparent')) +
  labs(title="Traffic stops in Cleveland")

#Vlookup
CLEStopsWithTracts$GEOID <- paste0("15000US",CLEStopsWithTracts$GEOID)
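The st_join call above does the actual point-in-polygon work: each stop point picks up the attributes of the block group polygon it falls inside. A toy sketch of the behavior, assuming sf is loaded:

```r
# One unit-square polygon with an id, plus one point inside it and one outside
square <- sf::st_sf(
  id = 1,
  geometry = sf::st_sfc(sf::st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))))
)
pts <- sf::st_sf(geometry = sf::st_sfc(
  sf::st_point(c(0.5, 0.5)),  # inside the square
  sf::st_point(c(2, 2))       # outside the square
))

sf::st_join(pts, square)  # inside point gets id = 1, outside point gets id = NA
```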

This puts the stops and race percentage of each block together.

CLEstops_With_Race_and_Tracts <-  CLEStopsWithTracts

CLEstops_With_Race_and_Tracts$PctWhite = CLE_RaceInfoByLocation$PctWhite[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]

CLEstops_With_Race_and_Tracts$PctBlack = CLE_RaceInfoByLocation$PctBlack[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]

CLEstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CLE_RaceInfoByLocation$Total[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
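The three match() calls above are a spreadsheet-style VLOOKUP. The same lookup can be written as a single dplyr join; a sketch, under the assumption that CLE_RaceInfoByLocation's geometry is dropped first so it behaves as a plain data frame:

```r
# Join-based alternative to the match() lookups above
race_lookup <- CLE_RaceInfoByLocation %>%
  sf::st_set_geometry(NULL) %>%   # drop geometry to get a plain data frame
  dplyr::select(geoid, PctWhite, PctBlack,
                Total_People_Per_Census_Block = Total)

CLEstops_With_Race_and_Tracts <- CLEStopsWithTracts %>%
  dplyr::left_join(race_lookup, by = c("GEOID" = "geoid"))
```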

This looks at the average for each group

CLEstops_With_Race_and_Tracts <- CLEstops_With_Race_and_Tracts %>% 
  filter(!is.na(PctWhite), !is.na(PctBlack))

Mostly_White_CLE <- CLEstops_With_Race_and_Tracts %>% 
filter(PctWhite>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))



Mostly_Black_CLE <- CLEstops_With_Race_and_Tracts %>% 
filter(PctBlack>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))


mean(Mostly_White_CLE$Per_Capita)
## [1] 0.08560138
mean(Mostly_Black_CLE$Per_Capita)
## [1] 0.120398

This looks at the rate in ALL predominantly black areas and all predominantly white areas together, and computes how much higher the stop rate is in the black areas.

Total_People_In_Mostly_White_CLE_Blocks <- sum(Mostly_White_CLE$Total_In_Block)
Total_People_In_Mostly_Black_CLE_Blocks <- sum(Mostly_Black_CLE$Total_In_Block)

Total_Stops_In_Mostly_White_CLE_Blocks <-sum(Mostly_White_CLE$Total)
Total_Stops_In_Mostly_Black_CLE_Blocks <-sum(Mostly_Black_CLE$Total)


Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
## [1] 0.07747141
Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks
## [1] 0.09729283
CLE_white_rate <- Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
CLE_black_rate <- Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks

(CLE_black_rate-CLE_white_rate)/CLE_white_rate
## [1] 0.2558545

For the map, we want to look at each tract, and see the percentage stopped for black, white, and other. (And then Tableau has built-in race by block shading.)

CLE_Percent_Pulled_over_By_Block <- CLEstops_With_Race_and_Tracts %>%
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))

CLE_Percent_Pulled_over_By_Block <- st_set_geometry(CLE_Percent_Pulled_over_By_Block, NULL)
#Uncomment this to export it
#rio::export(CLE_Percent_Pulled_over_By_Block, "Cleveland number of stops per census block.csv")

Looking just at Cleveland Speeders

## Load the data

CLE_speeders <- rio::import("Cleveland just speeders geocoded.xlsx")
CLEstops <-CLE_speeders

# Additional data and fixed values we'll be using
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
num_people = c(7930, 193817, 43466, 11508, 131338) #This is where I got this data:https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF 
#note "white" means 'non-hispanic white'
) %>% 
mutate(subject_race = as.factor(subject_race))

#Have to add in the latitude and longitude of wherever you are (this is slightly different for Columbus)
center_lat <- 41.4993
center_lng <- -81.6944
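These center coordinates are also what lutz uses later to resolve the local timezone. A quick sketch:

```r
# Resolve the IANA timezone for downtown Cleveland
lutz::tz_lookup_coords(41.4993, -81.6944, warn = FALSE)  # "America/New_York"
```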

Covering the basics

Note the original data set has more rows, because each offense is a separate row. So someone getting two tickets would show up twice. And we didn’t want that.

colnames(CLEstops)
##  [1] "TicketType"     "Ticket_Date"    "TicketDatetime" "Ticket_Time"   
##  [5] "TicketAddress"  "Full_address"   "DOBTicketed"    "Arrest"        
##  [9] "Race"           "Sex"            "Statute"        "Speeding?"     
## [13] "lon"            "lat"

Looking for duplicates

CLE_duplicates <- CLEstops %>% 
  group_by(Ticket_Date, Ticket_Time, DOBTicketed) %>% 
  summarize (Total=n())
  head(CLE_duplicates)
## # A tibble: 6 x 4
## # Groups:   Ticket_Date, Ticket_Time [6]
##   Ticket_Date Ticket_Time DOBTicketed          Total
##   <chr>       <chr>       <chr>                <int>
## 1 1/1/16      7:46        Jul 13 1989              1
## 2 1/1/17      0:43        Aug 20 1955 12:00 AM     1
## 3 1/1/17      10:25       Sep 26 1981              1
## 4 1/1/17      10:35       Oct 17 1967 12:00 AM     1
## 5 1/1/17      15:25       Feb 14 1966 12:00 AM     1
## 6 1/1/17      15:34       Dec 22 1981 12:00 AM     1

18 people got two speeding tickets each, or they’re twins. Or this is miscoded. Either way, it won’t meaningfully affect ~15,000 stops.

CLEstops <- dplyr::distinct(CLEstops)

Changing data and time to date and time formats

#CLEstops <- filter(CLEstops, !is.na(Ticket_Date))
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)

CLEstops <- CLEstops %>% 
  mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)

max(CLEstops$date)
## [1] "2017-12-30"
min(CLEstops$date)
## [1] "2016-01-01"

How does Cleveland count race?

CLEstops %>% 
count(subject_race)
## # A tibble: 9 x 2
##   subject_race     n
##   <chr>        <int>
## 1 American        23
## 2 Americar        11
## 3 Arab           108
## 4 Asian           76
## 5 Black         7854
## 6 Native H         3
## 7 Native Ha        3
## 8 Unknown        624
## 9 White         6486

There’s some ambiguity in this data. Like, are Hispanics tagged as white? Does “Native H” mean Native American? “America!”… what does that mean? We have sent a request to the police about this.

You might want to run `options(scipen = 999)` so that you don’t get scientific notation here.

# This method builds off of using `count` as above
CLEstops %>% 
count(subject_race) %>% 
mutate(prop = n / sum(n))
## # A tibble: 9 x 3
##   subject_race     n     prop
##   <chr>        <int>    <dbl>
## 1 American        23 0.00151 
## 2 Americar        11 0.000724
## 3 Arab           108 0.00711 
## 4 Asian           76 0.00500 
## 5 Black         7854 0.517   
## 6 Native H         3 0.000198
## 7 Native Ha        3 0.000198
## 8 Unknown        624 0.0411  
## 9 White         6486 0.427
# This method uses the group_by/summarize paradigm
CLEstops %>% 
group_by(subject_race) %>% 
summarize(
n = n(),
prop = n / nrow(.)
)
## # A tibble: 9 x 3
##   subject_race     n     prop
##   <chr>        <int>    <dbl>
## 1 American        23 0.00151 
## 2 Americar        11 0.000724
## 3 Arab           108 0.00711 
## 4 Asian           76 0.00500 
## 5 Black         7854 0.517   
## 6 Native H         3 0.000198
## 7 Native Ha        3 0.000198
## 8 Unknown        624 0.0411  
## 9 White         6486 0.427

At first glance, we see there are more stops of black drivers than any other race.

How about counting how many stops by year and race?

CLEstops %>% 
  group_by(year(date)) %>% 
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
## # A tibble: 18 x 4
## # Groups:   year(date) [2]
##    `year(date)`  year subject_race     n
##           <dbl> <dbl> <chr>        <int>
##  1         2016  2016 American        13
##  2         2016  2016 Americar         6
##  3         2016  2016 Arab            61
##  4         2016  2016 Asian           40
##  5         2016  2016 Black         4269
##  6         2016  2016 Native H         1
##  7         2016  2016 Native Ha        2
##  8         2016  2016 Unknown        350
##  9         2016  2016 White         3426
## 10         2017  2017 American        10
## 11         2017  2017 Americar         5
## 12         2017  2017 Arab            47
## 13         2017  2017 Asian           36
## 14         2017  2017 Black         3585
## 15         2017  2017 Native H         2
## 16         2017  2017 Native Ha        1
## 17         2017  2017 Unknown        274
## 18         2017  2017 White         3060

A simple visualization of this.

CLEstops %>% 
count(month = month(date), subject_race) %>% 
ggplot(aes(x = month, y = n, color = subject_race)) +
geom_point() +
geom_line() 

Looks pretty consistent month to month.

Note: For the purposes of our analysis, we are going to combine “Native H”, “Native Ha”, “American”, “America!”, “Americar”, “Arab”, and “Unknown” into “other/unknown” (and recode the garbled “Clevelam” as “hispanic”).

  CLEstops$subject_race <-recode(CLEstops$subject_race, "Native H"="other/unknown", 
                                "Clevelam"="hispanic", 
                                "America!"="other/unknown",
                                "Americar"="other/unknown",
                                "American"="other/unknown",
                                "Native Ha"="other/unknown",
                                "Arab"="other/unknown", 
                                "Unknown"="other/unknown",
                               "Asian"="asian/pacific islander",
                                "Black"="black",
                                "White"="white")

This computes each group’s share of the population.

population_2017 <- population_2017 %>% 
mutate(Proportion_Of_Population = num_people / sum(num_people)) # Adds each group's share to the population data frame

This creates our stop rate data table.

StopRateDataTable <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>% 
left_join(
population_2017,
by = "subject_race"
) %>% 
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable)

So Black residents are about 50 percent of the population but account for roughly 60 percent of the stops. There’s a big discrepancy here, though: Hispanics are 11 percent of the population, but no Cleveland stop is labeled “Hispanic.” (We put in one just so that it actually showed up in this table.) This isn’t inaccurate: legally, Hispanic is an “ethnicity” and not a “race.” Though about 200 stops are listed as “Arab,” which isn’t a race either. Again, we will have to check with the police department on that.

population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black","other/unknown","white"
),
num_people = c(8026, 196006, 30062, 154718) # From the 2017 American Community Survey for Cleveland. "Other/unknown" is: 12,479 for 'Some Other Race', 60 for Native Hawaiian, 1,931 for American Indian/Alaska Native, and 15,592 for 'Two or More Races.' Food for thought: should the 6,919 people who identified as both black and white be counted as African American?

) %>% 
mutate(subject_race = as.factor(subject_race))


population_2017 <- population_2017 %>% 
mutate(Proportion_Of_Population = num_people / sum(num_people))


CLEstops$subject_race <-recode(CLEstops$subject_race, "hispanic"="other/unknown")


StopRateDataTable_No_Hispanic <- CLEstops %>% #I added this table to compare the stop rate versus the population rate
count(subject_race) %>% 
left_join(
population_2017,
by = "subject_race"
) %>% 
mutate(stop_rate = n / sum(n))
## Warning: Column `subject_race` joining character vector and factor,
## coercing into character vector
datatable(StopRateDataTable_No_Hispanic)

Sunset and dusk times for Cleveland.

(Note that we distinguish here between sunset and dusk. Sunset is the point in time where the sun dips below the horizon. Dusk, also known as the end of civil twilight, is the time when the sun is six degrees below the horizon, and it is widely considered to be dark.)

# Get timezone for Cleveland
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)

# Helper function
time_to_minute <- function(time) {
  hour(lubridate::hms(time)) * 60 + minute(lubridate::hms(time))
}

# Compute sunset time for each date in our dataset
sunset_times <- CLEstops %>%
  mutate(
    lat = center_lat,
    lon = center_lng
    ) %>% 
  select(date, lat, lon) %>%
  distinct() %>%
  getSunlightTimes(
    data = ., 
    keep = c("sunset", "dusk"), 
    tz = tz
  ) %>% 
  mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>% 
  mutate(
    sunset_minute = time_to_minute(sunset),
    dusk_minute = time_to_minute(dusk),
    date = ymd(str_sub(date, 1, 10))
  ) %>% 
  select(date, sunset, dusk, ends_with("minute"))

Great. Now, using sunset_times, what is Cleveland’s inter-twilight period?

sunset_times %>% 
filter(dusk == min(dusk) | dusk == max(dusk))
##         date   sunset     dusk sunset_minute dusk_minute
## 1 2016-06-26 21:05:57 21:39:57          1265        1299
## 2 2016-12-07 16:58:19 17:29:29          1018        1049
## 3 2017-12-08 16:58:18 17:29:29          1018        1049
## 4 2017-06-26 21:05:56 21:39:57          1265        1299
## 5 2016-12-08 16:58:18 17:29:29          1018        1049
## 6 2017-12-07 16:58:20 17:29:29          1018        1049
datatable(.Last.value)

Joining sunset times to our data.

CLE_VODstops <- CLEstops %>% 
  left_join(
    sunset_times,
    by = "date"
    ) %>% 
  mutate(
    minute = time_to_minute(time),
    minutes_after_dark = minute - dusk_minute,
    is_dark = minute > dusk_minute,
    min_dusk_minute = min(dusk_minute),
    max_dusk_minute = max(dusk_minute),
    is_black = subject_race == "black"
    ) %>% 
  filter(
    # Filter to get only the inter-twilight period
    minute >= min_dusk_minute,
    minute <= max_dusk_minute,
    # Remove the ambiguous period between sunset and dusk (twilight)
    !(minute > sunset_minute & minute < dusk_minute)
  )

Filtering for just the timeframe needed.

CLE_VODstops <- CLE_VODstops %>% 
  filter(time > hm("16:58"), time < hm("21:39"))

Looking JUST at highways

library(dplyr)
CLE_VODstops <- CLE_VODstops %>% 
  filter(str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This combines the darkness and daylight tables.

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  datatable(TwilightTable)

Cleveland data has no stops labeled Hispanic.

TwilightTable <- TwilightTable %>% 
  mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>% 
  group_by(sunshine) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
  #select(black, white, other, year) %>% 
  datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
  datatable(TwilightTableTogether)

Looking JUST at street stops. A stop on a highway really isn’t “in” a neighborhood.

CLE_VODstops <- CLEstops %>% 
  left_join(
    sunset_times,
    by = "date"
    ) %>% 
  mutate(
    minute = time_to_minute(time),
    minutes_after_dark = minute - dusk_minute,
    is_dark = minute > dusk_minute,
    min_dusk_minute = min(dusk_minute),
    max_dusk_minute = max(dusk_minute),
    is_black = subject_race == "black"
    ) %>% 
  filter(
    # Filter to get only the inter-twilight period
    minute >= min_dusk_minute,
    minute <= max_dusk_minute,
    # Remove the ambiguous period between sunset and dusk (twilight)
    !(minute > sunset_minute & minute < dusk_minute)
  )

Taking out the highways to look at just street stops.

CLE_VODstops <- CLE_VODstops %>% 
  filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))

Filtering for just the timeframe needed.

CLE_VODstops <- CLE_VODstops %>% 
  filter(time > hm("16:58"), time < hm("21:39"))
StopsByYearInDarkness <- CLE_VODstops %>%
filter(is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total) %>% 
mutate(sunshine = "Darkness")
StopsByYearInDaylight <- CLE_VODstops %>%
filter(!is_dark) %>% 
mutate(year = year(date)) %>% 
group_by(year,subject_race) %>% 
summarize(Total =n()) %>% 
spread(subject_race, Total)%>% 
mutate(sunshine = "Daylight")

This combines the darkness and daylight tables.

TwilightTable <- rbind(StopsByYearInDarkness, StopsByYearInDaylight) %>% 
  group_by(year)
  datatable(TwilightTable)

Cleveland data has no stops labeled Hispanic.

TwilightTable <- TwilightTable %>% 
  mutate(other = `other/unknown`, total_stopped = `black` +`white`+`other/unknown`) %>% 
  group_by(sunshine) %>%
  mutate(black_percentage=`black`/total_stopped, white_percentage=white/total_stopped, other_percentage=other/total_stopped)
  #select(black, white, other, year) %>% 
  datatable(TwilightTable)
TwilightTableTogether <- TwilightTable %>% 
  group_by(sunshine) %>% 
  summarize(Percent_Of_Blacks_Stopped=mean(black_percentage),
         Percent_of_Whites_Stopped=mean(white_percentage),
         other=mean(other_percentage))
  datatable(TwilightTableTogether)

Looking at Arrests

Looking at Arrest Rates

ArrestsByRace <- CLEstops %>% 
  group_by(subject_race, Arrest) %>% 
  summarize(Total =n()) %>% 
  group_by(subject_race) %>% 
  mutate(pct=Total/sum(Total))
  datatable(ArrestsByRace)
ArrestsRateByRace <- na.omit(ArrestsByRace) 
ggplot(ArrestsRateByRace, aes(x=subject_race, y=Total, fill=Arrest))+
geom_col(position="dodge")

#ggsave("Cincinnati2012to2017ArrestRateByRace.jpg", plot = last_plot())
Arrest_Percentage_ByRace <- CLEstops %>%
  filter(Arrest=="Yes") %>%
  group_by(subject_race) %>% 
  summarize(Total=n()) %>% 
  mutate(pct=Total/sum(Total))
  datatable(Arrest_Percentage_ByRace)

Note: Cleveland doesn’t have search data.

Shapefiles from https://censusreporter.org/data/table/?table=B02001&geo_ids=16000US3916000,05000US39035,31000US17460,04000US39,01000US,150|16000US3916000,140|16000US3916000,150|16000US3916000&primary_geo_id=16000US3916000

Note these codes are from the [data dictionary](https://www.socialexplorer.com/data/ACS2016_5yr/metadata/?ds=ACS16_5yr&table=B02001).

#You'll have to set your file pane location here
setwd("~/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011")
CLE_RaceInfoByLocation <- sf::st_read("acs2017_5yr_B02001_15000US390351093011.shp")
## Reading layer `acs2017_5yr_B02001_15000US390351093011' from data source `/Users/Lucia/Code/SunlightStops/SunlightStopsStory/acs2017_5yr_B02001_15000US390351093011/acs2017_5yr_B02001_15000US390351093011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 467 features and 22 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.2302 ymin: -14.60134 xmax: 179.8597 ymax: 71.43979
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
CLE_RaceInfoByLocation <- CLE_RaceInfoByLocation %>% 
  mutate(
  Total=B02001001, 
  White_Alone=B02001002,
  Black_Alone=B02001003,
  PctWhite = round((White_Alone/Total)*100, 1),
  PctBlack = round((Black_Alone/Total)*100, 1)) %>% 
slice(-n())
# We only need the columns with the latitude and longitude
#We need the street stops only here. Stopping someone on the highway is not really in a neighborhood
#CLEstops <- rio::import("CLE_stops_cleaned_and_geocoded.csv")


CLEstops <- dplyr::distinct(CLEstops)
CLEstops$Ticket_Date <- mdy(CLEstops$Ticket_Date)
## Warning: All formats failed to parse. No formats found.
CLEstops$Ticket_Time <-parse_hm(CLEstops$Ticket_Time)
CLEstops <- CLEstops %>% 
  mutate(date=Ticket_Date, time=Ticket_Time, subject_race=Race)

# Check and eliminate the rows that are on a highway, which really isn't "in" a neighborhood
CLEstops <- CLEstops %>% 
  filter(!str_detect(TicketAddress,'90 E|90 W|71 N|71 S|76 N|76 S|77 N|77 S|480 E|480 W|I-71|I-480|I-77|I-76|I-90|I71|I76|I90|I77|I480'))
#Getting Cleveland block groups 
library(tigris)
options(tigris_class = "sf")
CLEBlockGroups <- block_groups("OH",county = c("Cuyahoga","Lorain","Lake","Medina"))

CLEstops <- filter(CLEstops, !is.na(lat))
CLEstops_spatial <- CLEstops %>% 
  st_as_sf(coords=c("lon", "lat"), crs="+proj=longlat") %>% 
  st_transform(crs=st_crs(CLEBlockGroups))



#Putting them altogether
CLEStopsWithTracts <- st_join(CLEBlockGroups, CLEstops_spatial)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#plotting them
ggplot(CLEBlockGroups) + 
  geom_sf(color="black") +
  geom_point(data=CLEstops, aes(x=lon, y=lat), color="red", alpha=0.05) +
  theme_void() +
  theme(panel.grid.major = element_line(colour = 'transparent')) +
  labs(title="Traffic stops in Cleveland")

#Vlookup
CLEStopsWithTracts$GEOID <- paste0("15000US",CLEStopsWithTracts$GEOID)

This puts the stops and race percentage of each block together.

CLEstops_With_Race_and_Tracts <-  CLEStopsWithTracts

CLEstops_With_Race_and_Tracts$PctWhite = CLE_RaceInfoByLocation$PctWhite[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]

CLEstops_With_Race_and_Tracts$PctBlack = CLE_RaceInfoByLocation$PctBlack[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]

CLEstops_With_Race_and_Tracts$Total_People_Per_Census_Block = CLE_RaceInfoByLocation$Total[match(CLEStopsWithTracts$GEOID, CLE_RaceInfoByLocation$geoid)]
CLEstops_With_Race_and_Tracts <- CLEstops_With_Race_and_Tracts %>% 
  filter(!is.na(PctWhite), !is.na(PctBlack))

Mostly_White_CLE <- CLEstops_With_Race_and_Tracts %>% 
filter(PctWhite>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))



Mostly_Black_CLE <- CLEstops_With_Race_and_Tracts %>% 
filter(PctBlack>75) %>% 
group_by(GEOID) %>%
summarize(
Total=n(), Per_Capita=Total/mean(Total_People_Per_Census_Block), Total_In_Block=mean(Total_People_Per_Census_Block))


mean(Mostly_White_CLE$Per_Capita)
## [1] 0.02781577
mean(Mostly_Black_CLE$Per_Capita)
## [1] 0.0487613

This looks at the rate in ALL predominantly black areas and all predominantly white areas together, and computes how much higher the stop rate is in the black areas.

Total_People_In_Mostly_White_CLE_Blocks <- sum(Mostly_White_CLE$Total_In_Block)
Total_People_In_Mostly_Black_CLE_Blocks <- sum(Mostly_Black_CLE$Total_In_Block)

Total_Stops_In_Mostly_White_CLE_Blocks <-sum(Mostly_White_CLE$Total)
Total_Stops_In_Mostly_Black_CLE_Blocks <-sum(Mostly_Black_CLE$Total)


Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
## [1] 0.02698263
Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks
## [1] 0.03113181
CLE_white_rate <- Total_Stops_In_Mostly_White_CLE_Blocks/Total_People_In_Mostly_White_CLE_Blocks
CLE_black_rate <- Total_Stops_In_Mostly_Black_CLE_Blocks/Total_People_In_Mostly_Black_CLE_Blocks

(CLE_black_rate-CLE_white_rate)/CLE_white_rate
## [1] 0.153772