This readme explains how to make best use of the data from the Stanford Open Policing Project. We provide an overview of the data and a list of best practices for working with the data.

Our analysis code and further documentation are available at https://github.com/5harad/openpolicing.

Overview of the data file structure

For each dataset, we provide 4 files:

A zipped csv file of the cleaned data
An RDS of the cleaned data
Tarballed (zipped) shapefiles
Tarballed (zipped) raw data (available upon request)

Description of standardized data

Each row in the cleaned data represents a stop. The following details the maximal set of features we attempted to extract from each location. Coverage varies by location. Fields with an asterisk were removed for public release due to privacy concerns. All columns except raw_row_number, violation, disposition, location, officer_assignment, any city or state subgeography (e.g., county, beat, division), unit, and vehicle_{color,make,model,type} are also digit sanitized (each digit replaced with "-") for privacy reasons.
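The digit sanitization described above can be sketched as follows. This is a minimal illustration in Python, not the project's own processing code; the function name sanitize_digits and the sample value are hypothetical.

```python
import re


def sanitize_digits(value):
    """Replace every ASCII digit in a string with "-"; pass None (NA) through."""
    if value is None:
        return None
    return re.sub(r"[0-9]", "-", value)


# Hypothetical free-text value from a sanitized column such as notes.
print(sanitize_digits("PLATE 4ABC123"))  # prints "PLATE -ABC---"
```

When reading the public files, a run of "-" characters in a text column therefore marks masked digits rather than missing data.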

Note that many locations have additional information that could be extracted (e.g., zip code), but we do not designate a standardized column for information beyond what is listed below, either because we do not use the information in our analysis and/or because not enough locations provided this information. We do pull through some additional columns (discussed on a location-by-location basis within this readme), which have column names prefixed with "raw_".

Column name	Column meaning	Example value
raw_row_number	A number used to join clean data back to the raw data	38299
date	The date of the stop, in YYYY-MM-DD format. Some states do not provide the exact stop date: for example, they only provide the year or quarter in which the stop occurred. For these states, date is set to the date at the beginning of the period: for example, January 1 if only year is provided.	"2017-02-02"
time	The 24-hour time of the stop, in HH:MM format.	20:15
location	The freeform text of the location. Occasionally, this represents the concatenation of several raw fields, i.e. street_number, street_name	"248 Stockton Rd."
lat	The latitude of the stop. If not provided by the department, we attempt to geocode any provided address or location using Google Maps. Google Maps returns a "best effort" response, which may not be completely accurate if the provided location was malformed or underspecified. To protect against spurious responses, geocodes more than 4 standard deviations from the median stop lat/lng are set to NA.	72.23545
lng	The longitude of the stop. If not provided by the department, we attempt to geocode any provided address or location using Google Maps. Google Maps returns a "best effort" response, which may not be completely accurate if the provided location was malformed or underspecified. To protect against spurious responses, geocodes more than 4 standard deviations from the median stop lat/lng are set to NA.	115.2808
geocode_source	The geocoding service used to geocode the address. Either Google Maps Geocoding API or the Stanford Geospatial Center's geocoding service. This column is only present for locations that are part of the 2023 update. (See below for further details.)	"GM"
county_name	County name where provided	"Allegheny County"
neighborhood	This is the neighborhood of the stop and some police departments will provide this instead of a location or beat.	"GRNBELT"
beat	Police beat. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the beat using the shapefiles.	8
district	Police district. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the district using the shapefiles.	8
subdistrict	Police subdistrict. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the subdistrict using the shapefiles.	8
division	Police division. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the division using the shapefiles.	8
subdivision	Police subdivision. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the subdivision using the shapefiles.	8
police_grid_number	Police grid number. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the police grid number using the shapefiles.	8
precinct	Police precinct. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the precinct using the shapefiles.	8
region	Police region. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the region using the shapefiles.	8
reporting_area	Police reporting area. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the reporting area using the shapefiles.	8
sector	Police sector. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the sector using the shapefiles.	8
subsector	Police subsector. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the subsector using the shapefiles.	8
substation	Police substation. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the substation using the shapefiles.	8
service_area	Police service area. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the service area using the shapefiles.	8
zone	Police zone. If not provided, but we have retrieved police department shapefiles and the location of the stop, we geocode the stop and find the zone using the shapefiles.	8
subject_age	The age of the stopped subject. When date of birth is given, we calculate the age based on the stop date. Values outside the range of 10-110 are coerced to NA.	54.23
subject_dob*	The date of birth of the stopped subject.	"1956-02-23"
subject_yob*	The year of birth of the subject.	1983
subject_race	The race of the stopped subject. Values are standardized to white, black, hispanic, asian/pacific islander, and other/unknown	"hispanic"
subject_sex	The recorded sex of the stopped subject.	"female"
officer_id*	Officer badge number or other form of identification provided by the department.	8
officer_id_hash	A unique hash of the officer id used to identify individual officers within a location. This is usually just a hash of the provided officer ID or badge number. In some places, notably South Carolina Statewide data and the Seattle, WA city data, this ID is not unique, so we hash it with other officer attributes to make it unique, i.e. officer_last_name, officer_race, etc. In about half of locations, however, we get an officer ID but no other officer information, so our ability to test for uniqueness and deduplicate is limited. In other locations, a small number of IDs are duplicated because of what appears to be a data entry issue, i.e. Chicago, IL, where 0.3% of officer IDs appear to be only the prefix of their ID.	"a888fdc120"
officer_age	The age of the officer. When date of birth is given, we calculate the age based on the stop date. Values outside the range of 10-100 are coerced to NA.	54.23
officer_dob*	The date of birth of the officer.	"1956-02-23"
officer_race	The race of the officer. Values are standardized to white, black, hispanic, asian/pacific islander, and other/unknown	"hispanic"
officer_sex	The recorded sex of the officer.	"female"
officer_first_name*	First name of the officer when provided.	"MIGUEL"
officer_last_name*	Last name of the officer when provided.	"JEFFERSON"
officer_years_of_service	Number of years officer has been with the police department.	22
officer_assignment	Department or subdivision to which officer has been assigned.	"8th District"
department_id	ID of department or subdivision to which officer has been assigned.	90
department_name	Name of department or subdivision to which officer has been assigned.	90
unit	Unit to which officer has been assigned.	"Patrol-1st"
type	Type of stop: vehicular or pedestrian.	"vehicular"
disposition	Disposition of stop where provided. What is recorded here varies widely across police departments.	"GUILTY"
violation	Specific violation of stop where provided. What is recorded here varies widely across police departments.	"SPEEDING 15-20 OVER"
arrest_made	Indicates whether an arrest was made.	FALSE
citation_issued	Indicates whether a citation was issued.	TRUE
warning_issued	Indicates whether a warning was issued.	TRUE
outcome	The strictest action taken among arrest, citation, warning, and summons.	"citation"
contraband_found	Indicates whether contraband was found. When search_conducted is NA, this is coerced to NA under the assumption that contraband shouldn't be discovered when no search occurred, so such a value likely represents a data error.	FALSE
contraband_drugs	Indicates whether drugs were found. This is only defined when contraband_found is true.	TRUE
contraband_weapons	Indicates whether weapons were found. This is only defined when contraband_found is true.	TRUE
contraband_other	Indicates whether contraband other than drugs and weapons was found. This is only defined when contraband_found is true.	TRUE
frisk_performed	Indicates whether a frisk was performed. This is technically different from a search, but departments will sometimes include frisks as a search type.	TRUE
search_conducted	Indicates whether any type of search was conducted, i.e. driver, passenger, vehicle. Frisks are excluded where the department has provided resolution on both.	TRUE
search_person	Indicates whether a search of a person has occurred. This is only defined when search_conducted is TRUE.	TRUE
search_vehicle	Indicates whether a search of a vehicle has occurred. This is only defined when search_conducted is TRUE.	TRUE
search_basis	This provides the reason for the search where provided and is categorized into k9, plain view, consent, probable cause, and other. If a search occurred but the reason wasn't listed, we assume probable cause.	"consent"
reason_for_arrest	A freeform text field indicating the reason for arrest where provided.	"outstanding warrant"
reason_for_frisk	A freeform text field indicating the reason for frisk where provided.	"suspicious movement"
reason_for_search	A freeform text field indicating the reason for search where provided.	"odor of marijuana"
reason_for_stop	A freeform text field indicating the reason for the stop where provided.	"EQUIPMENT MALFUNCTION"
speed	The recorded speed of the vehicle for the stop.	76.2
posted_speed	The speed limit where the stop was recorded.	55
use_of_force_description	A freeform text field describing the use of force.	"handcuffed"
use_of_force_reason	A freeform text field describing the reason for the use of force.	"weapons / violence related incident"
vehicle_color	A freeform text of the vehicle color where provided; format varies widely.	"BLK"
vehicle_make	A freeform text of the vehicle make where provided; format varies widely.	"TOYOTA"
vehicle_model	A freeform text of the vehicle model where provided; format varies widely.	"Cherokee"
vehicle_type	A freeform text of the vehicle type where provided; format varies widely.	"TRUCK"
vehicle_registration_state	A freeform text of the vehicle registration state where provided; format varies widely.	"CA"
vehicle_year	Vehicle manufacture year where provided. This value is NA for any year before 1800.	2007
notes	A freeform text field containing any officer notes.	"NO PASSENGERS"

* Removed for public release for privacy reasons.
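As an illustration of the lat/lng outlier rule in the table above, the following Python sketch sets a geocode to NA when either coordinate lies more than 4 standard deviations from the median. It is a sketch under stated assumptions, not the project's implementation: the function name is hypothetical, the standard deviation is the sample standard deviation of all known coordinates, and latitude and longitude are tested separately.

```python
import statistics


def drop_outlier_geocodes(points, max_sd=4.0):
    """points: list of (lat, lng) tuples, using None for missing values.

    Returns a new list in which geocodes far from the median are
    replaced by (None, None).
    """
    known = [(lat, lng) for lat, lng in points
             if lat is not None and lng is not None]
    if len(known) < 2:
        return list(points)  # not enough data to estimate a spread
    lats = [lat for lat, _ in known]
    lngs = [lng for _, lng in known]
    lat_med, lng_med = statistics.median(lats), statistics.median(lngs)
    lat_sd, lng_sd = statistics.stdev(lats), statistics.stdev(lngs)

    cleaned = []
    for lat, lng in points:
        if lat is None or lng is None:
            cleaned.append((None, None))
        elif (abs(lat - lat_med) > max_sd * lat_sd
              or abs(lng - lng_med) > max_sd * lng_sd):
            cleaned.append((None, None))  # assumed rule: either axis out of range
        else:
            cleaned.append((lat, lng))
    return cleaned
```

Note that a single bad geocode can only exceed 4 standard deviations when there are enough other stops to keep the spread small, so the rule has little effect on very small samples.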

Best practices

We provide some lessons we’ve learned from working with these rich but complicated data.

Read over the notes and processing code if you are going to focus on a particular location, so you’re aware of the judgment calls we made in processing the data. Taking a look at the original raw data is also wise (and may uncover additional fields of interest).
Start with the cleaned data from a single small location to get a feel for the data. Rhode Island, Vermont, and Connecticut all load quickly.
Note that loading and analyzing every state simultaneously takes significant time and computing resources. One way to get around this is to compute aggregate statistics from each state. For example, you can compute search rates for each age, gender, and race group in each state, save those rates, and then quickly load them to compute national-level statistics broken down by age, race, and gender.
Take care when making direct comparisons between locations. For example, if one state has a far higher consent search rate than another state, that may reflect a difference in search recording policy across states, as opposed to an actual difference in consent search rates.
Examine counts over time in each state: for example, total numbers of stops and searches by month or year. This will help you find years for which data is very sparse (which you may not want to include in analysis).
Do not assume that all disparities are due to discrimination. For example, if young men are more likely to receive citations after being stopped for speeding, this might simply reflect the fact that they are driving faster.
Do not assume the standardized data are absolutely clean. We discovered and corrected numerous errors in the original data, which were often very sparsely documented and changed from year to year, requiring us to make educated guesses. This messy nature of the original data makes it unlikely the cleaned data are perfectly correct.
Do not read too much into very high stop, search, or other rates in locations with very small populations or numbers of stops. For example, if a county has only 100 stops of Hispanic drivers, estimates of search rates for Hispanic drivers will be very noisy and hit rates will be even noisier. Similarly, if a county with very few residents has a very large number of stops, it may be that the stops are not of county residents, making stop rate computations misleading.
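The per-state aggregation suggested above can be sketched in Python as follows. This is a minimal example under assumptions: the function names are hypothetical, each location is read one at a time from a CSV with the standardized column names, and logical values are assumed to be written as the strings TRUE, FALSE, and NA. Counts, not rates, are saved per location so that the combined rate weights each location by its number of stops.

```python
import csv
import io


def search_counts(csv_file):
    """Return {(subject_race, subject_sex): [n_stops, n_searches]} for one location."""
    counts = {}
    for row in csv.DictReader(csv_file):
        if row["search_conducted"] not in ("TRUE", "FALSE"):
            continue  # skip stops where search_conducted is missing
        key = (row["subject_race"], row["subject_sex"])
        tally = counts.setdefault(key, [0, 0])
        tally[0] += 1
        tally[1] += row["search_conducted"] == "TRUE"
    return counts


def combine_rates(per_location_counts):
    """Pool the per-location counts and return a search rate per group."""
    total = {}
    for counts in per_location_counts:
        for key, (stops, searches) in counts.items():
            tally = total.setdefault(key, [0, 0])
            tally[0] += stops
            tally[1] += searches
    return {key: searches / stops for key, (stops, searches) in total.items()}


# Tiny in-memory stand-ins for two location files.
first = io.StringIO("subject_race,subject_sex,search_conducted\n"
                    "white,male,TRUE\nwhite,male,FALSE\n")
second = io.StringIO("subject_race,subject_sex,search_conducted\n"
                     "white,male,FALSE\nwhite,male,FALSE\n")
print(combine_rates([search_counts(first), search_counts(second)]))
```

Because only the small count tables are kept in memory, the same pattern extends to any number of locations.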

The following contains date ranges, coverage rates, and some notes on each location. A coverage rate is 1 - null rate, so it represents the proportion of data that have values for that feature. The reported coverage rates are also predicated, which means that some columns' coverage is calculated only after filtering on another column. For instance, the coverage for contraband_found is reported after filtering to instances where search_conducted was true. In a similar fashion, search_basis and reason_for_search are only calculated when search_conducted is true, reason_for_arrest when arrest_made is true, and contraband_drugs, contraband_weapons, contraband_alcohol, and contraband_other when contraband_found is true.
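A predicated coverage rate can be computed along these lines. The sketch is in Python with a hypothetical function name; rows are assumed to be dictionaries in which a missing value (NA) is represented by None.

```python
def coverage_rate(rows, column, predicate=None):
    """Share of rows with a non-missing value in `column`.

    If `predicate` names another column, only rows where that column
    is True are considered (the "predicated" rate described above).
    Returns None when no rows qualify.
    """
    if predicate is not None:
        rows = [row for row in rows if row.get(predicate) is True]
    if not rows:
        return None
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


stops = [
    {"search_conducted": True, "contraband_found": False},
    {"search_conducted": True, "contraband_found": None},
    {"search_conducted": False, "contraband_found": None},
    {"search_conducted": False, "contraband_found": None},
]
print(coverage_rate(stops, "contraband_found"))                      # 0.25
print(coverage_rate(stops, "contraband_found", "search_conducted"))  # 0.5
```

The example shows why the predicate matters: contraband_found is expected to be missing whenever no search occurred, so the unpredicated rate understates how well the field is recorded.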

The notes are not intended to be a comprehensive description of all the data features in every state, since this would be prohibitively lengthy. Rather, they are brief observations we made while processing the data. We hope they will be useful to others. They are worth reading prior to performing detailed analysis of a location.

Our analysis only scratches the surface of what’s possible with these data. We’re excited to see what you come up with!

2023 Update

We are releasing 2.8 million additional stops for the following locations:

Mesa, AZ
Aurora, CO
Chicago, IL
Wichita, KS
Louisville, KY
Oklahoma City, OK
Statewide, MT
Houston, TX
San Antonio, TX
Burlington, VT
Madison, WI

The new data cover the period 2017–2020, as available on a location-by-location basis. We geocoded new addresses in the updated data through the Stanford Geospatial Center. A new column, labelled geocode_source, indicates whether the address corresponding to a particular record was geocoded using the Google Maps geocoding service or the Stanford Geospatial Center's geocoding service.

In addition to the new data, the updating process has resulted in minor changes to the standardization of existing data in the locations listed above. Most of these changes are location-specific and minor, and are described in more detail in the entry for the corresponding city. However, two changes have affected data from existing periods more broadly:

Some addresses which were not previously geocoded were successfully geocoded in the data refresh. As a result, addresses corresponding to those locations in existing data (i.e., data on stops up to and including early 2018) may now have latitude and longitude designations that previously did not.
The latitudes and longitudes of a large number of records are no longer exactly equal (as double-precision floats) to their previous values. The discrepancy is confined to the two least significant bits. In particular, the discrepancies are all at most around $10^{-14}$ degrees of longitude or latitude, or roughly one nanometer of real-world distance. While we are unsure of the exact source of these discrepancies, we think they likely resulted from a change to the readr package's decimal parsing that occurred between the previous and current OPP data releases. (See, e.g., here for some relevant details.)
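When comparing coordinates across releases, a tolerance-based comparison avoids flagging these last-bit differences as changes. The Python sketch below is one way to do this; the function name and the tolerance of 1e-12 degrees are assumptions chosen to sit comfortably above the roughly 1e-14 discrepancy described above.

```python
import math


def same_coordinate(old, new, tolerance=1e-12):
    """True if two coordinates agree up to `tolerance` degrees.

    None stands for NA; two missing values count as equal.
    """
    if old is None or new is None:
        return old is None and new is None
    return math.isclose(old, new, rel_tol=0.0, abs_tol=tolerance)


old = 37.80412345678901
new = old + 2 ** -47  # one unit in the last place for doubles between 32 and 64
print(old == new)                 # False: exact comparison reports a change
print(same_coordinate(old, new))  # True: the difference is below the tolerance
```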

Finally, we have corrected the county codes for the California Highway Patrol data (Statewide, CA), but have added no new data.

Little Rock, AR

2017-01-01 to 2017-11-03

feature	coverage rate
date	100.0%
time	100.0%
lat	2.4%
lng	2.4%
subject_age	99.8%
subject_race	100.0%
subject_sex	99.8%
officer_first_name	100.0%
officer_last_name	100.0%
type	100.0%
citation_issued	100.0%
outcome	100.0%
vehicle_type	100.0%
raw_defendant_race	100.0%

Data notes:

lat/lng data doesn't appear totally accurate; ~18k lat/lngs were coerced to NA because they all equalled "-1.79769313486232E+308"
Data is deduplicated on date, time, lat, lng, race, sex, and officer name, reducing the number of records by ~30.6%
Data consists only of citations
raw_defendant_race represents Defendant Race in the raw data and is the column from which subject_race is derived
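The two cleaning steps noted above (coercing the sentinel coordinate to NA and deduplicating on a set of columns) might look like the following Python sketch. The function names are hypothetical, and keeping the first row of each duplicate group is an assumption. Depending on the parser, the sentinel may come through as a huge finite number or as negative infinity; the range check below rejects both.

```python
import math


def clean_coordinate(raw, limit):
    """Parse a raw coordinate; return None (NA) if it is blank, not a
    number, not finite, or larger in magnitude than `limit` degrees."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value) or abs(value) > limit:
        return None
    return value


def deduplicate(rows, keys):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row.get(key) for key in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


print(clean_coordinate("-1.79769313486232E+308", limit=90))  # None
print(clean_coordinate("34.7465", limit=90))                 # 34.7465
```

For this location the keys would be date, time, lat, lng, subject_race, subject_sex, and the officer name columns.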

Gilbert, AZ

2008-01-01 to 2018-05-23

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	99.9%
lng	99.9%
officer_id	100.0%
officer_id_hash	100.0%
officer_first_name	100.0%
officer_last_name	100.0%
type	100.0%
vehicle_color	0.1%
vehicle_make	0.1%
vehicle_model	0.1%
vehicle_year	0.0%

Data notes:

Data is deduplicated on call_id, reducing the number of records by 17.6%; this was equivalent to deduping on date, time, location, and officer_id; subject name appears to have been entered multiple times per call_id, and often in subtly different formats
Most important data is missing, including outcome (arrest, citation, warning), reason for stop, search, contraband, and demographic information on the subject (except name, which is redacted for privacy)
call_type was either TS (traffic stop) or SS (subject stop), which we translated to 'vehicular' or 'pedestrian' stops

Mesa, AZ

2014-01-01 to 2019-09-23

feature	coverage rate
date	100.0%
time	99.6%
location	99.6%
lat	99.3%
lng	99.3%
geocode_source	99.3%
subject_age	98.4%
subject_race	100.0%
subject_sex	98.5%
officer_id	100.0%
officer_id_hash	100.0%
officer_last_name	100.0%
type	93.7%
violation	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
raw_race_fixed	100.0%
raw_ethnicity_fixed	100.0%
raw_charge	100.0%

Data notes:

INCIDENT_NO appears to refer to the same incident but can involve multiple people, i.e. 20150240096, which appears to be an alcohol bust of several underage teenagers; in other instances, the rows look nearly identical, but given this information and checking several other seeming duplicates, it appears as though there is one row per person per incident
violation is charge_desc in the raw data, and raw_charge represents the charge code in the raw data
subject_race was derived from ethnicity_fixed and race_fixed in the raw data, provided in the clean data with raw_*

2023 update:

60,394 new stops added.
More accurate data has been used for early 2017, resulting in modest changes (e.g., one letter differences) in approximately 700 rows, as well as 18 new rows corresponding to stops that were not previously represented.

Statewide, AZ

2009-01-06 to 2017-12-31

feature	coverage rate
date	99.9%
time	99.9%
location	99.9%
county_name	89.3%
subject_race	99.9%
subject_sex	99.9%
officer_id	99.9%
officer_id_hash	99.9%
type	100.0%
violation	33.2%
arrest_made	99.9%
citation_issued	99.9%
warning_issued	99.9%
outcome	89.3%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_other	6.0%
search_conducted	100.0%
search_person	98.5%
search_vehicle	100.0%
search_basis	100.0%
reason_for_stop	69.3%
vehicle_type	97.2%
vehicle_year	98.4%
raw_Ethnicity	99.9%
raw_OutcomeOfStop	99.9%
raw_ReasonForStop	99.9%
raw_TypeOfSearch	3.3%
raw_ViolationsObserved	56.5%

Data notes:

Counties were mapped in two ways. First, we determined which counties the codes in the County field referred to by using the highways which appeared most frequently in each coded county. Second, for stops which had no data in the County field, we used the values in the Highway and Milepost fields to estimate where the stop took place. For this, we relied on highway marker maps (sources: here and here) to map the most frequently traversed highways, which covered the vast majority of stops. Using these two methods, we were able to map 95% of stops which had any location data (i.e., values in either County or Highway and Milepost), and 89% of stops overall.
It would be possible to map the highway and mile marker data to geo coordinates, like we did in Washington.
There is a two-week period in October 2012 and a two-week period in November 2013 when no stops are recorded. We also are missing December 2015. Dates are sparse in 2009–2010 (and even up until mid-2011).
We also received a file with partial data on traffic stops pre-2009; this is not included in the dataset.
Data for violation reason is largely missing.
Raw column VehicleSearchAuthority and DriverSearchAuthority seem to provide search basis but we lack a mapping for the codes. ConsentSearchAccepted gives us information on search type for a small fraction of searches.
raw_TypeOfSearch includes information on who was searched (e.g., driver vs. passenger), but does not provide information on the type of search (e.g., probable cause vs. consent).
Some contraband information is available and so we define a contraband_found column in case it is useful to other researchers. But the data is messy and there are multiple ways contraband_found might be defined, and so we do not include Arizona in our contraband analysis.
Additional raw data columns that may be of interest: ConsentSearchRequested (note that there is also a raw column ConsentSearchAccepted -- which populates the clean values search_basis == "consent"), IfConsentRequestGranted, (FS, RS, NA), SubjectDemeanor (CO, UN, CM, NA), StopDuration (A-F, NA), DistractedDriving (1-2 word free field), ImmigrationStatusCheck (boolean, nearly all NA), VehicleImpounded (Y, N, I, NA), ImpoundReason (LI, NL, CN, DE, DM, II, UA, NA), TypeOfContact (D, P, E, N, C, NA), DrugSeizureType (combinations of P, S, T), DUIBAC, DUICharges, DUITests (combinations of B, I, U), PreStopIndicator (VT = "Vehicle Type, Condition or Modification", BL = "Driver Body Language", PB = "Passenger Behavior", DB = "Driving Behavior", OT = "Other", NO = "None")

Anaheim, CA

2012-01-01 to 2017-03-14

feature	coverage rate
date	100.0%
type	100.0%
reason_for_stop	100.0%

Data notes:

Very little information received, only a reference number, date, year, case type (with no translation), and a case type (with no translation)
reason_for_stop is Final Case Type D in the raw data

Bakersfield, CA

2008-03-09 to 2018-03-09

feature	coverage rate
date	100.0%
time	99.5%
location	99.9%
lat	98.6%
lng	98.6%
beat	91.3%
subject_age	99.5%
subject_dob	99.4%
subject_race	99.6%
subject_sex	99.6%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%
citation_issued	100.0%
outcome	100.0%
raw_ethnicity	95.8%
raw_statute_name	100.0%
raw_statute_section	100.0%
raw_race	99.6%

Data notes:

Data is deduplicated on raw columns date_of_birth, subject_address, ethnicity, gender_code, occ_date, occ_time, reducing the number of records by ~1.2%
Data does not include reason for stop, search, contraband fields
Missing data dictionaries for ticket classes, ticket statuses, and statute section
subject_race is based on ethnicity and race, the raw columns are provided in the clean data
We currently have no data dictionaries for statute_section and statute_name, but they are passed through to the clean data
Data consists only of citations

Oakland, CA

2013-04-01 to 2017-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	99.9%
lng	99.9%
beat	45.7%
subject_age	23.0%
subject_race	100.0%
subject_sex	99.9%
officer_assignment	9.0%
type	85.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	74.4%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
search_conducted	100.0%
search_basis	100.0%
reason_for_stop	100.0%
use_of_force_description	12.5%
raw_subject_sdrace	100.0%
raw_subject_resultofencounter	100.0%
raw_subject_searchconducted	100.0%
raw_subject_typeofsearch	60.9%
raw_subject_resultofsearch	16.3%

Data notes:

Data is deduplicated on raw columns contactdate, contacttime, streetname, subject_sdrace, subject_sex, and subject_age, reducing the number of records by ~5.2%
Stops from 2013-2015 don't have encountertype like 2016-2017, so we attempt to pull it out from ReasonForEncounter; however, this breakdown is imprecise, because while one category is "Traffic Violation", another is "Probable Cause"; presumably, "Probable Cause" could be a reason for a vehicular stop; so, the stop is type vehicular if the encountertype was vehicular or the reason for encounter involved a traffic violation; it was classified as pedestrian if the encountertype was pedestrian or bicycle, otherwise this field is NA, since we can't say whether "Probable Cause" or "Reasonable Suspicion" was a vehicular or pedestrian stop
Contraband is encoded based on ResultOfSearch (pedestrian) and subject_resultofsearch (vehicular); None, NA, and anything with "Returned" after it are excluded, i.e. "Marijuana - Returned", "Other Weapons - Returned," under the assumption that returned items were not contraband
2013 is missing the first 3 months of data and 2015 is missing the last 3 months of data
Some of the raw columns were named similarly but not exactly the same across years, i.e. ResultOfSearch in 2013, 2014, and 2015, but subject_resultofsearch in 2016 and 2017; these were renamed to be consistent with the latter years in the raw data loading function
subject_{resultofencounter,typeofsearch,search_conducted,resultofsearch} formed the foundation for search and contraband fields and are passed through in the clean data
subject_race is derived from subject_sdrace, which is passed through to the clean data
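The stop type and contraband rules described in the notes above can be sketched in Python as follows. The function names are hypothetical and the exact raw strings (such as "Vehicle" for a vehicular encounter type) are assumptions; only the decision logic is taken from the notes.

```python
def stop_type(encounter_type, reason_for_encounter):
    """Classify a stop as vehicular, pedestrian, or None (NA)."""
    encounter = (encounter_type or "").lower()
    reason = (reason_for_encounter or "").lower()
    if "vehic" in encounter or "traffic violation" in reason:
        return "vehicular"
    if "pedestrian" in encounter or "bicycle" in encounter:
        return "pedestrian"
    # e.g. "Probable Cause" or "Reasonable Suspicion" with no encounter type:
    # the stop could be either kind, so the type is left missing.
    return None


def counts_as_contraband(result_of_search):
    """False for missing values, "None"/"NA", and returned items."""
    if result_of_search is None:
        return False
    text = result_of_search.strip()
    if text in ("", "None", "NA"):
        return False
    return not text.endswith("Returned")


print(stop_type(None, "Traffic Violation"))          # vehicular
print(stop_type(None, "Probable Cause"))             # None
print(counts_as_contraband("Marijuana - Returned"))  # False
```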

San Bernardino, CA

2011-12-13 to 2017-09-19

feature	coverage rate
date	100.0%
time	100.0%
location	98.6%
lat	93.0%
lng	93.0%
type	72.0%
disposition	99.7%
arrest_made	99.7%
citation_issued	99.7%
outcome	37.8%
raw_CallType	100.0%

Data notes:

Data is deduplicated on raw columns CreateDateTime, Address, and CallType, removing ~26.3% of records
Data does not include most useful information, including demographic, outcome, and search/contraband information, so the deduplication above potentially over-deduplicates
type is derived from CallType, which is passed through since we lack a data dictionary for some of them

Long Beach, CA

2008-01-01 to 2017-12-31

feature	coverage rate
date	100.0%
location	100.0%
lat	74.4%
lng	74.4%
beat	66.2%
district	66.2%
subdistrict	66.2%
division	66.2%
subject_age	94.5%
subject_race	100.0%
subject_sex	99.9%
officer_id	100.0%
officer_id_hash	100.0%
officer_age	100.0%
officer_race	100.0%
officer_sex	100.0%
officer_years_of_service	100.0%
type	92.2%
violation	99.7%
citation_issued	100.0%
outcome	100.0%
vehicle_make	85.6%
vehicle_registration_state	83.9%
vehicle_year	81.2%
raw_race	100.0%
raw_sex	100.0%
raw_officer_race	100.0%

Data notes:

Data is deduplicated on raw columns Date, Location, Race, Sex, and Officer DID, reducing the number of records by ~14.3%
Data does not include reason for stop, search, or contraband fields
violation is a concatenation of 4 violation descriptions, separated by ';'
type is derived from violation_1_description
raw columns sex, race, and officer_race are passed through since our translations may simplify them
There is a notable drop in stops from 2008 to 2016; the cause is unclear

Los Angeles, CA

2010-01-01 to 2018-06-23

feature	coverage rate
date	100.0%
time	100.0%
district	100.0%
region	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%
raw_descent_description	100.0%

Data notes:

Data is deduplicated on raw columns stop_date, stop_time, reporting_district, division_description_1, division_description_2, officer_1_serial_number, officer_2_serial_number, descent_description, sex_code, and stop_type, reducing the number of records by ~17.7%
Search/contraband, outcome, and location data are missing
subject_race is derived from descent_description, which is passed through

San Diego, CA

2014-01-01 to 2017-03-31

feature	coverage rate
date	100.0%
time	99.8%
service_area	100.0%
subject_age	96.9%
subject_race	99.7%
subject_sex	99.8%
type	100.0%
arrest_made	90.9%
citation_issued	91.6%
warning_issued	91.6%
outcome	89.8%
contraband_found	100.0%
search_conducted	100.0%
search_person	99.4%
search_vehicle	99.4%
search_basis	100.0%
reason_for_search	87.7%
reason_for_stop	99.9%
raw_action_taken	91.6%
raw_subject_race_description	99.7%

Data notes:

stop_id in raw data doesn't appear to apply to unique events, as the same id has different service_area, subject_race, subject_age, and subject_sex, e.g., 1099162
Data is deduplicated on raw columns timestamp, subject_race, subject_sex, subject_age, and service_area, reducing the number of records by ~2.0%
There are no locations, but service_area is provided
subject_race is derived from subject_race_description which is passed through
reason_for_search is named SearchBasis in the raw data and search_basis is derived from this column
outcomes are based on ActionTaken, which is passed through as raw_action_taken
search_conducted is named searched in the raw data; when searched is NA, this is interpreted as FALSE for search_conducted under the assumption that officers sometimes don't record the absence of a search. Furthermore, where searched is NA, SearchBasis, SearchBasisOther, and SearchType are all NA, as well, suggesting that no search occurred
If search_conducted was true but contraband_found was NA, it was changed to false, under the assumption that NA means false when a search is performed
2017 only has data for part of the year
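The two NA-handling rules above for search_conducted and contraband_found can be expressed as a short Python sketch. The function name is hypothetical, None stands for NA, and leaving contraband untouched when no search was recorded is an assumption, since the notes only cover the case where a search occurred.

```python
def clean_search_fields(searched, contraband):
    """Return (search_conducted, contraband_found) from raw values.

    Each input is True, False, or None (NA).
    """
    # A missing `searched` value is read as "no search recorded".
    search_conducted = False if searched is None else bool(searched)
    if search_conducted and contraband is None:
        # A search with no recorded result is read as "nothing found".
        contraband_found = False
    else:
        contraband_found = contraband
    return search_conducted, contraband_found


print(clean_search_fields(None, None))  # (False, None)
print(clean_search_fields(True, None))  # (True, False)
print(clean_search_fields(True, True))  # (True, True)
```

Both rules push missing values toward FALSE, so search and hit rates computed from this location rest on those assumptions.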

San Francisco, CA

2007-01-01 to 2016-06-30

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	99.8%
lng	99.8%
district	94.2%
subject_age	93.5%
subject_race	100.0%
subject_sex	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	98.3%
contraband_found	100.0%
search_conducted	100.0%
search_vehicle	100.0%
search_basis	100.0%
reason_for_stop	99.8%
raw_search_vehicle_description	100.0%
raw_result_of_contact_description	100.0%

Data notes:

Search basis in the raw data is only "No Search", consent, or other (inventory, incident to arrest, and parole searches)
Contraband found is derived from search_vehicle_description, which, unfortunately, only has the search basis and "Positive Result" or "Negative Result", the former indicating when contraband found is true; it is passed through to the clean data
outcomes are based on result_of_contact_description, which is passed through
Data is deduplicated on raw columns date, time, race_description, sex, age, location, removing ~0.3% of stops
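
As a rough sketch of what deduplicating on a set of raw columns means, the snippet below keeps the first row for each distinct combination of the key columns. The column names come from the note above; the sample rows are made up for illustration, and the helper name is hypothetical.

```python
def deduplicate(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row.get(col) for col in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


KEY_COLUMNS = ["date", "time", "race_description", "sex", "age", "location"]

# Made-up rows: the first two agree on every key column.
sample_stops = [
    {"date": "2015-03-01", "time": "20:15", "race_description": "White",
     "sex": "M", "age": 34, "location": "MISSION ST"},
    {"date": "2015-03-01", "time": "20:15", "race_description": "White",
     "sex": "M", "age": 34, "location": "MISSION ST"},
    {"date": "2015-03-01", "time": "21:40", "race_description": "White",
     "sex": "M", "age": 34, "location": "MISSION ST"},
]

unique_stops = deduplicate(sample_stops, KEY_COLUMNS)
# Share of rows removed by deduplication in this toy sample.
removed_share = 1 - len(unique_stops) / len(sample_stops)
```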

San Jose, CA

2013-09-01 to 2018-03-31

feature	coverage rate
date	100.0%
time	100.0%
location	92.5%
lat	88.9%
lng	88.9%
subject_race	96.4%
type	88.7%
arrest_made	99.8%
citation_issued	99.8%
outcome	38.0%
contraband_found	100.0%
search_conducted	97.7%
reason_for_stop	94.8%
use_of_force_description	88.6%
use_of_force_reason	92.2%
raw_search	96.4%
raw_call_desc	100.0%
raw_race	96.4%
raw_event_desc	99.8%

Data notes:

event_number in raw data has indeterminate meaning: several event numbers occur at the same time and have up to 16 duplicates; however, some of these involve different subjects, so it's unclear whether they are distinct incidents or large incidents involving many people
Data is deduplicated using date, time, location, subject race, and raw_search (SEARCH in raw data); this removes ~4.4% of records, but many of these rows are lacking sufficient information for differentiation, i.e. they have NA for many of their values
search_conducted is derived from SEARCH (raw_search in clean data); NAs in the original column are converted to FALSE, under the assumption that officers sometimes don't record the absence of a search. However, there are values other than the ones provided in the data dictionary (and that are not NA); these are converted to NA for search_conducted but are available in the raw_search column for review; some of them appear to be the result of malformed rows and/or incorrect data entry, e.g. some of them could be race classifications
type is based on TYCOD DESCRIPTION, which is passed through as raw_call_desc
race is passed through to provide access to greater granularity
a translation of EVENT DISPO is provided as raw_event_desc; this was used for outcomes
2013 and 2018 only have partial data

Santa Ana, CA

2014-06-11 to 2018-04-13

feature	coverage rate
date	100.0%
location	100.0%
lat	99.9%
lng	99.9%
district	96.1%
region	96.6%
subject_race	99.8%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
type	99.9%
violation	100.0%
citation_issued	100.0%
outcome	100.0%
raw_race	99.8%

Data notes:

Deduping on raw columns Date, Race, Sex, Violation Description, Officer (Badge), and Primary Street would reduce this dataset by ~9.7%, but there is insufficient information to justify this without the incident time. For instance, the highest frequency "incident" when deduping on those criteria was 16 male Hispanic drivers failing to stop at a stop sign, cited by the same officer on 5th Street; while this could be 16 duplicates, it could also be the same officer pulling over 16 people throughout that day
Data does not include search or contraband information
Data includes only citations
2014 and 2018 only contain partial data
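
Before deciding whether to deduplicate, it can help to look at the largest groups that a candidate key would collapse, as in the 16-row example above. The sketch below assumes rows are dictionaries keyed by the raw column names; the helper name and the sample rows are hypothetical.

```python
from collections import Counter

KEY_COLUMNS = ["Date", "Race", "Sex", "Violation Description",
               "Officer (Badge)", "Primary Street"]


def largest_duplicate_groups(rows, key_columns, top=3):
    """Return the most frequent key combinations that occur more than once."""
    counts = Counter(tuple(row.get(col) for col in key_columns) for row in rows)
    return [(key, n) for key, n in counts.most_common(top) if n > 1]


# Made-up citations: three share every key column, one differs by street.
base = {"Date": "2016-02-10", "Race": "H", "Sex": "M",
        "Violation Description": "FAIL TO STOP AT STOP SIGN",
        "Officer (Badge)": "1234", "Primary Street": "5TH ST"}
citations = [dict(base), dict(base), dict(base),
             dict(base, **{"Primary Street": "MAIN ST"})]

groups = largest_duplicate_groups(citations, KEY_COLUMNS)
```

Inspecting `groups` shows whether the biggest clusters look like true duplicates or like plausible repeated enforcement at one location.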

Statewide, CA

2009-07-01 to 2016-06-30

feature	coverage rate
date	100.0%
county_name	99.7%
district	99.7%
subject_race	100.0%
subject_sex	100.0%
department_name	100.0%
type	100.0%
violation	100.0%
arrest_made	69.8%
citation_issued	69.8%
warning_issued	69.8%
outcome	69.8%
contraband_found	4.3%
frisk_performed	0.2%
search_conducted	100.0%
search_person	96.8%
search_basis	100.0%
reason_for_stop	100.0%
raw_race	100.0%
raw_search_basis	100.0%

Data notes:

CHP districts roughly map to counties, so we mapped stops to counties using the map of CHP districts, which is included in the raw data. Some counties appear to have very high stop rates; this is because they have very small populations. It seems likely that many of the drivers stopped in those counties are not actually residents.
Driver age categories are included in the raw data; these cannot be mapped to granular values, so we cannot fill out the driver_age field.
Driver race was recorded with high granularity. Raw mapping:
- A = Other Asian
- B = Black
- C = Chinese
- D = Cambodian
- F = Filipino
- G = Guamanian
- H = Hispanic
- I = Indian
- J = Japanese
- K = Korean
- L = Laotian
- O = Other
- P = Other Pacific Islander
- S = Samoan
- U = Hawaiian
- V = Vietnamese
- W = White
- Z = Asian Indian
subject_race is mapped from raw_race above.
Search basis was recorded more finely in raw data. Raw mapping:
- 1 = Probable Cause (positive)
- 2 = Probable Cause (negative)
- 3 = Consent (positive), 202D Required
- 4 = Consent (negative), 202D Required
- 5 = Incidental to Arrest
- 6 = Vehicle Inventory
- 7 = Parole / Probation / Warrant
- 8 = Other
- 9 = Pat Down / Frisk
search_basis is mapped from raw_search_basis above.
Very few consent searches are conducted relative to other states.
Contraband found information is only available for a small subset of searches: the raw data can tell you if a probable cause search or a consent search yielded contraband, but cannot tell you if contraband was located during a search conducted incident to arrest. (Note that in many cases we cast NA contraband to F, but in this case we do not, because we simply do not have contraband recovery data for non-discretionary searches). We still include California in our contraband analysis because we exclude non-discretionary searches like those incident to arrest.
Shift time is included in the raw data, but is not sufficiently granular to yield a reliable stop time.
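
The raw search basis codes listed above carry two pieces of information at once: the basis and, for codes 1 to 4, whether contraband was found. A minimal sketch of decoding them is shown below. The raw code meanings come from the list above; grouping codes 5 to 9 under "other" and the lowercase category labels are assumptions, not necessarily the mapping used to build the clean data.

```python
# raw code -> (search_basis, contraband_found); None means "not recorded".
RAW_SEARCH_BASIS = {
    "1": ("probable cause", True),   # Probable Cause (positive)
    "2": ("probable cause", False),  # Probable Cause (negative)
    "3": ("consent", True),          # Consent (positive)
    "4": ("consent", False),         # Consent (negative)
    "5": ("other", None),            # Incidental to Arrest (assumed "other")
    "6": ("other", None),            # Vehicle Inventory (assumed "other")
    "7": ("other", None),            # Parole / Probation / Warrant (assumed)
    "8": ("other", None),            # Other
    "9": ("other", None),            # Pat Down / Frisk (assumed "other")
}


def decode_search_basis(code):
    """Return (search_basis, contraband_found) for a raw code, or (None, None)."""
    if code is None:
        return (None, None)
    return RAW_SEARCH_BASIS.get(str(code).strip(), (None, None))
```

Note that contraband stays NA for codes 5 to 9, matching the point above that NA is not cast to false for non-discretionary searches.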

2023 update:

District and county codes have been corrected.

Stockton, CA

2012-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
division	99.6%
subject_age	99.4%
subject_race	99.5%
subject_sex	99.6%
officer_id	54.9%
officer_id_hash	54.9%
type	100.0%
arrest_made	99.7%
citation_issued	99.7%
warning_issued	99.7%
outcome	99.5%
search_conducted	100.0%
search_basis	100.0%
reason_for_stop	99.6%
raw_result	99.7%
raw_search	99.7%

Data notes:

Data consists of two sets of files, traffic stop surveys and CAD stop files, but currently there is no information on how to join them; location is in the stop files, but all other demographic information is in the traffic stop survey files
There may be duplicates, but unclear how to identify them, as date, age, gender, and race are the only consistently filled in fields, and the maximum number of stops for any date, age, gender, race combination is 10, which is a reasonable number of stops for that combination over the course of a day in the entire city occasionally
officer_id is coalesced officer_id and officer_id2, the former being 90% null and the latter 50% null in the dataset
Outcomes are based on raw column result, which is passed through
search_conducted and search_basis are derived from the raw column search, which is passed through; where SEARCH was NA, search_conducted was set to false, under the assumption that sometimes officers don't record the absence of a search
2012 has suspiciously little data
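
Coalescing two sparsely filled columns, as described for officer_id above, means taking the first non-missing value. A small sketch follows; preferring officer_id over officer_id2 when both are present, and treating empty strings as missing, are assumptions.

```python
def coalesce(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Hypothetical rows with the two raw officer columns.
rows = [
    {"officer_id": None, "officer_id2": "4471"},
    {"officer_id": "0812", "officer_id2": "4471"},
    {"officer_id": None, "officer_id2": None},
]
officer_ids = [coalesce(r["officer_id"], r["officer_id2"]) for r in rows]
```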

Aurora, CO

2012-01-01 to 2020-07-31

feature	coverage rate
date	100.0%
time	99.5%
location	100.0%
lat	96.9%
lng	96.9%
geocode_source	96.9%
subject_age	64.9%
subject_dob	64.8%
subject_race	99.9%
subject_sex	99.2%
subject_first_name	67.1%
subject_last_name	67.1%
type	95.7%
violation	98.2%
citation_issued	100.0%
outcome	100.0%
raw_ethnicity	8.7%
raw_race	100.0%

Data notes:

Data is deduplicated on raw columns Ticket Date, Ticket Time, Ticket Location, First Name, Last Name, sex, and Date of Birth, reducing the number of records by ~1.0%
subject_race was based on Race and Ethnicity in the raw data, which are passed through

2023 update:

84,677 new stops added.

Denver, CO

2010-12-31 to 2018-07-19

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	100.0%
lng	100.0%
district	100.0%
precinct	100.0%
type	100.0%
disposition	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	54.8%

Data notes:

MASTER_INCIDENT_NUMBER has many duplicates, but it's unclear what it corresponds to or how to deduplicate it if that is the correct thing to do, since the records are nearly identical except for the NEIGHBORHOOD_NAME
Data does not contain subject demographic or search/contraband information

Statewide, CO

2010-01-01 to 2017-12-31

feature	coverage rate
date	100.0%
time	0.0%
location	100.0%
county_name	100.0%
subject_age	72.4%
subject_dob	70.3%
subject_race	87.1%
subject_sex	71.2%
officer_id	83.0%
officer_id_hash	83.0%
officer_sex	31.8%
officer_first_name	82.9%
officer_last_name	82.9%
type	100.0%
violation	83.9%
arrest_made	54.1%
citation_issued	54.1%
warning_issued	54.1%
outcome	41.6%
contraband_found	100.0%
search_conducted	92.6%
search_basis	98.0%
raw_Ethnicity	87.8%

Data notes:

The state did not provide us with mappings for every police department code to police department name.
Arrest and citation data are unreliable from 2014 onward. Arrest rates drop essentially to zero.
Counties were mapped using a dictionary provided by the agency. Denver County has many fewer stops than expected given the residential population; this is because it only contains a small section of highway which is policed by the state patrol.
Rows in raw data represent violations, not stops, so we remove duplicates by grouping by the other fields.
subject_race was mapped from raw_Ethnicity.
Note that data from 2016 came with about 80 fewer columns than the data pre-2016 and after 2016, so many values for that year will be NA, including search data (see below for details).
The data came in three files, the first covered 2010-March 2016; this has full data. The second covered Jan-Dec 2016; this was missing many columns, including whether a search was conducted. The third data file covered Jan-Dec 2017 and had full data. In order to preserve as much search data as possible we use the second file with missing data only to fill in the nine months of April-Dec 2016. This, in particular, affects the marijuana analysis search rate time series.
Additional columns in the raw data that may be of interest: MMJCard, DUIDType, NonUS, NonUSDL, NonUSDLLocation, DLCheck, TrafficAccident, AccidentSeverity (0-4), DUIArrest, HVPTCitation, SeatBeltCitation, FelonyArrest, Misdemeanors, Felonies (count), VehicleInspected, Recoveries, TrafficOral, AssistOral, AllOtherOral, GrantCategory, GrantLabel, Assists, AssistsMultiple, AssistsCount, ContrabandCharge (petty offense, felony, misdemeanor, none, traffic), Warrant, MisdemeanorOrFelony (M or F)
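
The file-combination rule described above (use the reduced second file only for April through December 2016) can be sketched as a simple row filter. The file labels are hypothetical placeholders for the three deliveries.

```python
from datetime import date

# Hypothetical labels for the three raw files described above.
FIRST, SECOND, THIRD = "first", "second", "third"

# The second file is used only to fill the months the first file lacks.
GAP_START = date(2016, 4, 1)
GAP_END = date(2016, 12, 31)


def keep_row(source_file, stop_date):
    """Decide whether a row from a given file belongs in the combined data."""
    if source_file == SECOND:
        return GAP_START <= stop_date <= GAP_END
    # Rows from the first and third files are always kept.
    return True
```

With this rule, January through March 2016 come from the first file (with search data), and April through December 2016 come from the second file (without it).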

Statewide, CT

2013-10-01 to 2015-10-01

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	21.5%
lng	21.5%
county_name	100.0%
subject_age	99.7%
subject_race	100.0%
subject_sex	100.0%
officer_id	89.0%
officer_id_hash	89.0%
department_name	100.0%
type	98.2%
violation	99.9%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	98.4%
contraband_found	100.0%
search_conducted	100.0%
search_vehicle	100.0%
search_basis	95.0%
reason_for_stop	50.4%
raw_SubjectRaceCode	100.0%
raw_SubjectEthnicityCode	100.0%
raw_SearchAuthorizationCode	100.0%

Data notes:

Counties were mapped by running the cities in the Intervention Location Name field through Google's geocoder.
Rows appear to represent violations, not individual stops, because a small proportion of rows (1%) report the same officer making multiple stops at the same location at the same time. We grouped the data to combine these duplicates. We don't want to be overly aggressive in grouping together stops, so we only group if the other fields are the same.
While there is some search type data, a high fraction of searches are marked as "Other".
While there is some violation data, too much is missing.
Race (raw_SubjectRaceCode, raw_SubjectEthnicityCode) mapping:
- A = Asian/Pacific Islander
- B = Black
- H = Hispanic
- W = White
- I = Native American
Search basis (raw_SearchAuthorizationCode) mapping:
- C = consent
- O = probable cause
- I = inventory
The Connecticut state patrol created another website (link), where new data will get uploaded going forward. We haven't processed this yet.

Hartford, CT

2013-10-13 to 2016-09-29

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	98.9%
lng	98.9%
district	93.3%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	86.3%
contraband_found	99.9%
search_conducted	100.0%
search_vehicle	100.0%
search_basis	99.9%
reason_for_stop	100.0%
raw_subject_race_code	100.0%
raw_subject_ethnicity_code	100.0%
raw_search_authorization_code	100.0%
raw_intervention_disposition_code	100.0%

Data notes:

Data is deduplicated on raw columns InterventionDateTime, ReportingOfficerIdentificationID, InterventionLocationDescriptionText, SubjectRaceCode, SubjectSexCode, and SubjectAge, reducing the number of rows by ~1.1%
search rate is suspiciously high, ~28%
hit rate is suspiciously low, ~1%; we exclude Hartford from outcome and threshold tests because contraband recovered is so suspiciously low that we don't trust it, plus it's so low that it's not even enough data to run the statistical tests reliably.
subject_race is based on SubjectEthnicityCode and SubjectRaceCode, which are passed through as raw_subject_ethnicity_code and raw_subject_race_code
search_conducted and search_basis are derived from SearchAuthorizationCode, which is passed through as raw_search_authorization_code
outcomes are based on InterventionDispositionCode, which is passed through as raw_intervention_disposition_code
2013 and 2016 have only partial data

Tampa, FL

1973-06-21 to 2018-03-07

feature	coverage rate
date	100.0%
subject_age	99.8%
subject_dob	99.7%
subject_race	100.0%
subject_sex	100.0%
officer_first_name	93.3%
officer_last_name	93.3%
department_name	100.0%
type	100.0%
violation	100.0%
citation_issued	100.0%
outcome	100.0%
vehicle_registration_state	97.6%
raw_race	100.0%

Data notes:

Data is deduplicated on date, subject_race, subject_dob, officer_last_name, officer_first_name, and Driver License Number, reducing the number of rows by ~13.2%; it's possible this slightly over-deduplicates, if an officer pulls over the same person in the same day
Data is missing search and contraband information, as well as outcomes other than citations
Hispanic race data is likely underreported, given that ACS 2017 5-year estimates suggest Hispanic individuals make up ~25% of the population, but only ~4% of stops in Tampa
The data sources are public (it's unclear what the difference is between the stop types):
- https://publicrec.hillsclerk.com/Traffic/Civil_Traffic_Name_Index_files/
- https://publicrec.hillsclerk.com/Traffic/Criminal_Traffic_Name_Index_files/
subject_race is based on Race which is passed through as raw_race

Saint Petersburg, FL

2010-01-01 to 2010-07-29

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	89.7%
lng	89.7%
district	98.4%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%

Data notes:

Only 7 months of data provided
No demographic, search/contraband, or outcome data

Statewide, FL

2010-01-01 to 2018-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	16.9%
county_name	100.0%
subject_age	56.1%
subject_race	100.0%
subject_sex	56.3%
officer_id	100.0%
officer_id_hash	100.0%
officer_age	60.7%
officer_race	57.5%
officer_sex	64.1%
officer_last_name	56.3%
officer_years_of_service	68.5%
department_name	56.3%
unit	74.2%
type	100.0%
violation	94.2%
arrest_made	94.2%
citation_issued	98.0%
warning_issued	95.7%
outcome	76.4%
frisk_performed	94.2%
search_conducted	68.5%
search_basis	100.0%
reason_for_search	100.0%
reason_for_stop	94.2%
vehicle_registration_state	56.0%
notes	69.6%
raw_row_number_old	56.3%
raw_Race	100.0%
raw_Ethnicity	11.6%
raw_row_number_new	68.5%
raw_SearchType	94.2%
raw_EnforcementAction	94.2%

Data notes:

The raw data is very messy. Two different data sets were supplied, both with slightly different schemas, just for 2010 to part of 2016. A third dataset was supplied for 2016 through 2018. However, they were joined by uniquely identifying features. The second data dump goes until 2016, while the first only goes until 2015. The fields missing in the second or third data sets are thus missing for some rows.
There are many duplicates in the raw data, which we remove in two stages. First, we remove identical duplicate rows. Second, we group together rows which correspond to the same stop but to different violations or passengers.
The original data has a few parsing errors, but they don't seem important as they are spurious new lines in the last 'Comments' field.
The Florida PD clarified to us that both UCC Issued and DVER Issued in the raw_EnforcementAction column indicated citations, and we consequently coded them as such.
subject_race was mapped from raw_Ethnicity and raw_Race (the different data sets have different practices in terms of recording Hispanic in race vs ethnicity fields).
raw_SearchType was used to conclude search_conducted and search_basis.
While there is some data on whether items were seized, it is not clear if these are generally seized as a result of a search, and we thus do not define a contraband_found column for consistency with other states.
raw_EnforcementAction and notes were used to determine outcome.
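
The two-stage duplicate removal described above can be sketched as follows: first drop rows that are identical in every field, then merge rows that share a stop key while collecting their distinct violations. The stop key columns and the sample rows are assumptions for illustration only.

```python
def drop_identical_rows(rows):
    """Stage 1: drop rows that match an earlier row in every field."""
    seen, kept = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)
    return kept


def collapse_to_stops(rows, stop_key, violation_column="violation"):
    """Stage 2: merge rows sharing stop_key, collecting distinct violations."""
    stops = {}
    for row in rows:
        key = tuple(row.get(col) for col in stop_key)
        if key not in stops:
            stops[key] = {col: row.get(col) for col in stop_key}
            stops[key]["violations"] = []
        violation = row.get(violation_column)
        if violation is not None and violation not in stops[key]["violations"]:
            stops[key]["violations"].append(violation)
    return list(stops.values())


# Assumed stop key; the real grouping columns are not listed in this readme.
STOP_KEY = ["date", "time", "officer_id", "county_name"]
base = {"date": "2014-05-01", "time": "08:30", "officer_id": "A1",
        "county_name": "Leon", "violation": "speeding"}
raw_rows = [dict(base), dict(base), dict(base, violation="no seat belt")]

stops = collapse_to_stops(drop_identical_rows(raw_rows), STOP_KEY)
```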

Statewide, GA

2012-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	34.9%
lat	99.1%
lng	99.1%
county_name	100.0%
subject_race	52.9%
subject_sex	96.1%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
violation	100.0%
outcome	100.0%
vehicle_color	98.1%
vehicle_make	99.0%
vehicle_model	96.0%
vehicle_year	95.2%
raw_race	52.9%

Data notes:

The data represent warnings.
The provided .txt was comma-separated, but not quoted. Therefore we had to write a script (convert_GA.py) to iron out some obviously misaligned columns.
Rows represent individual warnings, and thus need to be aggregated to represent a single stop.
The race field on the warnings form is optional; we have only about 50% race coverage, so GA is omitted from all analyses.
subject_race was mapped from raw_race.

Statewide, IA

2006-01-01 to 2016-04-25

feature	coverage rate
date	100.0%
time	84.7%
location	89.2%
county_name	4.6%
subject_age	39.3%
subject_race	26.0%
subject_sex	39.2%
officer_id	57.8%
officer_id_hash	57.8%
department_name	57.8%
type	85.2%
violation	92.5%
citation_issued	84.4%
warning_issued	84.4%
outcome	84.4%
vehicle_color	53.8%
vehicle_make	54.8%
vehicle_model	52.7%
vehicle_registration_state	38.7%
vehicle_year	38.3%

Data notes:

The data separates warnings and citations. They are very different with respect to which fields they have available. Both contain duplicates. This happens when individuals receive more than one warning or citation within the same stop. We remove these by grouping the remaining fields by the stop key and date.
In some cases, there are multiple time stamps per unique (key, date) combination. In most of these cases, the timestamps differ by a few minutes, but all other fields (except for violation) are the same. In 0.1% of stops, the max span between timestamps is more than 60 minutes. In those cases it looks like the same officer stopped the same individual more than once in the same day.
Only citations have Ethnicity, which only provides information on whether the driver is Hispanic. We therefore exclude Iowa from our main analysis because race data is lacking.
Only (some) citations have county, the warnings only have trooper district. The mapping for the districts is provided in the resources folder. Counties were mapped by comparing the identifiers in the LOCKCOUNTY field with the cities in the LOCKCITY field.
The codes in the county field represent counties ordered alphabetically.
Additional columns in the raw data that may be of interest: EQUIPVIOL (free field -- usually a two-digit code, but some text descriptions of the violation), SCHEDULEDFINE, SURCHARGE, TOTALCOST, along with a bunch of columns that are >=99.9% NA, many of which have prefix "DISP" and pertain to data that would happen post-arrest.
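
To check how far apart the timestamps within one (key, date) combination are, one can compute the span between the earliest and latest timestamp per group. The sketch below assumes each row has hypothetical `key`, `date` (YYYY-MM-DD), and `time` (HH:MM) fields; the rows are made up.

```python
from collections import defaultdict
from datetime import datetime


def timestamp_spans(rows):
    """Map (key, date) to the minutes between earliest and latest timestamp."""
    grouped = defaultdict(list)
    for row in rows:
        stamp = datetime.strptime(f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M")
        grouped[(row["key"], row["date"])].append(stamp)
    return {
        group: (max(stamps) - min(stamps)).total_seconds() / 60
        for group, stamps in grouped.items()
    }


rows = [
    {"key": "A", "date": "2012-06-01", "time": "10:00"},
    {"key": "A", "date": "2012-06-01", "time": "10:05"},
    {"key": "B", "date": "2012-06-01", "time": "09:00"},
    {"key": "B", "date": "2012-06-01", "time": "11:30"},
]
spans = timestamp_spans(rows)
# Groups spanning more than 60 minutes may be separate stops of the same person.
long_spans = [group for group, minutes in spans.items() if minutes > 60]
```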

Idaho Falls, ID

2008-08-13 to 2016-07-25

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	84.3%
lng	84.3%
neighborhood	93.9%
division	59.8%
subdivision	100.0%
zone	92.3%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%
disposition	31.8%

Data notes:

Race and gender are not on the ID driver's license and are filled in only rarely; subject age is also 100% null
There is 'reptspec' data, but the values are extremely vague, e.g. "PAST", "SATURATION", "PERSON", "OTHER AGENCY"
There are 6 more months of data unprocessed with the main files since they are of a completely different format, but are available upon request
The data is missing demographic information as well as search/contraband information
It's unclear whether there are duplicates, since officerid is 0 sometimes and there is no demographic information

Statewide, IL

2012-01-01 to 2017-12-31

feature	coverage rate
date	100.0%
time	99.9%
location	99.0%
beat	98.2%
subject_age	99.9%
subject_yob	100.0%
subject_race	100.0%
subject_sex	100.0%
department_id	100.0%
department_name	100.0%
type	100.0%
violation	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
contraband_found	99.2%
contraband_drugs	100.0%
contraband_weapons	100.0%
search_conducted	99.9%
search_person	99.9%
search_vehicle	99.9%
search_basis	32.7%
reason_for_stop	100.0%
vehicle_make	97.4%
vehicle_year	99.9%
raw_DriverRace	100.0%
raw_ReasonForStop	100.0%
raw_TypeOfMovingViolation	100.0%
raw_ResultOfStop	100.0%

Data notes:

The data is very messy. The presence and meaning of fields relating to search and contraband vary year by year. Caution should be used when inspecting search and hit rates over time. We exclude Illinois from our time trend marijuana analysis for this reason.
We only process statewide data from 2012 to 2017. We received data back to 2004, but chose not to process it due to format issues and relevance.
For state patrol stops, there is mostly no information on the county of the stop. Instead, stops are mapped to districts (see the district column), which have a one-to-many relationship with counties. See the relevant map here. There is one district (#15) with a lot of stops that does not directly map to counties, as it refers to stops made on the Chicago tollways. We use districts in our analysis.
Counties for local stops could be mapped by running the police departments in the AgencyName field through Google's geocoder.
The search_type_raw field is occasionally "Consent search denied", when a search was conducted. This occurs because the search request might be denied but a search was conducted anyway. Many searches have missing search type data, so we do not rely on search_basis when analyzing Illinois searches.
Race (raw_DriverRace) mapping:
- 1 = White
- 2 = Black
- 3 = American Indian or Alaska Native
- 4 = Hispanic
- 5 = Asian
- 6 = Native Hawaiian or Other Pacific Islander
Outcome (raw_ResultOfStop) mapping:
- 1 = Citation
- 2 = Written Warning
- 3 = Verbal Warning (stop card)
We also pull through raw columns raw_ReasonForStop and raw_TypeOfMovingViolation to populate the reason_for_stop and violation columns in the clean data. We received dictionaries to help do so.
Note that IL contains state patrol and municipal police departments, but we use only the state patrol data in our analysis. There are occasional issues with some of the municipal P.D. data to watch out for: for example, the search and contraband data is fairly detailed and robust, except for Chicago Police, which has lots of NAs for search info (in 2012-2013) and lots of NAs for contraband info (in 2014). We do not alter these NA values, but recommend looking more closely into the Chicago city data (see below) rather than using the data given to us through the state records request.
Additional columns in the raw data that may be of interest: IL has really detailed search/contraband information. There are about 40 raw columns with search/contraband info; they fall into four categories Vehicle*, Driver*, Passenger*, and PoliceDog*, where * delineates things like what type of contraband was found or how much contraband was found, whether consent was requested, whether consent was given, who performed the search, etc.

Chicago, IL

2012-01-01 to 2020-05-16

feature	coverage rate
date	100.0%
time	100.0%
location	96.4%
lat	94.9%
lng	94.9%
geocode_source	94.9%
beat	59.9%
district	59.9%
subject_age	69.8%
subject_race	68.5%
subject_sex	100.0%
officer_id	7.2%
officer_id_hash	7.2%
officer_age	2.8%
officer_race	37.3%
officer_sex	37.3%
officer_first_name	37.3%
officer_last_name	37.3%
officer_years_of_service	37.2%
unit	59.9%
type	100.0%
violation	100.0%
arrest_made	10.1%
citation_issued	30.4%
outcome	40.1%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
contraband_alcohol	100.0%
contraband_other	0.6%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
vehicle_make	59.9%
vehicle_model	59.9%
raw_race	69.9%
raw_driver_race	0.6%

Data notes:

Dataset is created by joining arrests and citations on date, hour, officer name, and location
There may be duplicates, but there is often insufficient information to deduplicate, i.e. the time resolution is hourly, driver_race is null 99% of the time, and officer ID appears to be only a prefix of the full ID ~0.3% of the time
Data includes citations and arrests, but is missing warnings
violation represents statute_description in the raw data
subject_race is based on raw columns race and driver_race, which are passed through

2023 update:

1,261,642 new stops added.

Fort Wayne, IN

2007-09-01 to 2017-09-30

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	97.0%
lng	97.0%
officer_first_name	99.8%
officer_last_name	99.9%
type	100.0%
disposition	99.4%
arrest_made	99.4%
citation_issued	99.4%
warning_issued	99.4%
outcome	66.3%

Data notes:

Roster.csv (police officer info) is available in raw data, but doesn't join cleanly to stops data; first names are often truncated and nicknames are used, i.e. Manny vs Manuel; it can be loaded and reviewed upon request.
Data is missing search/contraband information, as well as demographic information
disposition represents Description in the raw data; outcomes are derived from this column

Wichita, KS

2006-01-01 to 2020-05-15

feature	coverage rate
date	100.0%
time	100.0%
location	97.4%
lat	97.3%
lng	97.3%
geocode_source	97.3%
subject_age	81.4%
subject_race	93.4%
subject_sex	78.7%
subject_first_name	90.0%
subject_last_name	90.7%
officer_first_name	99.5%
officer_last_name	99.5%
type	100.0%
disposition	96.5%
violation	100.0%
citation_issued	100.0%
outcome	100.0%
posted_speed	31.2%
vehicle_color	94.2%
vehicle_make	95.5%
vehicle_model	49.5%
vehicle_year	27.4%
raw_defendant_race	93.4%
raw_defendant_ethnicity	62.2%

Data notes:

Data is deduplicated on raw columns citation_date_time, citation_location, defendant_first_name, defendant_last_name, defendant_age, defendant_sex, and defendant_race, resulting in ~4.1% fewer records
Data is missing search/contraband fields
citation_number in the raw data doesn't appear to be unique, e.g. citation "07M000645" is associated with two different dates, locations, and people
Only citations are included
violation represents charge_description in the raw data
disposition represents charge_disposition in the raw data
subject_race is based on the raw columns defendant_ethnicity and defendant_race, which are passed through

2023 update:

100,660 new stops added.

Louisville, KY

2015-01-01 to 2020-06-16

feature	coverage rate
date	100.0%
time	100.0%
location	99.9%
lat	92.2%
lng	92.2%
geocode_source	92.2%
beat	96.0%
division	96.0%
subject_age	76.4%
subject_race	99.9%
subject_sex	100.0%
officer_race	99.9%
officer_sex	99.9%
type	100.0%
violation	76.5%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_basis	100.0%
reason_for_search	98.9%
raw_activity_division	95.9%
raw_division	72.4%
raw_activity_beat	95.9%
raw_beat	72.5%
raw_driver_race	99.9%
raw_persons_race	76.3%
raw_persons_ethnicity	73.8%
raw_driver_age_range	100.0%
raw_was_vehcile_searched	100.0%
raw_citation_location	76.5%

Data notes:

While we have raw csvs for all citations, we keep only those records that join onto the stops data; the source of this data is here: https://data.louisvilleky.gov/dataset/uniform-citation-data
Data is deduplicated on raw columns officer_gender, officer_race, officer_age_range, activity_date, activity_time, activity_location, activity_division, division, activity_beat, beat, driver_gender, persons_sex, driver_race, persons_race, persons_ethnicity, driver_age_range, person_age, persons_home_city, persons_home_state, person_home_zip, reducing the number of rows by ~%
subject_race is based on the raw column driver_race, since it is null 0.03% of the time compared to 18.6% for persons_race and 18.60% for persons_ethnicity; all are passed through with raw_ prefix
violation represents raw column charge_desc
All stops are not null for at least one of the driver_* columns or number_of_passengers or was_vehicle_searched columns, implying all stops are vehicular
location used for geocoding is activity_location, which had a lower null rate than citation_location, but the latter was passed through as raw_citation_location
subject_age is based on persons_age from the citation data, although it is null more often than driver_age_range; the latter, however, only gives a range, so couldn't be used for this column; it is passed through though as raw_driver_age_range
search_conducted is based on was_vehcile_searched, which is passed through as raw_was_vehcile_searched (sic); there were 3 NAs that were coerced to false under the assumption that the officers may simply not have recorded the absence of a search
search_basis was based on reason_for_search; k9 searches matched the pattern "K9|K-9|DOG", plain view searches matched anything mentioning plain view/smell or anything that could be seen in plain sight and matched the following pattern "BAGGIES|DRUGS|GUN|MARIJUANA|ODOR|PILLS|PIPE|PLAIN VIEW|SMELL", consent matched "CONSENT|CONSE", probable cause matched "PROB|P/C|PC|P.C.", and everything else was classified as "other"; this was verified to be accurate for 99.8% of entries; the long tail was not checked, but anyone viewing the data can see the original values in the reason_for_search column
data is lacking explicit contraband information, but some of this can be inferred from reason_for_search
frisk_performed is true when reason_for_search matches the pattern "TERRY|PAT"; it is false otherwise (NA and no match)
2018 has data only from January
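
The pattern-based classification of reason_for_search described above can be sketched with regular expressions. The patterns are copied from the notes; the order in which they are tried, uppercasing the text before matching, and returning NA for a missing reason are assumptions.

```python
import re

# Patterns are quoted from the notes above; the precedence order is assumed.
SEARCH_BASIS_PATTERNS = [
    ("k9", re.compile(r"K9|K-9|DOG")),
    ("plain view", re.compile(
        r"BAGGIES|DRUGS|GUN|MARIJUANA|ODOR|PILLS|PIPE|PLAIN VIEW|SMELL")),
    ("consent", re.compile(r"CONSENT|CONSE")),
    ("probable cause", re.compile(r"PROB|P/C|PC|P.C.")),
]
FRISK_PATTERN = re.compile(r"TERRY|PAT")


def classify_search_basis(reason):
    """Return the first matching category, "other" if none match, None for NA."""
    if reason is None:
        return None
    text = reason.upper()
    for label, pattern in SEARCH_BASIS_PATTERNS:
        if pattern.search(text):
            return label
    return "other"


def frisk_performed(reason):
    """True when the reason mentions TERRY or PAT; False for NA or no match."""
    return reason is not None and bool(FRISK_PATTERN.search(reason.upper()))
```

Because these are substring patterns, they can over-match (for example, "PAT" also matches "PATROL"), which is one reason the original reason_for_search values remain worth reviewing.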

2023 update:

35,601 new stops added.

Owensboro, KY

2015-09-01 to 2017-09-01

feature	coverage rate
date	100.0%
time	100.0%
location	99.9%
lat	100.0%
lng	99.9%
sector	99.9%
subject_age	100.0%
subject_dob	99.9%
subject_race	99.7%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
type	99.4%
violation	99.9%
arrest_made	100.0%
citation_issued	100.0%
outcome	100.0%
vehicle_registration_state	99.2%
raw_race	99.7%

Data notes:

There is a list_of_officers.csv as well as the excel spreadsheet (preferable given the formatting) that have more officer information available upon request
Data is missing search/contraband information
Data is all citations, although it appears to include an arrest indicator as well, when that also occurred
Provided longitude is lacking the negative sign, which we add (without which all points are in central China)
subject race is based on RACE in the raw data and passed through as raw_race; data does not include Hispanic.
violation is a concatenation of Violation Description X where X is 1 to 9
type is based on Violation Description 1
2015 and 2017 only have data for part of the year
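
Two of the transformations above are simple enough to sketch directly: restoring the missing negative sign on longitude, and concatenating the nine violation description columns. The "|" separator and the helper names are assumptions.

```python
def fix_longitude(lng):
    """Force longitude to be negative (western hemisphere); pass NA through."""
    if lng is None:
        return None
    return -abs(lng)


def combine_violations(row, separator="|"):
    """Join the non-empty 'Violation Description 1' .. '9' values in order."""
    parts = [row.get(f"Violation Description {i}") for i in range(1, 10)]
    return separator.join(part for part in parts if part)


# Hypothetical raw row.
row = {
    "Violation Description 1": "SPEEDING",
    "Violation Description 2": "NO INSURANCE",
    "Violation Description 3": None,
}
violation = combine_violations(row)
```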

New Orleans, LA

2010-01-01 to 2018-07-18

feature	coverage rate
date	100.0%
time	100.0%
location	81.3%
lat	50.8%
lng	50.8%
district	100.0%
zone	100.0%
subject_age	97.5%
subject_race	97.7%
subject_sex	97.7%
officer_assignment	100.0%
type	70.7%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	65.5%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	100.0%
reason_for_stop	100.0%
vehicle_color	53.3%
vehicle_make	54.0%
vehicle_model	50.6%
vehicle_year	53.1%
raw_actions_taken	76.1%
raw_subject_race	97.7%

Data notes:

Data is deduplicated on EventDate, BlockAddress, and SubjectID, which reduces the number of rows by ~0.07%
Addresses were partially anonymized by the department replacing the last two numbers of the address number with XX; these were replaced with 00 so we could at least geocode the block level address
search_conducted is true when the ActionsTaken includes "Search Occurred: Yes", and it's false when that is not present or the ActionsTaken column is NA, under the assumption that NA is equivalent to "Stop Results: No action taken"
reason_for_stop is StopDescription in the raw data; type is based on this column
outcomes, search, and contraband fields are all based on the ActionsTaken column, which is passed through as raw_actions_taken; NA in this column is assumed to be 'no actions taken'
subject_race is based on SubjectRace raw column, which is passed through as raw_subject_race
data before 2010 is sparse and unreliable so it is removed from the clean dataset
2018 only has partial data
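
Two of the rules above are simple enough to sketch in Python. The helper names are hypothetical, and the address format (a leading house number ending in XX) is an assumption about how the department's anonymization appears in BlockAddress.

```python
import re


def deanonymize_block(address):
    """Replace a trailing XX in the leading house number with 00."""
    if address is None:
        return None
    # Assumes the masked house number is the first token of the address.
    return re.sub(r"^(\d*)XX\b", r"\g<1>00", address)


def search_conducted(actions_taken):
    """True only when ActionsTaken contains "Search Occurred: Yes"."""
    if actions_taken is None:  # NA is treated as "no action taken"
        return False
    return "Search Occurred: Yes" in actions_taken
```

The result of deanonymize_block is only precise to the block, which matches the note that geocoding is done at the block level.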

Statewide, MA

2007-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
location	99.8%
county_name	99.8%
subject_age	95.4%
subject_race	100.0%
subject_sex	99.5%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	99.8%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
contraband_alcohol	100.0%
contraband_other	1.6%
frisk_performed	1.6%
search_conducted	100.0%
search_basis	91.6%
reason_for_stop	51.4%
vehicle_type	99.9%
vehicle_registration_state	99.7%
raw_Race	100.0%

Data notes:

The search and outcome fields are inconsistent. We take the most progressive interpretation: if one of SearchYN, SearchDescr or the outcome columns indicates that there was a search, we label them as such.
While we define a contraband_found column in case it is useful to other researchers, it is sufficiently messy (there are multiple ways you might define contraband_found, and they are quite inconsistent) that we exclude it from our contraband analysis.
In <1% of the data, RsltSrchNo and RsltSrch<contraband type> conflict. In these cases, we use the value from RsltSrchNo.
Violation data is not very granular.
Counties were mapped by running the cities in the CITY_TOWN_NAME field through Google's geocoder.
There are only a handful of stops in the data before 2007; we drop those years as they are clearly unreliable. It appears that the first few months (nearly half) of 2007 are also incomplete, but we have not attempted to remove the incomplete months.
subject_race was mapped from raw_Race
Additional columns in the raw data that may be of interest: SpecialEvent (GHSB Speed Detail, Road Block, Blue Blitz; 99% NA), PlateReader (boolean), OwnTruckPass (O, W, T, P)

Baltimore, MD

2011-01-01 to 2017-12-30

feature	coverage rate
date	100.0%
time	98.7%
beat	63.1%
district	60.1%
officer_id	89.9%
officer_id_hash	89.9%
type	97.2%
citation_issued	100.0%
outcome	100.0%

Data notes:

Data is missing search/contraband information as well as demographic information and outcomes other than citations
The primary key seems to be a combination of Ticket and Citation Number; when Ticket is null, Citation Number isn't and vice versa; both are duplicated across rows, so we deduplicate on those two IDs coalesced, resulting in ~0.01% fewer records
Data lacks translations for Ordinance Code and Citation Type
Violation data is almost all null
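
The deduplication on the coalesced IDs can be sketched as below. This is an illustration under assumptions: rows are plain dictionaries keyed by the raw column names, the first row seen for each ID is the one kept, and Ticket and Citation Number values are assumed not to collide with each other.

```python
def dedupe_on_coalesced_id(rows):
    """Keep the first row for each coalesced Ticket / Citation Number value."""
    seen = set()
    kept = []
    for row in rows:
        ticket = row.get("Ticket")
        # Coalesce: use Ticket when present, otherwise Citation Number.
        key = ticket if ticket is not None else row.get("Citation Number")
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Comparing len(rows) with the length of the result gives the share of records removed.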

Statewide, MD

2007-01-01 to 2014-03-31

feature	coverage rate
date	97.8%
time	23.0%
location	23.0%
subject_age	22.9%
subject_dob	22.9%
subject_race	99.6%
subject_sex	98.8%
department_name	100.0%
type	100.0%
disposition	1.9%
violation	0.1%
arrest_made	94.0%
citation_issued	93.2%
warning_issued	93.2%
outcome	78.7%
contraband_found	82.0%
contraband_drugs	99.6%
contraband_weapons	98.5%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	14.8%
reason_for_arrest	87.9%
reason_for_search	98.5%
reason_for_stop	99.2%
raw_Race	99.6%
raw_Outcome	93.2%
raw_Arrest_Made	28.8%

Data notes:

The data is very messy. It comes from three different time periods: 2007, 2009-2012, 2013-2014. They all have different columns and slightly different conventions of how things are recorded. We attempted to standardize the fields as much as possible.
Time resolution of the data varies by year. Prior to 2013, data is reported annually. From 2013 onward, data is reported daily. So stop dates prior to 2013 are not precise to the nearest day and are just reported as Jan 1.
Counties could theoretically be mapped by running the police departments in the Agency field through Google's geocoder, but this does not work for state patrol stops, for which we have no county information. Maryland's data is not good enough for us to include in our analysis, so we chose not to do this.
subject_race is mapped from raw_Race.
outcome and arrest_made are mapped from raw_Outcome and raw_Arrest_Made; see processing script for details.
search_basis is a cleaned up version of reason_for_search which is a free field populated by raw column Search Reason.
Prior to 2013, there are quite a few NAs for contraband; we do not cast these to false because it seems to be too many to assume they're all false -- it feels more believable that there is actual missing data in these annually reported, messy datasets.
Additional columns from the raw data that may be of interest: Duration of Search

Statewide, MI

2001-07-06 to 2016-05-09

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
county_name	100.0%
subject_race	97.8%
officer_id	100.0%
officer_id_hash	100.0%
department_id	100.0%
department_name	100.0%
type	100.0%
violation	99.9%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
reason_for_stop	100.0%
speed	30.5%
posted_speed	30.5%
charged_speed	91.7%
raw_Race	97.8%

Data notes:

The original data had some unquoted fields (VoidReason and Description) which had commas in them. We manually fixed these with a python script, which can be found in the /scripts folder.
Driver race data has more than 50% missing data, so we excluded Michigan from the analysis in the paper.
The codes in the CountyCode field represent counties ordered alphabetically.
Rows represent violations, not stops, so we remove duplicates by grouping by the other fields.
Michigan data has loads of additional columns. A cluster we find very interesting is SpeedPosted (pulled through as posted_speed), SpeedDetected (pulled through as speed), and SpeedCharged (pulled through as charged_speed). Most places with speeding information give just speed and posted speed; analyses like the bunching analysis try to infer the true speed, and whether drivers of different races were discounted at different rates. Michigan's transparency about discounting (from detected speed to charged speed) could make this process much easier to analyze. However, we do not do so because race information is insufficient.
Since all rows have a TicketNum, we assume that if any ticket is not a warning, then it is a citation. But then potentially for outcome, anything that is not an arrest or warning could have a court summons. It's possible raw data columns Felony, Misdemeanor, CivilInfraction could help disambiguate.
Additional raw data columns that may be of interest: Michigan has over 160 columns in the raw data, though many of them are >99.9% NA. There are ID columns for everything from violation codes, citation codes, infraction codes, incident numbers, court code, etc. Other columns: VehicleImpounded, Injury, Felony, Misdemeanor, CivilInfraction.
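
Collapsing violation-level rows into stops, as described above, can be sketched as follows. This is a minimal illustration, not the project's processing script: rows are assumed to be dictionaries with hashable values, and joining the violations with "|" is an assumption, since the notes only say that duplicates are removed by grouping on the other fields.

```python
def collapse_violations(rows, violation_field="violation", sep="|"):
    """Merge rows that agree on every field except the violation field."""
    stops = {}
    for row in rows:
        # Everything except the violation identifies the stop.
        key = tuple(sorted((k, v) for k, v in row.items() if k != violation_field))
        stops.setdefault(key, []).append(row[violation_field])
    return [
        {**dict(key), violation_field: sep.join(v for v in violations if v is not None)}
        for key, violations in stops.items()
    ]
```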

Saint Paul, MN

2001-01-01 to 2016-12-13

feature	coverage rate
date	100.0%
time	100.0%
lat	100.0%
lng	100.0%
police_grid_number	100.0%
subject_age	13.2%
subject_race	82.5%
subject_sex	84.2%
type	100.0%
citation_issued	100.0%
outcome	13.9%
frisk_performed	100.0%
search_conducted	100.0%
search_vehicle	100.0%
raw_race_of_driver	100.0%

Data notes:

Data is deduplicated on DATE OF STOP, RACE OF DRIVER, AGE OF DRIVER, GENDER OF DRIVER, and POLICE GRID NUMBER, resulting in ~0.02% fewer records
Data is lacking contraband and location information
If a citation was not issued, it's unclear whether a warning was issued or something else
subject_race is based on RACE OF DRIVER in the raw data, which is passed through as raw_race_of_driver
search_conducted is based on VEHICLE SEARCHED?; "No Data" is assumed to be false because it is likely that "No Data" is an autofill value for NA, which we coerce to false elsewhere under the assumption that officers sometimes don't record the absence of a search; the same is done for frisk_performed

Statewide, MO

2010-01-01 to 2015-01-01

feature	coverage rate
date	100.0%
location	100.0%
subject_race	100.0%
department_name	100.0%
type	100.0%
contraband_found	100.0%
search_conducted	100.0%
raw_race	100.0%

Data notes:

The original data was aggregated. There is detail on a number of fields (age, stop purpose, outcome) that is not usable as it is not cross-tabulated with the other fields.
Because this is aggregate data, stop date is only precise to the nearest year, and is recorded as Jan 1 for all stops.
Note that the location column comes from the department's work location, which is coarse; and highway patrol stops thus all get mapped to Jefferson City.

Statewide, MS

2013-01-01 to 2016-07-27

feature	coverage rate
date	100.0%
county_name	99.4%
subject_age	100.0%
subject_dob	99.9%
subject_race	99.9%
subject_sex	99.9%
department_id	100.0%
department_name	99.2%
type	100.0%
violation	100.0%
speed	34.9%
posted_speed	34.9%
raw_race	99.9%

Data notes:

Counties were mapped using the dictionary provided, which is added to the raw data folder. Counties are numbered alphabetically.
There is no data on Hispanic drivers, so we exclude Mississippi from our main analysis.
subject_race was mapped from raw_race.
violation was populated with raw column acd.
Additional columns in the raw data that may be of interest: acdoos and aamva have alpha-numeric codes like acd (i.e., violation in the clean data), acdsev (0-3; NA for 65%), acc (boolean), court (mostly MUN or JUS), elect (E or NA), disp (G, P, D, S, N), fine.

Statewide, MT

2009-01-01 to 2017-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	99.6%
lat	100.0%
lng	100.0%
county_name	100.0%
subject_age	99.6%
subject_race	100.0%
subject_sex	100.0%
department_name	100.0%
type	100.0%
violation	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_basis	97.1%
reason_for_stop	100.0%
vehicle_make	99.5%
vehicle_model	97.3%
vehicle_type	92.7%
vehicle_registration_state	96.3%
vehicle_year	99.1%
raw_Race	100.0%
raw_Ethnicity	100.0%
raw_SearchType	100.0%
raw_search_basis	1.8%

Data notes:

subject_race was mapped from raw_Ethnicity and raw_Race.
search_conducted and search_basis were mapped from raw_SearchType.
violation is a concatenation of Violation[1-3] from the raw data.
stop_outcome is derived from raw columns EnforcementAction[1-3], see processing script for details.
reason_for_stop is populated from raw column ReasonForStop.
Additional columns in the raw data that may be of interest: VehicleIsCommercial, VehicleIsMotorcycle, ViolationDescription (which gives a bit more detail than the violation columns we pull through into the clean data), ViolationUnlawfulSpeed (boolean), AggressiveDriving (boolean), FaultyOtherDescription (free field description of equipment violations), WarningOtherViolations[1,2] (free field description of warning), WarningsThisRecord (0-3, indicating how many warnings were given), CitationsThisRecord (0-3 indicating how many citations were given), EnforcementAction[1-3] (gives slightly more detail than stop_outcome, e.g., misdemeanor arrest vs felony arrest).

2023 update:

96,109 new stops added.

Raleigh, NC

2002-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	99.9%
location	100.0%
county_name	100.0%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	97.8%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	98.4%
reason_for_frisk	0.1%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Data is pulled out of Statewide, NC data, so refer to that for processing documentation
Missing data 2/2004, 2/2005, 5/2005, 10/2005, 11/2005, 3/2006, 8/2006, 4/2007, 11/2008, 1/2009, 11/2012, 9/2013, 11/2013, 7/2014, 10/2014, 10/2015

Statewide, NC

2000-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	49.7%
location	99.7%
county_name	97.9%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	97.0%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	96.4%
reason_for_frisk	0.1%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Stop time is often unreliable — we have a large overdensity of 00:00 values, which we set to NA.
Attempting to deduplicate on StopDate, OfficerId, StopLocation, StopCity, PersonID, Age, Gender, Ethnicity, and Race reduced rows by 0%, i.e. there do not appear to be duplicates
The location of the stop is recorded in two different ways. Some stops have a county code, which can be mapped using the provided dictionary, which is included in the raw data. Other stops are only labeled with the state patrol district. Some districts map directly onto counties, in which case we label the stop with that county. However, some districts cover multiple counties. Stops in these districts can thus not be unambiguously mapped to a single county. In both cases, district of the stop is provided in the "district" column, providing coarse location data for the vast majority of stops.
Action is sometimes "No Action" or a similarly minor enforcement action even when DriverArrest or PassengerArrest is TRUE. In these cases, we set outcome to be "Arrest" because the outcome field represents the most severe outcome of the stop.
There can be multiple search bases per stop-search-person, so we collapse them into a single value
There is a 1:N correspondence between StopID and PersonID, so we filtered out passengers when joining demographic information to stop data to prevent duplicates; this also means that the demographic information pertains to the driver
When joining search data onto the stop data, the data is joined by StopID only and not also PersonID, since the person searched could be either the driver or passenger; this means that the search data may be of either the driver or the passenger, and in 3.6% of cases, it was actually the passenger who was searched, but search_conducted is true in either case; fortunately, there is a 1:1 correspondence between StopID and SearchID, as well as between SearchID and PersonID (who, again, can be either the driver or passenger) and SearchID and ContrabandID
subject_race is based on Ethnicity and Race, which are passed through as raw_*
outcomes are based on raw_action_description, which is based on the raw column Action and translated given the provided codes
frisk and search data is based on SearchID and search_type_description, which is passed through with raw_*; the latter is based on the raw column SearchType and translated using the given data dictionary
stop_purpose_description is based on raw column StopPurpose and is translated using the given data dictionary and passed through as reason_for_stop
reason_for_search represents the raw column Basis
Additional columns in the raw data that may be of interest: Ounces, Pounds, Kilos, Grams, Dosages, Weapons provide greater resolution on contraband; Gallons, Pints, Money, DollarAmt may also do so; EncounterForce (boolean), EngageForce (boolean); [Officer,Driver,Passenger]Injury (all booleans)
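
Two of the cleaning rules above (dropping the 00:00 placeholder and letting an arrest override a milder Action) can be sketched as below. The function names, the HH:MM string format, and the lowercase "arrest" label are assumptions made for illustration.

```python
def clean_time(time_str):
    """Treat the over-represented 00:00 value as missing."""
    return None if time_str in (None, "00:00") else time_str


def resolve_outcome(action_outcome, driver_arrest, passenger_arrest):
    """Outcome is the most severe result, so any arrest flag wins."""
    if driver_arrest or passenger_arrest:
        return "arrest"
    return action_outcome
```

Note that clean_time also discards stops that genuinely happened at midnight; that is the cost of removing the placeholder.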

Winston-Salem, NC

2000-01-11 to 2015-12-31

feature	coverage rate
date	100.0%
time	78.6%
location	100.0%
county_name	100.0%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	97.9%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	99.0%
reason_for_frisk	0.0%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Data is pulled out of Statewide, NC data, so refer to that for processing documentation
Missing data 8/2014, 1/2015, 2/2015, and 5/2015

Greensboro, NC

2000-01-04 to 2015-12-31

feature	coverage rate
date	100.0%
time	99.1%
location	100.0%
county_name	99.8%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	97.2%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	97.7%
reason_for_frisk	0.1%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Data is pulled out of Statewide, NC data, so refer to that for processing documentation
Missing data 8/2015, 11/2015, 11/2016, and 3/2014

Durham, NC

2001-12-28 to 2015-12-31

feature	coverage rate
date	100.0%
time	85.2%
location	100.0%
county_name	100.0%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	96.7%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	96.4%
reason_for_frisk	0.2%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Data is pulled out of Statewide, NC data, so refer to that for processing documentation
Missing data from 2008-2013:
- 2008 missing January data
- 2009 missing February, April, July, September, October, December
- 2010 missing February, November
- 2013 missing May

Fayetteville, NC

2000-01-07 to 2015-12-31

feature	coverage rate
date	100.0%
time	96.8%
location	100.0%
county_name	100.0%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	97.5%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	95.3%
reason_for_frisk	0.2%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Data is pulled out of Statewide, NC data, so refer to that for processing documentation

Charlotte, NC

2000-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	99.9%
location	100.0%
county_name	99.9%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	95.5%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	98.2%
reason_for_frisk	0.1%
reason_for_search	100.0%
reason_for_stop	100.0%
raw_Ethnicity	100.0%
raw_Race	100.0%
raw_action_description	100.0%

Data notes:

Data is pulled out of Statewide, NC data, so refer to that for processing documentation

Grand Forks, ND

2007-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	96.8%
lat	93.5%
lng	93.5%
subject_race	99.0%
subject_sex	100.0%
type	63.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
reason_for_stop	100.0%
raw_race	99.0%

Data notes:

Data is deduplicated on raw columns agency, date, time, sex, race, age, ht_ft, ht_in, house, and street, reducing the number of records by ~14.2%
Many of the offenses fall into categories other than obvious pedestrian or vehicular stops, e.g. BARKING DOG, and are encoded as NA for type, but the description is provided in reason_for_stop
The department says that arrest, search, and contraband are not recorded with stop data
There are unidentified, relatively large spikes every year in May or early June, e.g. 2010-05-08, 2011-06-02, 2012-05-05, 2013-05-04, 2014-05-10, 2015-05-09, 2016-05-20; it's unclear what these correspond to and the PD has not yet responded to our inquiry
subject_race is based on raw_race, which is passed through; the data does not appear to include Hispanic.

Statewide, ND

2010-01-01 to 2015-06-25

feature	coverage rate
date	100.0%
time	100.0%
location	99.8%
county_name	100.0%
subject_age	99.9%
subject_race	100.0%
subject_sex	100.0%
type	99.2%
violation	100.0%
outcome	100.0%
raw_Race	100.0%

Data notes:

The data contain records only for citations, not warnings.
Rows represent individual citations, not stops, so we remove duplicates by grouping by the other fields.
The violation field is populated by citation codes and their descriptions.
subject_race is mapped from raw_Race.
Note that deduping by violation_date_time, Age, sex, Race, county_name, street_cnty_rd_location, desc_of_area, highway, ref_point reduces rows by ~16.6%.

Statewide, NE

2002-01-01 to 2016-10-01

feature	coverage rate
date	100.0%
county_name	47.7%
subject_race	100.0%
department_name	100.0%
type	100.0%
search_conducted	100.0%
raw_dept_lvl	100.0%
raw_dept	100.0%
raw_Race	100.0%

Data notes:

The original data was aggregated. It was grouped by stop reason, outcome and whether there was a search separately. Therefore, it is not possible to cross tabulate them together. We only use the last grouping.
State and local stops are mixed together, identifiable by the raw_dept_lvl field. We map levels 1, 5, 9, 10, 11 to "Nebraska State Agency" in the department_name field; for the other levels, we fill department_name with raw_dept. Note that levels 1, 2, and 3 are state patrol, local P.D. and sheriff P.D.s, respectively; levels 5-12 are special agencies or sectors of some sort; there are no stops for level 4.
The data is by quarter, not by day. So all stop_dates are the first date of the quarter.
There is a strange jump (Q1) and then dip (Q2–4) in the data for 2012. This stems from all state patrol stops for 2012 being recorded as happening in the first quarter. Municipal departments seem to have okay dated data for 2012.
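
The department mapping and the quarter-to-date convention described above can be sketched as follows. The helper names are hypothetical, and raw_dept_lvl is assumed to be stored as an integer and the quarter as a number from 1 to 4.

```python
import datetime

# Levels mapped to the state agency label, as listed in the notes.
STATE_AGENCY_LEVELS = {1, 5, 9, 10, 11}


def department_name(raw_dept_lvl, raw_dept):
    """State-level stops get one label; other levels keep raw_dept."""
    if raw_dept_lvl in STATE_AGENCY_LEVELS:
        return "Nebraska State Agency"
    return raw_dept


def quarter_start(year, quarter):
    """First calendar date of the given quarter (1-4)."""
    return datetime.date(year, 3 * (quarter - 1) + 1, 1)
```

Because every stop in a quarter shares one date, daily or monthly counts computed from this column are not meaningful; aggregate by quarter instead.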

Statewide, NH

2014-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	87.7%
lng	87.7%
county_name	100.0%
subject_age	52.9%
subject_dob	52.9%
subject_race	63.9%
subject_sex	98.4%
type	100.0%
violation	94.1%
citation_issued	100.0%
warning_issued	100.0%
outcome	94.3%
raw_RACE_CDE	64.0%
raw_CITATION_RESPONSE_DSC	100.0%

Data notes:

The driver_race field was populated by hand-written codes that we manually decoded. They are prone to mislabeling and should be used with caution only. Also, a very high percentage of stops (>30%) are missing race data entirely. We map the most common codes, covering more than 99% of stops with data, but we do not interpret the long tail of misspellings because many of them are ambiguous, we do not want to make assumptions, and it does not significantly improve the data. We exclude this dataset from our analysis because it has too much missing race data.
We determine stop outcome (citation, warning, etc) using raw_CITATION_RESPONSE_DSC, and we determine subject_race from raw_RACE_CDE.
The driver_age field was not populated for the 2014.2 dataset.
Rows represent violations, not stops, so we remove duplicates by grouping by the other fields.

Statewide, NJ

2009-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
subject_race	3.9%
subject_sex	99.5%
officer_id	100.0%
officer_id_hash	100.0%
department_id	100.0%
type	100.0%
violation	77.0%
arrest_made	3.3%
citation_issued	77.8%
warning_issued	77.7%
outcome	78.0%
contraband_found	95.6%
frisk_performed	3.6%
search_conducted	3.7%
vehicle_color	97.3%
vehicle_make	96.3%
vehicle_model	24.9%
vehicle_registration_state	99.3%
raw_TOWNSHIP	100.0%
raw_RACE	100.0%
raw_Ethnicity	3.9%

Data notes:

New Jersey data may be updated: we still have a number of questions we are waiting on the state to answer.
New Jersey uses software produced by LawSoft Inc. There are two sets of data: CAD (computer aided dispatch, recorded at the time of stop) and RMS (record management system, recorded later). They have almost completely disjoint fields, and only RMS records have information on searches. We believe the data from the two systems should really be joined, but according to the NJSP there is not a programmatic way to do so. Therefore, we fully process the CAD data, which appears to be the dataset that corresponds to traffic stops. We did notice that you could join the RMS file if you combine a few of the fields in a certain way. This method isn't perfect, and there are lots of nulls, but we include it in hopes that some data is better than no data.
Because of the above, we only know search/frisk/contraband information in about 13% of stops.
In the CAD data, there are often multiple rows per incident. Some of these are identical duplicates, which we remove. For the remaining records, we group by CAD_INCIDENT, because the NJSP told us that each CAD_INCIDENT ID refers to one stop. We verified that more than 99.9% of CAD_INCIDENT IDs had unique location and time, implying that they did, in fact, correspond to distinct events.
driver_race and driver_gender correspond to the race and gender of the driver, not the passenger.
Statutes are mapped using the traffic code, where possible.
The CAD records contain TOWNSHIP which could be mapped to a county by running the values through the Google geocoder.
Additional raw data columns that might be of interest (note, these are only in 13% of data since they come from the spotty, impossible matching described above): Sobriety Test, CCH Check, NCIC Check, Warrant Check, Warrant. Note that since we do not have a guarantee that these 13% of rows with data are a random or representative sample, we do not recommend drawing conclusions from this information.
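
The check that each CAD_INCIDENT ID refers to one stop can be sketched as below. CAD_INCIDENT is the raw column named above; the location and time field names are hypothetical placeholders, since the notes do not name the columns that were compared.

```python
from collections import defaultdict


def share_of_consistent_incidents(rows, location_field="location", time_field="time"):
    """Fraction of CAD_INCIDENT IDs whose rows share one location and time."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["CAD_INCIDENT"]].add((row[location_field], row[time_field]))
    if not seen:
        return 0.0
    consistent = sum(1 for pairs in seen.values() if len(pairs) == 1)
    return consistent / len(seen)
```

A value close to 1.0 supports grouping by CAD_INCIDENT; the notes report more than 99.9% for this dataset.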

Camden, NJ

2013-05-01 to 2018-06-13

feature	coverage rate
date	99.9%
time	99.9%
location	98.7%
lat	98.1%
lng	98.1%
subject_age	98.9%
subject_dob	98.8%
subject_race	99.0%
subject_sex	99.9%
officer_last_name	100.0%
unit	42.7%
type	100.0%
disposition	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	47.2%
vehicle_color	71.6%
vehicle_make	71.7%
vehicle_model	70.9%
vehicle_registration_state	74.8%
vehicle_year	70.7%
raw_race	98.9%
raw_ethnicity	96.5%

Data notes:

Data is deduplicated on case_number, Incident Datetime, IncidentLocation, OfficerName, SubjectGender, Race, Ethnicity, DateOfBirth, VehicleYear, Color, Make, and Model, reducing the number of records by ~5.4%
Data does not contain search/contraband fields
There are 3 CFS_Codes, TRAFFIC STOP, PEDESTRIAN STOP, and freeform text, which is classified as vehicular since most reference a driver or traffic stop situation
It appears as though Camden police often classify Hispanics as white, since the stop rate for whites is extremely high and there are no stops for Hispanics
According to the PD, a "summons" is a citation, so that corresponds to citation_issued in this data
outcomes are based on the disposition column
subject_race is based on Race and Ethnicity, which are passed through as raw_race and raw_ethnicity

Henderson, NV

2011-06-30 to 2018-01-31

feature	coverage rate
date	100.0%
time	99.8%
location	100.0%
lat	98.4%
lng	98.4%
subject_age	99.0%
subject_dob	98.9%
subject_race	97.4%
subject_sex	98.1%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%
violation	96.6%
citation_issued	100.0%
outcome	100.0%
vehicle_color	95.5%
vehicle_make	96.1%
vehicle_type	85.5%
vehicle_registration_state	96.4%
raw_race	97.4%

Data notes:

Data is deduplicated on raw columns location, city, state, zip, off_dt, off_ti, dob, ht, sex, wt, eye, hair, make, ofcr_id, reducing the total number of records by ~2.1%
violation is a concatenation of offense_1 and offense_2 in the original data, separated by "|"
Missing reason_for_stop/search/contraband information
2012 has little or no data for July, August, and September; we have an outstanding inquiry as to why
2018 only has partial data
Data before 2011 is filtered out since 2010 data is so sparse it appears to be a recording error
One of the files, Traffic Stops 01-01-11 to 05-30-18.xlsx, came corrupted; we are attempting to get a clean copy of this
We assume these are all citations since the primary raw key appears to be 'cite', although we have an outstanding inquiry to confirm this
subject_race is based on raw column race, which is passed through as raw_race

Statewide, NV

2012-02-14 to 2016-05-31

feature	coverage rate
date	100.0%
subject_age	91.8%
subject_race	99.9%
type	100.0%
violation	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
raw_Race	99.9%

Data notes:

Nevada does not seem to record Ethnicity or have any records of Hispanic drivers, so we exclude it from our analysis.
Nevada does not record time of stop, making it ineligible for VOD analysis.
The violation field is a concatenation of two fields in the raw data: infraction codes and offense description.
Additional columns in the raw data that may be of interest: Citation Number.

Statewide, NY

2010-01-01 to 2017-12-14

feature	coverage rate
date	100.0%
time	100.0%
location	88.8%
county_name	100.0%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
type	100.0%
violation	100.0%
speed	33.4%
posted_speed	33.4%
vehicle_color	99.3%
vehicle_make	99.9%
vehicle_model	0.0%
vehicle_type	100.0%
vehicle_registration_state	97.0%
vehicle_year	98.7%
raw_RACE	100.0%

Data notes:

The data include only citations.
There is no data on searches.
The data stops at 2017-12-14.
subject_race is mapped from a raw data column which was passed through as raw_RACE.
location is simply a concatenation of three raw data columns: VIO_STREET, HWY_NUM, HWY_TYPE.
Additional columns in the raw data that may be of interest: LAW_SECTION, and DCJS_CODE (we do, however, provide violation in the clean data, which is called LAW_DESCRIPTION in the raw data, and appears to simply be the human readable description of LAW_SECTION and DCJS_CODE).

Albany, NY

2008-01-01 to 2017-12-30

feature	coverage rate
date	100.0%
time	99.6%
location	90.3%
lat	90.2%
lng	90.2%
subject_age	100.0%
subject_dob	99.9%
subject_race	67.8%
subject_sex	100.0%
type	100.0%
violation	99.5%
vehicle_color	98.7%
vehicle_make	99.2%
vehicle_registration_state	99.4%
vehicle_year	98.8%
raw_race	67.8%

Data notes:

Data is deduplicated on incident, mapinfo_lo, date, dob, sex, and race, reducing the number of records by ~28%
Search/contraband information is missing, as well as outcomes
subject_race is based on the raw column race, which is passed through as raw_race
violation represents raw column crime_code_A, which is a description of the alphanumeric crime_code column

Columbus, OH

2012-01-01 to 2016-12-30

feature	coverage rate
date	100.0%
time	99.9%
location	100.0%
lat	91.8%
lng	91.8%
precinct	88.2%
zone	88.2%
subject_race	100.0%
subject_sex	100.0%
type	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
search_conducted	100.0%
reason_for_stop	100.0%
raw_enforcement_taken	100.0%

Data notes:

Incident Number in the original data seems unreliable as it has several hundred entries for 9999 and 99999; furthermore, occasionally, it does appear to reference the same incident, but is duplicated for every distinct action taken against the subject
The raw data is deduplicated on Stop Date, Contact End Date, Ethnicity, Gender, ViolationStreet, and ViolationCrossStreet, reducing the number of records by ~15.8%
search_conducted and outcome are based on Enforcement Taken, which is passed through as raw_enforcement_taken

Statewide, OH

2010-01-01 to 2017-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	100.0%
lng	100.0%
county_name	99.9%
subject_race	91.3%
subject_sex	91.3%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
violation	49.3%
arrest_made	100.0%
warning_issued	100.0%
outcome	39.8%
contraband_found	19.8%
contraband_drugs	100.0%
search_conducted	100.0%
search_basis	100.0%
raw_DISP_STRING	93.4%
raw_ORC_STRING	47.3%
raw_DISPOSITIONS	6.6%
raw_race	91.3%

Data notes:

The stop_purpose field is populated by infraction codes. The corresponding laws can be read here.
There is no data for contraband being found, but a related field could potentially be reconstructed by looking at searches involving drugs and an arrest. We mark contraband_found as TRUE for drug-related arrests (extracted from raw_ORC_STRING), but we cannot determine whether the remainder are FALSE or whether some other type of contraband was recovered.
Counties were mapped using the provided dictionary, which is included in the raw data folder.
We cannot find disposition codes (in DISP_STRING) which clearly indicate whether a citation as opposed to a warning was given, although there is a disposition for warnings.
The data contains stops of both type TS and TSA, standing for "traffic stop" and "traffic stop additional". The latter have a higher search rate and tend to have additional information (i.e., ASINC_STRING is not NA). We include both types in analysis, as they do not appear to be duplicates (addresses and times do not match) and we do not have a clear reason to exclude either.
While there is data on search types, they only include consent and K9 searches, suggesting a potential difference in recording policy (many other states have probable cause searches and incident to arrest searches, for example).
officer_id refers to a single officer throughout their tenure on the state patrol, but it is re-assigned to a new trooper upon an officer's retirement.
raw_DISP_STRING is used to determine subject race, sex, stop outcome, and search information. See processing script for mappings.
Violations were mapped from raw_ORC_STRING.
2017 data has a slightly different format: information from DISP_STRING and ORC_STRING exists in raw_DISPOSITIONS for that year.
Additional columns from raw data that may be of interest: ASINC_STRING

Cincinnati, OH

2009-01-01 to 2018-05-28

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	99.9%
lng	99.9%
neighborhood	25.6%
beat	24.1%
subject_race	100.0%
subject_sex	99.9%
officer_assignment	98.5%
type	100.0%
disposition	25.3%
arrest_made	99.9%
citation_issued	99.9%
warning_issued	99.9%
outcome	80.9%
reason_for_stop	12.3%
vehicle_make	99.3%
vehicle_model	99.0%
vehicle_registration_state	98.5%
vehicle_year	99.1%
raw_race	100.0%
raw_action_taken_cid	99.9%
raw_field_subject_cid	100.0%

Data notes:

Data filters out passengers and records where sex is "NON-PERSON" (i.e. business)
Data is deduplicated on instance_id, interview_date, address_x, sex, race, and age_range_cid, which reduces the number of rows by ~56%
Addresses are "sanitized", i.e. 1823 Field St. -> 18XX Field St. Since 83% of given geocodes in the raw data are null, we replace X with 0 and get approximate geocoding locations
Data before 2009 is removed since it is so sparse it is likely not to be trusted, and 2018 only has partial data
reason_for_stop represents incident_type_desc in the raw data
outcomes are based on raw column actiontakencid, which is passed through as raw_action_taken_cid
type is based on field_subject_cid, which is passed through as raw_field_subject_cid
subject_race is based on race, which is passed through as raw_race
There are zero stops of Hispanic individuals reported after 2010.
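
The address handling described in the notes above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's processing code: the function name approximate_address is hypothetical, and it assumes the masked digits appear as a run of X characters at the end of the leading street number.

```python
import re

def approximate_address(sanitized: str) -> str:
    """Replace masked digits (X) in the leading street number with 0.

    Assumption: the street number is the first token and consists of
    zero or more digits followed by one or more X characters.
    """
    match = re.match(r"^(\d*)(X+)\b", sanitized)
    if match is None:
        return sanitized  # no masked street number; leave unchanged
    digits, masked = match.groups()
    return digits + "0" * len(masked) + sanitized[match.end():]

print(approximate_address("18XX Field St."))
```

Street names that merely begin with the letter X are left alone, because the pattern requires the run of X characters to end at a word boundary.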

Oklahoma City, OK

2011-01-01 to 2020-11-19

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	97.5%
lng	97.5%
geocode_source	97.5%
beat	93.8%
division	93.8%
sector	93.8%
subject_age	76.5%
subject_dob	76.5%
subject_race	99.7%
subject_sex	99.6%
officer_id	76.9%
officer_id_hash	76.9%
type	79.6%
violation	100.0%
arrest_made	23.1%
citation_issued	100.0%
outcome	100.0%
speed	31.3%
posted_speed	31.3%
vehicle_color	66.3%
vehicle_make	66.3%
vehicle_model	35.5%
vehicle_registration_state	63.5%
vehicle_year	59.7%
raw_dfnd_race	99.7%

Data notes:

Data is deduplicated on raw columns violDate, violTime, violLocation, DfndRace, DfndSex, and DfndDOB, reducing the number of records by ~15.7%
Partial data from before 2011 is filtered out, although early 2011 still seems to have missing/partial data; the last few months of 2017 are also missing
Search/contraband information is missing
subject_race is based on DfndRace, which is passed through as raw_dfnd_race; note that the data do not include classification of drivers as Hispanic.

2023 update:

218,674 new stops added.
Individuals whose race was marked as M have been recoded to other, rather than hispanic, as was the case in the previous release.

Tulsa, OK

2009-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	99.0%
lat	90.6%
lng	90.6%
division	100.0%
subject_race	99.0%
subject_sex	99.2%
type	70.6%
violation	100.0%
speed	38.6%
posted_speed	41.0%
vehicle_color	91.9%
vehicle_make	94.0%
vehicle_model	83.2%
vehicle_registration_state	93.5%
vehicle_year	93.3%
raw_race	99.0%

Data notes:

Data is deduplicated on raw columns violationdate, violation_location, officerdiv, race, and sex, reducing the number of records by ~30.0%
Data is all citations
Data appears to be all vehicular, although the PD hasn't confirmed that yet
subject_race is based on raw column race, which is passed through as raw_race

Statewide, OR

2010-01-01 to 2014-01-01

feature	coverage rate
date	92.6%
subject_race	100.0%
type	100.0%
raw_Race	100.0%

Data notes:

There is basically no data, including no data on Hispanic drivers, so we exclude Oregon from our analysis.
Counts for 2015 and 2016 are much lower than in earlier years.
subject_race is mapped from raw_Race

Philadelphia, PA

2014-01-01 to 2018-04-14

feature	coverage rate
date	100.0%
time	100.0%
location	98.0%
lat	94.4%
lng	94.4%
district	100.0%
service_area	100.0%
subject_age	99.8%
subject_race	100.0%
subject_sex	100.0%
type	100.0%
arrest_made	100.0%
outcome	5.1%
contraband_found	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
raw_race	100.0%
raw_individual_contraband	100.0%
raw_vehicle_contraband	100.0%

Data notes:

Data is deduplicated on raw columns datetimeoccur, location, districtoccur, lat, lng, gender, age, race, stoptype, individual_frisked, individual_searched, individual_arrested, individual_contraband, vehicle_frisked, vehicle_searched, vehicle_contraband, reducing the number of records by ~1.4%
Information on citations and warnings is missing, but arrests are included
search_person and search_vehicle correspond to raw columns individual_searched and vehicle_searched; we filled in false for NA values under the assumption that unrecorded search data represented the absence of a search
contraband_found is based on raw columns individual_contraband and vehicle_contraband, which are passed through as raw_*; if both of these were null and search_conducted was true, contraband_found was set to false
subject_race is based on the raw column race, which is passed through as raw_race
2018 has only partial data, and it appears to be the same for early 2014
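
The contraband_found rule described above might look roughly like the sketch below. It is only an illustration in Python, with None standing in for NA. Two points are assumptions not stated in the notes: that the two raw columns are combined with a logical OR when at least one is recorded, and that the value stays NA when both are null and no search was conducted.

```python
from typing import Optional

def contraband_found(individual_contraband: Optional[bool],
                     vehicle_contraband: Optional[bool],
                     search_conducted: bool) -> Optional[bool]:
    """Derive contraband_found from the two raw contraband columns."""
    if individual_contraband is None and vehicle_contraband is None:
        # Both raw values missing: treat as "nothing found" only when a
        # search actually happened; otherwise keep NA (assumption).
        return False if search_conducted else None
    # Assumption: contraband on either the person or the vehicle counts.
    return bool(individual_contraband) or bool(vehicle_contraband)
```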

Pittsburgh, PA

NOTE: Due to unexplained discrepancies in the raw data provided to us by Pittsburgh PD, we have temporarily removed this dataset from the site.

2008-01-01 to 2018-04-29

feature	coverage rate
date	99.9%
time	99.9%
location	100.0%
lat	97.7%
lng	97.7%
neighborhood	82.7%
subject_age	17.1%
subject_race	88.6%
subject_sex	96.2%
officer_id	100.0%
officer_id_hash	100.0%
officer_age	92.0%
officer_race	77.7%
officer_sex	78.1%
type	100.0%
violation	82.7%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	88.6%
contraband_found	86.8%
frisk_performed	82.7%
search_conducted	100.0%
reason_for_stop	17.3%
raw_zone	8.7%
raw_object_searched	12.5%
raw_race	100.0%
raw_ethnicity	16.4%
raw_zone_division	17.3%
raw_evidence_found	2.0%
raw_weapons_found	0.4%
raw_nothing_found	79.5%
raw_police_zone	82.7%
raw_officer_race	78.1%
raw_officer_zone	82.7%

Data notes:

The raw data for pedestrian stops actually has many cities in it, but here we filter to only Pittsburgh; vehicular stops do not have an associated city, and so are assumed to be only Pittsburgh
Raw data for vehicle stops has stop end time as well
There are instances when evidencefound is true but contrabandfound is NA, so we have an outstanding inquiry as to what evidencefound refers to; similarly, weaponsfound is sometimes true when contrabandfound is false and vice versa, so it's unclear whether the contraband is weapons or not, so for now we leave out contraband_weapons and have another outstanding inquiry
If a search was conducted and the stop type was vehicular (pedestrian stops don't provide search outcomes) and contrabandfound was NA, we set contraband_found to false; otherwise we use the value in the contrabandfound field. We do this under the assumption that false and NA for contraband_found are equivalent when a search occurred, i.e. an officer conducted a search and either found nothing or recorded nothing
search_conducted is true when any one of objectsearched (pedestrian stops), contrabandfound, evidencefound, weaponsfound, and nothingfound (vehicular stops) is not NA; all these are passed on as raw_*
Sex and gender do not match 73% of the time in pedestrian data, and race and ethnicity mismatch often as well. In both cases, if sex != gender or race != ethnicity, we set the value to NA, otherwise we coalesce(sex, gender) or coalesce(race, ethnicity) [this keeps values when one is NA but the other isn't]; we pass through all the raw values as raw_*
There are 4 zone-related columns in the raw data: zone, zone_division, policezone, and officerzone; we pass them through as raw_*
The data is deduplicated on raw columns stop_date, stopstart, stopend, address, officer_id, and person_id, reducing the number of rows by ~21.1%
violation represents raw column crimedescription
2008 and early 2009 appear to have partial data and 2018 only has the first 4 months
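
The sex/gender and race/ethnicity reconciliation described above can be illustrated with a small Python sketch. The function name reconcile is hypothetical, and None stands in for NA; the actual processing script may differ in detail.

```python
from typing import Optional

def reconcile(first: Optional[str], second: Optional[str]) -> Optional[str]:
    """Coalesce two raw columns, but return NA when they disagree."""
    if first is not None and second is not None and first != second:
        return None  # conflicting values: we cannot tell which is right
    # Otherwise keep whichever value is present (coalesce).
    return first if first is not None else second

print(reconcile("female", None))    # value kept when the other is NA
print(reconcile("male", "female"))  # mismatch becomes NA (None)
```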

Statewide, RI

2005-01-02 to 2015-12-31

feature	coverage rate
date	100.0%
time	100.0%
zone	100.0%
subject_yob	94.3%
subject_race	94.3%
subject_sex	94.3%
department_id	100.0%
type	100.0%
arrest_made	94.3%
citation_issued	94.3%
warning_issued	94.3%
outcome	93.0%
contraband_found	100.0%
contraband_drugs	73.0%
contraband_weapons	9.3%
contraband_alcohol	0.2%
contraband_other	3.5%
frisk_performed	100.0%
search_conducted	100.0%
search_basis	100.0%
reason_for_search	100.0%
reason_for_stop	94.3%
vehicle_make	62.4%
vehicle_model	45.1%
raw_BasisForStop	94.3%
raw_OperatorRace	94.3%
raw_OperatorSex	94.3%
raw_ResultOfStop	94.3%
raw_SearchResultOne	3.5%
raw_SearchResultTwo	0.2%
raw_SearchResultThree	0.0%

Data notes:

The stops are mapped to state patrol zones, which represent police barrack jurisdiction areas. However, there is no simple mapping between zones and counties. We store state patrol zones in the district column and use this column in our granular location analyses.
contraband information was mapped from raw_SearchResult[One/Two/Three].
Column search_basis is a standardized version of reason_for_search, which, if multiple reasons are provided, uses the hierarchy of: plain view, probable cause, other. If no search reason is given, we default to probable cause. Note that while the raw data contains a ConsentRequested column, we have no information about whether consent was given.
Additional columns in the raw data that may be of interest: SearchFrisk[One/Two/Three] (says whether searches and frisks were of the driver, passenger, or vehicle), Duration (A/B/C/NA), AdditionalOccupants, Road (I/S/N/NA), PlateType, PriorRecord (Y/N/T/NA), ConsentRequested.
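
The search_basis hierarchy described above could be sketched as follows. This Python snippet is illustrative only: it assumes the raw reasons have already been normalized to lowercase labels, that anything outside "plain view" and "probable cause" falls under "other", and that the function is applied only to stops where a search occurred. None of these details are confirmed by the notes.

```python
from typing import Sequence

def search_basis(reasons: Sequence[str]) -> str:
    """Pick one standardized basis from possibly several search reasons."""
    if not reasons:
        return "probable cause"  # default when no reason is given
    # Hierarchy from the notes: plain view, then probable cause, then other.
    for basis in ("plain view", "probable cause"):
        if basis in reasons:
            return basis
    return "other"

print(search_basis(["probable cause", "plain view"]))
```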

Statewide, SC

2005-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
location	100.0%
lat	23.4%
lng	23.4%
county_name	100.0%
subject_age	100.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	98.5%
officer_id_hash	98.5%
officer_age	99.5%
officer_race	100.0%
officer_last_name	100.0%
department_id	100.0%
type	100.0%
violation	66.3%
arrest_made	100.0%
citation_issued	100.0%
outcome	66.3%
contraband_found	100.0%
contraband_drugs	99.2%
contraband_weapons	99.2%
contraband_alcohol	0.0%
contraband_other	2.1%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
reason_for_stop	100.0%
raw_contact_type	100.0%
raw_sex	100.0%
raw_race	100.0%
raw_sectionnum	66.3%
raw_offensecode	66.3%
raw_contrabanddesc	0.7%
raw_officer_race	100.0%

Data notes:

The police_department field is populated by state patrol agency.
More data on local stops is available here. It is aggregated by race and age group — potentially scrapable if useful.
While there is data on violation, many of the stops have missing data.
Violation is a concatenation of sectionnum and offensecode in the raw data.
Original officer_badge_number is not unique, so it is hashed with the officer's last name and race to create officer_id_hash.
Additional columns in the raw data that may be of interest (note: many of these were used to construct search/contraband/arrest/outcome information in the clean data; see processing script for details): jailed, felonyarrest, armedwith (messy free field), using[drugs/alcohol], contraband[drugs/drugparaphenalia/weapons/other] (sic), [passenger/subject/vehicle]searched.

Statewide, SD

2012-01-01 to 2016-02-29

feature	coverage rate
date	100.0%
time	99.9%
location	16.6%
county_name	99.8%
subject_sex	99.2%
type	100.0%
violation	98.9%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
vehicle_color	76.0%
vehicle_make	92.4%
vehicle_model	79.6%
vehicle_registration_state	98.2%
vehicle_year	77.2%

Data notes:

Race data is missing, so we exclude South Dakota from our analysis.
Some county names were misrecorded.
Additional columns in raw data that may be of interest: Eye Color, Insurance, Commerical Vehicle (sic), Is Accident, Haz Mat Vehicle.

Statewide, TN

1971-01-05 to 2016-06-26

feature	coverage rate
date	100.0%
time	100.0%
location	71.0%
county_name	99.1%
subject_race	99.2%
subject_sex	99.8%
department_id	100.0%
department_name	100.0%
type	100.0%
violation	92.3%
citation_issued	100.0%
outcome	100.0%
vehicle_make	99.4%
vehicle_model	95.5%
vehicle_year	94.5%
raw_ORIG_TRFC_VIOL_CDE	100.0%
raw_CNTY_NBR	100.0%
raw_RACE_IND	99.2%
raw_SEX_IND	99.8%

Data notes:

The data contain only citations.
The codes in the CNTY_NBR field represent counties ordered alphabetically.
location is a concatenation of raw fields UP_STR_HWY (highway/street) and MLE_MRK_NBR (mile marker). It would be possible to map the highway and mile marker data to geo coordinates, as we did in Washington. However, since we are often missing mile marker or even mile marker and highway, we did not do so (as most would be NA).
raw_ORIG_TRFC_VIOL_CDE maps to violation, raw_CNTY_NBR maps to county_name, raw_RACE_IND maps to subject_race, raw_SEX_IND maps to subject_sex.
Additional raw data columns that may be of interest: SPEED, SPEED_LMT, TN_RSDNT_IND (resident boolean), HZRD_MTRL_IND (hazardous material boolean), MTR_CYCL_IND (motorcycle boolean), CNSTR_ZNE (construction zone boolean), WRKR_PRSNT (worker present in construction zone boolean), TRVL_DRCT (travel direction), ACCD_IND (accident boolean), CMV_IND (commercial vehicle boolean)

Nashville, TN

2010-01-01 to 2019-03-24

feature	coverage rate
date	100.0%
time	99.8%
location	100.0%
lat	94.0%
lng	94.0%
precinct	87.4%
reporting_area	89.2%
zone	87.4%
subject_age	100.0%
subject_race	99.9%
subject_sex	99.6%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%
violation	99.7%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	99.9%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	100.0%
reason_for_stop	99.7%
vehicle_registration_state	99.0%
notes	16.6%
raw_verbal_warning_issued	100.0%
raw_written_warning_issued	84.0%
raw_traffic_citation_issued	100.0%
raw_misd_state_citation_issued	77.6%
raw_suspect_ethnicity	100.0%
raw_driver_searched	100.0%
raw_passenger_searched	100.0%
raw_search_consent	100.0%
raw_search_arrest	100.0%
raw_search_warrant	100.0%
raw_search_inventory	100.0%
raw_search_plain_view	100.0%

Data notes:

Data is deduplicated on raw columns stop_date_time, stop_location_street, officer_employee_number, race, sex, and age_of_suspect, reducing the number of records by ~0.3%
There are 30 cases (of ~2.6M records) where search_conducted is ambiguous after the merge; these are left as NA, since NA after the above merge indicates that there were two distinct values for raw column searchoccur and it's unclear whether the correct value is true or false
reason_for_stop and violation are both translations of the original stop_type column; this column is sometimes the pretextual reason for the stop and does not always represent what the individual was ultimately cited for
contraband_drugs is raw column drugs_seized, contraband_weapons is weapons_seized, and contraband_found is evidenceseized
citation_issued is derived from traffic_citation_issued and misd_state_citation_issued, which are passed through as raw_*; misd_state_citation_issued is sometimes NA, so for the purposes of defining citation_issued, we consider NA to be false
warning_issued is derived from verbal_warning_issued and written_warning_issued, which are passed through as raw_*; written_warning_issued is sometimes NA, so for the purposes of defining warning_issued, we consider NA to be false
search_basis is based on the raw columns search_plain_view, search_consent, search_incident_to_arrest, search_warrant, and search_inventory, which are all passed on with the raw_* prefix
subject_race is derived from raw columns suspect_ethnicity and suspect_race, which are passed through with the raw_* prefix
search_person is derived from search_driver and search_passenger, which are passed through with the raw_* prefix
When contraband_found is NA, we fill it with false when a search occurred, under the assumption that the officer simply didn't record the absence of contraband
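
The NA handling for citation_issued and warning_issued described above can be sketched as below. This is a hedged Python illustration, with None standing in for NA. The notes only say the clean columns are "derived from" the two raw columns; combining them with a logical OR is an assumption, and the helper name any_true_na_false is hypothetical.

```python
from typing import Optional

def any_true_na_false(*flags: Optional[bool]) -> bool:
    """Return True if any flag is True, counting NA (None) as False."""
    return any(flag is True for flag in flags)

# citation_issued from traffic_citation_issued and misd_state_citation_issued
print(any_true_na_false(False, None))
# warning_issued from verbal_warning_issued and written_warning_issued
print(any_true_na_false(True, None))
```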

Arlington, TX

2016-01-01 to 2016-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	99.8%
lng	99.8%
beat	99.0%
district	99.0%
sector	98.1%
subject_race	100.0%
subject_sex	100.0%
officer_id	99.9%
officer_id_hash	99.9%
type	100.0%
outcome	0.0%
search_conducted	100.0%
reason_for_stop	100.0%
raw_1st_digit_race	100.0%
raw_4th_digit_final_outcome	100.0%
raw_6th_digit_search_outcome	100.0%

Data notes:

Unclear what PRA, xCoordinate, and yCoordinate are in the raw data
Missing data dictionaries for reason_for_stop, outcome, and search_outcome; the latter two are passed through as raw_*
subject_race is based on raw column 1st digit (Race), which is passed through as raw_1st_digit_race
Only 2016 data was provided

Austin, TX

2006-01-01 to 2016-06-30

feature	coverage rate
date	100.0%
subject_age	99.4%
subject_race	100.0%
subject_sex	99.9%
officer_id	100.0%
officer_id_hash	100.0%
type	77.2%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	100.0%
reason_for_stop	100.0%
vehicle_make	97.4%
vehicle_model	31.0%
vehicle_registration_state	98.5%
vehicle_year	77.1%
raw_ethnicity	66.6%
raw_person_search_search_based_on	3.6%
raw_person_search_search_discovered	4.1%
raw_person_searched	89.8%
raw_vehicle_search_search_based_on	2.0%
raw_vehicle_search_search_discovered	2.4%
raw_vehicle_searched	83.4%
raw_race_description	100.0%
raw_street_check_description	100.0%

Data notes:

Data is deduplicated on raw column street_check_case_number, occurred_date, officer, sex, race, ethnicity, yob, veh_type, veh_year, veh_make, veh_model, veh_style, and soi, reducing the number of rows by ~0.5%
Data does not include location or outcomes
There are no clear pedestrian-only discretionary stops in reason_checked_description; SUSPICIOUS PERSON / VEHICLE is one category in reason_for_stop, but is included with "vehicular" stops; as such, it may over count vehicular stops
reason_for_stop represents raw column reason_checked_description
search_person and search_vehicle represent person_searched and vehicle_searched in the raw data, which are passed through with raw_*; for the canonical columns search_person and search_vehicle, NA values are changed to false under the assumption that the absence of a search may not always be recorded
frisk_performed is based on person_search_search_based_on, and is false when that column is NA, on the assumption that the officer did not record the absence of a frisk
search_basis is derived from person_search_search_based_on and vehicle_search_search_based_on, which are passed through with the raw_* prefix
contraband_{found,drugs,weapons} are derived from person_search_search_discovered and vehicle_search_search_discovered, which are passed through with raw_* prefix; when these values are NA, they are assumed to be FALSE for contraband discovery
reason_for_stop represents the raw column reason_checked_description; however, the raw column street_check_description also seems to provide information, so it is passed through with the raw_ prefix
subject_race is based on raw columns race and ethnicity; there is also a raw race_description column, which is passed through with the raw_ prefix (instead of the race column, since it is just a nicer translation of the single characters in race); ethnicity is also passed through with the raw_ prefix

Garland, TX

2012-01-03 to 2019-06-22

feature	coverage rate
date	100.0%
time	99.9%
location	0.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
officer_race	34.1%
officer_sex	97.4%
officer_first_name	99.9%
officer_last_name	100.0%
type	100.0%
disposition	100.0%
violation	100.0%
citation_issued	100.0%
outcome	100.0%
speed	49.8%
posted_speed	49.8%
vehicle_color	98.6%
vehicle_make	99.5%
vehicle_registration_state	99.2%
vehicle_year	58.4%
raw_race	100.0%
raw_alleged_speed	81.8%
raw_posted_speed	81.8%

Data notes:

Data is deduplicated on raw columns sex, race, vehicle_year, vehicle_color, make, vehicle_state, incident_date, incident_time, and officer_badge, reducing the number of records by ~33.1%
incident_address (location in clean) is 100% null; we have an outstanding inquiry here
We assume these are all citations since they appear to be indexed by ticket number, but we have an outstanding task to clarify this
violation represents offense_title in the raw data
Data is lacking reason_for_stop/search/contraband information
officer_race is mostly NA or "U", the remainder are white or Asian/Pacific Islander, so this data is probably unreliable
subject_race is based on raw column race, which is passed through with the raw_ prefix
Sometimes the same stop has different speeds recorded: often one record has a pair of legitimate values (e.g., going 55 in a 40) while the others have 0 and 0 or NA and NA, possibly because multiple tickets are issued for the same stop. For each stop, we take the max of each speed column to represent the speeds; raw_alleged_speed and raw_posted_speed are passed through. When the resulting values were 0 or -Inf, we set them to NA under the assumption that this was a stop unrelated to speed
2012 and 2018 have only partial data
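
The speed handling described above can be sketched as follows. This is an illustrative Python version, not the project's code. It assumes the -Inf mentioned in the notes arises from taking a max over records whose speeds are all NA, which the sketch mirrors with a default of negative infinity. The function name stop_speed is hypothetical, and None stands in for NA.

```python
import math
from typing import Optional, Sequence

def stop_speed(values: Sequence[Optional[float]]) -> Optional[float]:
    """Collapse the speeds recorded for one stop into a single value."""
    recorded = [v for v in values if v is not None]
    # Max over the recorded values; -inf when every value is NA.
    top = max(recorded, default=-math.inf)
    # 0 or -inf means no real speed was recorded for this stop.
    return None if top in (0, -math.inf) else top

print(stop_speed([55, 0, None]))
```

The same function would be applied separately to the alleged speeds and the posted speeds of each stop.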

Houston, TX

2014-01-01 to 2020-11-04

feature	coverage rate
date	100.0%
time	36.8%
location	91.3%
lat	91.1%
lng	91.1%
geocode_source	91.1%
beat	85.8%
district	85.8%
subject_race	79.8%
subject_sex	99.8%
subject_first_name	100.0%
subject_middle_name	62.6%
subject_last_name	100.0%
type	100.0%
violation	100.0%
citation_issued	100.0%
outcome	100.0%
speed	29.9%
posted_speed	30.8%
vehicle_color	96.6%
vehicle_make	98.6%
vehicle_model	97.1%
raw_race	79.8%

Data notes:

Data is deduplicated on raw columns Defendant Name, Gender, Race, Street, Block, Scnd Street, Scnd Block, Officer Name, and Offense Date, reducing the number of records by ~0.02%; there is a possibility this over collapses rows in the case where an officer pulls over the same person twice in the same day at the same location
Data is lacking search/contraband information
Data consists only of citations
When speed and posted_speed were 0, we set them to NA, under the assumption that this was a default value and the stop was unrelated to speed
subject_race is based on the raw column Race, passed through as raw_race

2023 update:

670,052 new stops added.
2018 has been updated to the full year, and a small number of rows not corresponding to an actual stop contained in the previous release have been removed.

Lubbock, TX

2008-05-01 to 2018-04-30

feature	coverage rate
date	100.0%
location	100.0%
lat	99.7%
lng	99.7%
officer_first_name	71.0%
officer_last_name	99.0%
type	100.0%
disposition	99.6%
citation_issued	99.6%
warning_issued	99.6%
outcome	82.0%

Data notes:

Insufficient information here to deduplicate records, if there are duplicates
Missing reason_for_stop/search/contraband/subject_sex/subject_race data
There is an outstanding ask for a data dictionary for the disposition codes

Plano, TX

2012-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	99.4%
location	49.7%
lat	48.6%
lng	48.6%
beat	46.6%
sector	46.6%
subject_age	0.6%
subject_race	100.0%
subject_sex	100.0%
officer_id	23.1%
officer_id_hash	23.1%
officer_last_name	48.5%
unit	23.1%
type	100.0%
violation	100.0%
arrest_made	98.0%
citation_issued	85.9%
warning_issued	31.9%
outcome	98.2%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
search_conducted	100.0%
search_basis	100.0%
speed	12.6%
posted_speed	13.3%
vehicle_color	22.9%
vehicle_make	23.0%
vehicle_model	22.5%
vehicle_type	21.9%
notes	22.8%
raw_race	100.0%
raw_contraband	21.5%
raw_contraband_found	9.1%
raw_results	74.4%
raw_ethnicity	100.0%

Data notes:

Data is rather messy from year to year with different columns, and files with "all_traffic_stops" in the name are difficult to join into the other incident data, since the incident number in those files is populated only ~15% of the time; location data is spread across 4 columns in different files, and each is null at least 75% of the time
violation is a concatenation of violation_description, primary_violation, offense, and offense_[1-8], which are all null most of the time, the separator is a comma
Data is deduplicated on date, time, location, officer_id, subject_age, subject_race, and subject_sex, reducing the number of records by ~0.0004%, but some of this may be over-deduplication because NAs are common in location, officer_id, and subject_age
location is a coalesced version of the raw columns location, violation_location, offense_location, and arrest_location, all of which are ~75% null independently
raw_results is a concatenation of officer_result, result, and result_[1-8], separated by a comma; outcomes are based on this column, as well as the warning, citation, and citation_number columns in the raw data
search_conducted represents search_conducted, search_performed, and searched raw columns coalesced (they are mutually exclusive); similarly, consent in search_basis is based on search_consent, search_consent_2, and consent in the raw data, coalesced, and arrest_made is a coalesced version of arrest and arrested
When the contraband and contraband_found raw columns are NA, they are assumed to be false or no contraband found for the canonical contraband_found column in the clean data
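
The coalescing and concatenation steps described above can be sketched in Python as below. This is only an illustration: the helper names are hypothetical, None stands in for NA, the coalescing order is assumed to follow the order in which the columns are listed in the notes, and the exact separator (a bare comma versus comma plus space) is an assumption exposed as a parameter.

```python
from typing import Optional

def coalesce(*values: Optional[str]) -> Optional[str]:
    """Return the first non-NA value, or NA if all are missing."""
    return next((v for v in values if v is not None), None)

def concat_non_null(*values: Optional[str], sep: str = ",") -> Optional[str]:
    """Join the non-NA values with a separator; NA if none are present."""
    parts = [v for v in values if v is not None]
    return sep.join(parts) if parts else None

# location: location, violation_location, offense_location, arrest_location
print(coalesce(None, "100 MAIN ST", None, None))
# violation: violation_description, primary_violation, offense, offense_[1-8]
print(concat_non_null("SPEEDING", None, "NO INSURANCE"))
```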

Statewide, TX

2006-01-01 to 2017-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	92.0%
lat	58.2%
lng	58.2%
county_name	100.0%
district	92.0%
precinct	32.9%
region	92.0%
subject_race	100.0%
subject_sex	100.0%
officer_id	100.0%
officer_id_hash	100.0%
officer_last_name	64.1%
type	100.0%
violation	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
contraband_found	100.0%
contraband_drugs	100.0%
contraband_weapons	100.0%
search_conducted	92.0%
search_vehicle	90.4%
search_basis	100.0%
vehicle_color	45.2%
vehicle_make	71.9%
vehicle_model	66.3%
vehicle_type	99.9%
vehicle_year	67.1%
raw_HA_SEARCH_PC_boolean	92.0%
raw_HA_SEARCH_CONCENT_boolean	92.0%
raw_HA_INCIDTO_ARREST_boolean	92.0%
raw_HA_VEHICLE_INVENT_boolean	92.0%

Data notes:

There is evidence that minority drivers are labeled as white in the data. For example, see this report from KXAN. We remapped the driver race field as provided using the 2000 surnames dataset released by the U.S. Census. See the processing script or paper for details.
We asked whether there was a field which provided arrest data, but received no clarification. There is data on incident to arrest searches, but this does not necessarily identify all arrests.
Based on the provided data dictionary as well as clarification from DPS via email, we classify THP6 and TLE6 in HA_TICKET_TYPE as citations and HP3 as warnings.
The data only records when citations and warnings were issued, but not arrests.
We did not receive any search information in the 2017 data.
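
The ticket-type classification described above amounts to a small lookup, sketched here in Python. The codes come from the notes; the function name classify_ticket is hypothetical, and treating any other code as neither a citation nor a warning is an assumption.

```python
from typing import Tuple

CITATION_TYPES = {"THP6", "TLE6"}  # HA_TICKET_TYPE values treated as citations
WARNING_TYPES = {"HP3"}            # HA_TICKET_TYPE values treated as warnings

def classify_ticket(ticket_type: str) -> Tuple[bool, bool]:
    """Return (citation_issued, warning_issued) for one HA_TICKET_TYPE value."""
    return ticket_type in CITATION_TYPES, ticket_type in WARNING_TYPES

print(classify_ticket("THP6"))
print(classify_ticket("HP3"))
```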

San Antonio, TX

2012-01-01 to 2020-06-30

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	99.8%
lng	99.8%
geocode_source	99.8%
district	92.5%
substation	92.5%
subject_age	99.9%
subject_race	99.9%
subject_sex	99.7%
type	97.6%
violation	100.0%
arrest_made	100.0%
citation_issued	100.0%
outcome	100.0%
contraband_found	100.0%
search_conducted	100.0%
search_basis	100.0%
speed	52.0%
posted_speed	52.0%
vehicle_color	83.5%
vehicle_make	83.9%
vehicle_model	83.4%
vehicle_registration_state	83.7%
vehicle_year	82.5%
raw_race	99.9%
raw_posted_speed	100.0%
raw_actual_speed	100.0%
raw_search_reason	15.1%
raw_contraband_or_evidence	15.1%
raw_custodial_arrest_made	14.7%

Data notes:

Data is deduplicated on citation_number, reducing the number of rows by 23.3%. Deduping on date, time, location, subject_race, subject_sex, and subject_age instead reduces the number of records by 25.3%, roughly 2% more than citation number alone; curiously, though, there are often rows that have identical information on those columns but different recorded speeds, so it's unclear whether these are duplicates with misentries or distinct events. We err on the side of caution and consider them distinct events. There also appear to be multiple offenses related to each citation, and the speeds for those not involving speed are set to 0; accordingly, we take the maximum speed and posted speed to represent the speeds for every citation/record
Data consists only of arrests and citations
contraband_found is based on the raw column Contraband Or Evidence; when this is NA, it is set to false under the assumption that an officer may not always record the absence of contraband found; the raw column is passed through as raw_contraband_or_evidence
search_basis is based on the raw column Search Reason, which is passed through as raw_search_reason
search_conducted is false when Search Reason is NA, "No Search", or one of the ~200 values that look like incorrect entries (e.g., A, 9, 6)
subject_race is based on the raw column Race and is passed through as raw_race
arrest_made is based on the raw column Custodial Arrest Made, which is passed through as raw_custodial_arrest_made; arrest_made is true when Custodial Arrest Made is true and false when it is false or NA
2018 has only the first 4 months of data
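
The NA-handling rules in the notes above can be sketched in Python as follows. This is a minimal illustration and not the project's processing code: the function name derive_flags, the dict-per-row format, the use of None for NA, and the assumption that the raw contraband and arrest columns already hold booleans are all hypothetical. INVALID_SEARCH_REASONS lists only the three example values named above, not the full set of roughly 200 entries.

```python
# Hypothetical subset of the values treated as mis-entered search reasons;
# the notes name only "A", "9", and "6" as examples.
INVALID_SEARCH_REASONS = {"A", "9", "6"}


def derive_flags(raw_row):
    """Derive boolean columns from one raw row (a dict; None stands for NA)."""
    reason = raw_row.get("Search Reason")
    contraband = raw_row.get("Contraband Or Evidence")
    arrest = raw_row.get("Custodial Arrest Made")
    return {
        # No search when the reason is NA, "No Search", or looks mis-entered.
        "search_conducted": not (
            reason is None
            or reason == "No Search"
            or reason in INVALID_SEARCH_REASONS
        ),
        # NA is treated as "no contraband found".
        "contraband_found": contraband is True,
        # False and NA are both treated as "no arrest".
        "arrest_made": arrest is True,
    }
```

Because NA is folded into false, a false value in these columns cannot be distinguished from a missing raw entry; the raw_* columns retain that distinction.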

2023 update:

260,675 new stops added.
Approximately thirty stops have been recoded from district 2330 to district 2360. These stops occurred exactly on the border between these two districts.
Stops are now deduplicated on citation number and year, to reflect that one citation number was recycled during the extended period covered in the new release. This does not affect any records of stops present in the previous release.
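
A minimal sketch of deduplicating on citation number and year while keeping the maximum recorded speeds, as described in the notes for this location. It assumes each row is a dict with hypothetical keys citation_number, date (in YYYY-MM-DD format), speed, and posted_speed, and that speeds are numeric with 0 for offenses not involving speed. The function name is illustrative, and the first row seen supplies the remaining columns, which is an assumption.

```python
def dedupe_citations(rows):
    """Collapse rows sharing (citation_number, year), keeping the max speeds."""
    merged = {}
    for row in rows:
        # The year is the first four characters of a YYYY-MM-DD date.
        key = (row["citation_number"], row["date"][:4])
        if key not in merged:
            merged[key] = dict(row)  # copy so the input rows are not modified
            continue
        kept = merged[key]
        # Offenses not involving speed carry 0, so the maximum recovers
        # the recorded speed for the citation.
        for col in ("speed", "posted_speed"):
            kept[col] = max(kept[col], row[col])
    return list(merged.values())
```

Including the year in the key keeps a recycled citation number from merging two stops that occurred years apart.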

Statewide, VA

2006-01-07 to 2016-04-23

feature	coverage rate
date	100.0%
location	100.0%
county_name	87.3%
subject_race	100.0%
officer_id	100.0%
officer_id_hash	100.0%
officer_race	0.0%
officer_first_name	100.0%
officer_last_name	100.0%
type	100.0%
search_conducted	100.0%
raw_officer_race	100.0%
raw_race	100.0%

Data notes:

The original data was aggregated by week.
Some rows have an implausibly high number of stops or searches. We have an outstanding inquiry on this, but have not heard back. In particular, spikes in each week usually seem to be driven by a single officer with an implausibly high number of stops or searches (e.g., about 1,000 searches by an officer in a single week). Each spike seems to be driven by a different officer. Since this reporting seems highly unlikely to be accurate, we exclude VA from search analyses.
Counties were mapped using the provided dictionary, which is included in the raw data folder.
There are no written warnings in Virginia and verbal warnings are not recorded, so all records are citations or searches without further action taken.
In the raw data, "Traffic arrests" refer to citations without a search. "Search arrests" refer to a citation and a search (either before or after the citation). "Search stops" refer to searches without a corresponding citation.
Additional columns in raw data that may be of interest: officer name.
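
The three raw categories described above can be read as combinations of a citation flag and a search flag. The lookup below is a sketch of that reading only; the table name, the function name, and the citation_issued key are illustrative and do not imply that such a column exists in the cleaned VA data.

```python
# Raw category -> (was a citation issued?, was a search conducted?)
RAW_CATEGORY_FLAGS = {
    "Traffic arrests": {"citation_issued": True, "search_conducted": False},
    "Search arrests": {"citation_issued": True, "search_conducted": True},
    "Search stops": {"citation_issued": False, "search_conducted": True},
}


def flags_for(category):
    """Return a copy of the flags for a raw category; KeyError if unknown."""
    return dict(RAW_CATEGORY_FLAGS[category])
```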

Burlington, VT

2012-01-01 to 2020-05-27

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	84.4%
lng	84.4%
subject_age	83.8%
subject_dob	83.8%
subject_race	95.4%
subject_sex	96.4%
department_name	100.0%
type	100.0%
violation	98.7%
arrest_made	98.0%
citation_issued	98.0%
warning_issued	98.0%
outcome	98.0%
contraband_found	100.0%
search_conducted	100.0%
search_basis	100.0%
reason_for_search	100.0%
reason_for_stop	84.4%
vehicle_registration_state	14.9%
raw_race	95.5%
raw_gender	96.5%
raw_contraband_evidence	97.5%
raw_outcome_of_stop	98.0%

Data notes:

Data is deduplicated on raw columns issued_at, location, race, gender, city, dob, lat, lon, reducing the number of records by ~7.0%
Calls are also provided in the raw data, but aren't loaded here
subject_race is based on the raw column race which is passed through as raw_race, and gender is passed through as raw_gender
reason_for_stop represents the raw column stop_based_on, and reason_for_search represents the raw column search_based_on and forms the basis for search_conducted and search_basis
When reason_for_search (i.e., search_based_on) is NA, we assume search_conducted is false
outcomes are based on raw column outcome_of_stop, which is passed through as raw_outcome_of_stop
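
As an illustration of the deduplication step in the notes above, the sketch below drops rows that repeat the same values in the listed raw columns and reports the share of rows removed. The dict-per-row format and the function names are assumptions; only the column list comes from the notes.

```python
DEDUPE_COLUMNS = (
    "issued_at", "location", "race", "gender", "city", "dob", "lat", "lon",
)


def dedupe(rows, columns=DEDUPE_COLUMNS):
    """Keep the first row for each distinct combination of `columns`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


def reduction_rate(rows, unique):
    """Fraction of rows removed by deduplication (0.0 for empty input)."""
    return 1 - len(unique) / len(rows) if rows else 0.0
```

Comparing reduction_rate across candidate column sets is one way to see how sensitive the row count is to the choice of deduplication key.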

2023 update:

5,332 new stops added.

Statewide, VT

2010-07-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	92.6%
lng	92.6%
subject_age	99.6%
subject_race	98.6%
subject_sex	99.4%
officer_id	100.0%
officer_id_hash	100.0%
department_name	100.0%
type	100.0%
arrest_made	99.2%
citation_issued	99.2%
warning_issued	99.2%
outcome	99.2%
contraband_found	100.0%
search_conducted	100.0%
search_basis	100.0%
raw_stop_city	99.8%
raw_stop_reason_description	99.2%
raw_stop_search_description	99.2%
raw_stop_outcome_description	99.2%
raw_driver_gender	99.4%
raw_driver_race	98.6%

Data notes:

Stop purpose information is not very granular — there are only five categories, and we have no way of identifying speeding. See raw_stop_reason_description.
The search type field includes "Consent search — probable cause" and "Consent search — reasonable suspicion". It is not entirely clear what these mean; we cannot find analogues in other states.
Counties could be mapped by running the cities in the raw_stop_city field through Google's geocoder.
location is a simple concatenation of address, city, state, zip.
search_conducted was mapped from raw_stop_search_description.
contraband_found was mapped from raw_stop_contraband_description.

Statewide, WA

2009-01-01 to 2015-12-31

feature	coverage rate
date	72.9%
time	100.0%
location	91.3%
lat	86.7%
lng	86.7%
county_name	86.7%
subject_age	71.6%
subject_race	71.8%
subject_sex	71.9%
officer_race	72.9%
officer_sex	100.0%
officer_first_name	100.0%
officer_last_name	100.0%
department_name	100.0%
type	100.0%
violation	41.2%
arrest_made	64.3%
citation_issued	100.0%
warning_issued	100.0%
outcome	70.9%
contraband_found	100.0%
frisk_performed	100.0%
search_conducted	100.0%
search_basis	100.0%
raw_officer_race	72.9%
raw_officer_gender	100.0%
raw_contact_type	100.0%
raw_driver_race	71.8%
raw_driver_gender	71.9%
raw_search_type	71.9%
raw_enforcements	70.9%

Data notes:

Counties were mapped by doing a reverse look-up of the geo lat/long coordinate of the highway post that was recorded for the stop, then mapping that latitude and longitude to a county using a shapefile. Details are in the WA_map_locations.R script.
Arrests and citations are grouped together in the stop_outcome, so we cannot reliably identify arrests. There is data on incident to arrest searches, but this does not necessarily identify all arrests.
If one were to dedupe on employee_last, employee_first, officer_race, officer_gender, contact_date, contact_hour, highway_type, road_number, milepost, driver_race, driver_age, and driver_gender, it would yield ~3.4% fewer rows. Without deduping, there are a few officers who seem to stop a suspiciously high, but not altogether unreasonable, number of people in an hour. However, we ultimately chose not to dedupe, since most of the "duplicate" rows have NA for the driver demographics and other fields.
Weigh station stops were removed.
raw_enforcements is simply a concatenation of 12 enforcement columns in the raw data.
Additional columns in the raw data that may be of interest: officer name
A raw enforcement value of 1 corresponds to either a citation or an arrest. Under the assumption that most values were citations, we set citation_issued = TRUE and arrest_made = NA when raw enforcement value was 1, although this will likely mislabel a small number of arrests.
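
The rule for a raw enforcement value of 1 can be sketched as below. The function name, the dict-per-stop format, and the representation of the enforcement columns as a list of integers are assumptions. The notes do not describe how other enforcement values are handled, so this sketch leaves the stop unchanged in that case.

```python
def apply_enforcement_rule(stop, enforcement_values):
    """Return a copy of `stop` with the value-1 rule applied.

    `enforcement_values` is a hypothetical list holding the values of the
    raw enforcement columns for this stop.
    """
    updated = dict(stop)
    if 1 in enforcement_values:
        # A value of 1 means a citation or an arrest; it is treated as a
        # citation, and the arrest status is left unknown (None stands for NA).
        updated["citation_issued"] = True
        updated["arrest_made"] = None
    return updated
```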

Tacoma, WA

2007-09-11 to 2017-09-10

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	82.5%
lng	82.5%
sector	79.7%
subsector	79.7%
officer_id	100.0%
officer_id_hash	100.0%
type	100.0%
disposition	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	64.5%

Data notes:

reason_for_stop is not recorded, and search/contraband information is not in their database, only in written reports; subject_race is also not recorded

Seattle, WA

2006-01-01 to 2015-12-31

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	91.8%
lng	91.8%
beat	90.7%
precinct	90.1%
sector	90.7%
subject_age	31.1%
subject_dob	31.0%
subject_race	0.1%
subject_sex	0.1%
officer_id	96.0%
officer_id_hash	96.0%
officer_first_name	85.3%
officer_last_name	96.0%
type	100.0%
disposition	100.0%
violation	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	58.5%
vehicle_color	4.4%
vehicle_make	3.4%
vehicle_model	3.3%
vehicle_registration_state	6.3%
vehicle_year	0.2%
raw_type_description	98.0%
raw_vehicle_description	6.3%

Data notes:

citation_issued includes criminal and non-criminal citations
violation represents raw column mir_description
type is based on violation (mir_description) and type_description, which is passed through as raw_type_description
outcomes are based on disposition, which represents raw column disposition_description
vehicle_* columns are based on a coalesced combination of veh and vehcile (sic) columns in the raw data; this is passed through as raw_vehicle_description
The officer ID associated with the raw column officer_no_1 is not unique, so it is hashed with the officer's last name to create officer_id_hash
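
A minimal sketch of combining a non-unique officer ID with the officer's last name to obtain a distinguishing hash. The choice of SHA-256, the separator, the name normalization, and the truncation to 10 hex characters are all assumptions made for illustration; the hash function used for the released data is not stated in these notes.

```python
import hashlib


def officer_id_hash(officer_no, last_name):
    """Hash an officer number together with a last name (illustrative only)."""
    # Normalize the name so that case and stray whitespace do not
    # produce different hashes for the same officer.
    name = (last_name or "").strip().upper()
    material = f"{officer_no}|{name}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:10]
```

Two officers who share an officer number but have different last names receive different hashes, while repeated records for the same officer map to the same value.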

Madison, WI

2007-09-28 to 2020-06-16

feature	coverage rate
date	100.0%
time	100.0%
location	92.0%
lat	91.4%
lng	91.4%
geocode_source	91.4%
district	85.3%
sector	85.3%
subject_age	16.1%
subject_race	98.9%
subject_sex	99.6%
officer_first_name	83.9%
officer_last_name	83.9%
type	100.0%
violation	99.5%
citation_issued	100.0%
warning_issued	100.0%
outcome	100.0%
speed	23.0%
posted_speed	23.0%
vehicle_color	94.4%
vehicle_make	97.5%
vehicle_model	20.8%
vehicle_registration_state	98.8%
vehicle_year	97.6%
raw_race	98.9%

Data notes:

Data is deduplicated on raw columns Date, Time, onStreet, onStreetName, OfficerName, Race, Sex, Make, Model, Year, State, Limit, and OverLimit, reducing the number of rows by ~0.7%
violation represents raw column Statute Description
Search/contraband information is missing
Data only includes warnings and citations, no arrests
If there was no Ticket #, this was assumed to be a warning
Shapefiles don't include district 2 and its accompanying sectors
subject_race is based on raw column Race, passed through as raw_race
2007 has partial data and looks suspect; 2017 is missing October, November, and December
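
The warning rule above (no Ticket # implies a warning) can be sketched as follows. The function name, the treatment of blank strings as missing, and the output labels "citation" and "warning" are assumptions for illustration.

```python
def infer_outcome(ticket_number):
    """Classify a stop as a citation or a warning from its ticket number."""
    # A missing or blank ticket number is assumed to mean a warning.
    has_ticket = ticket_number is not None and str(ticket_number).strip() != ""
    return "citation" if has_ticket else "warning"
```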

2023 update:

53,526 new stops added.
Data from 2017 cover only the period January to September; we did not obtain stops for October through December of that year.

Statewide, WI

2010-01-01 to 2016-05-16

feature	coverage rate
date	100.0%
time	100.0%
location	100.0%
lat	32.7%
lng	32.7%
county_name	100.0%
subject_race	85.5%
subject_sex	85.6%
officer_first_name	99.6%
officer_last_name	100.0%
department_id	100.0%
department_name	100.0%
type	100.0%
violation	100.0%
arrest_made	100.0%
citation_issued	100.0%
warning_issued	100.0%
outcome	99.9%
contraband_found	100.0%
contraband_drugs	91.4%
contraband_weapons	78.8%
contraband_alcohol	0.8%
contraband_other	1.2%
search_conducted	100.0%
search_person	100.0%
search_vehicle	100.0%
search_basis	99.9%
vehicle_color	86.9%
vehicle_make	87.0%
vehicle_model	74.6%
vehicle_type	87.3%
vehicle_registration_state	72.2%
vehicle_year	77.3%
raw_onHighwayDirection	87.6%
raw_onHighwayName	89.7%
raw_fromAtStreetName	75.9%
raw_race	85.5%
raw_sex	85.6%
raw_individualSearchConducted	85.7%
raw_vehicleSearchConducted	100.0%
raw_individualContraband	1.1%
raw_vehicleContraband	1.0%
raw_summaryOutcome	100.0%
raw_individualSearchBasis	1.1%
raw_vehicleSearchBasis	1.0%

Data notes:

The data come from two systems ("7.3" and "10.0") that succeeded each other. They have different field names and are differently coded. This is particularly relevant for the violation field, which has a different encoding between the two systems; in order to map violations, we used the dictionaries provided by the state for both systems.
There are two copies of the data: warnings and citations. Citations seem to be a strict subset of warnings, with some citation codes being different.
The police_department field is populated by highway patrol agencies. There are only 6 of them.
There are very few consent searches relative to other states, suggesting a potential difference in recording policy.
raw_[individual/vehicle]Contraband were mapped using a data dictionary provided by the department: 01 = WEAPON(S); 02 = EXCESSIVE CASH; 03 = ILLICIT DRUG(S)/PARAPHERNALIA; 04 = EVIDENCE OF A CRIME; 05 = INTOXICANT(S); 06 = STOLEN GOODS; 99 = OTHER; 00 = NONE
raw_[individual/vehicle]SearchBasis were mapped using a data dictionary. There is no code for "plain view". 1 = Consent; 2 = Probable Cause; the rest of the search basis categories are Warrant, Incident to Arrest, Inventory, and Exigent Circumstances
violation was mapped directly from StatuteDescription in the raw data.
location is a concatenation of raw_onHighwayDirection, raw_onHighwayName, raw_fromAtStreetName, and county_name
There are about 150 columns in the raw data (many columns about road type and conditions, many about vehicle details, etc.); however, the vast majority of the columns are 95-100% empty.
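
The contraband data dictionary listed above can be expressed as a lookup table. The sketch below decodes a single code; it assumes each raw contraband field holds one code, possibly stored as an integer or without its leading zero, which may not match the raw files. The function name is illustrative.

```python
# Codes and labels as given in the department's data dictionary.
CONTRABAND_CODES = {
    "01": "WEAPON(S)",
    "02": "EXCESSIVE CASH",
    "03": "ILLICIT DRUG(S)/PARAPHERNALIA",
    "04": "EVIDENCE OF A CRIME",
    "05": "INTOXICANT(S)",
    "06": "STOLEN GOODS",
    "99": "OTHER",
    "00": "NONE",
}


def decode_contraband(code):
    """Return the label for a contraband code, or None if NA or unknown."""
    if code is None:
        return None
    # Pad to two digits so that 1 and "1" are both read as "01".
    return CONTRABAND_CODES.get(str(code).strip().zfill(2))
```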

Statewide, WY

2011-01-01 to 2012-12-31

feature	coverage rate
date	99.8%
time	100.0%
location	100.0%
county_name	98.7%
subject_age	99.6%
subject_race	99.7%
subject_sex	99.3%
officer_id	100.0%
officer_id_hash	100.0%
department_id	93.8%
type	100.0%
outcome	100.0%
raw_race	99.7%
raw_sex	99.7%
raw_streetnbr	97.0%
raw_street	99.5%

Data notes:

Only citations are included in the data.
The department_name field is populated by the state trooper division.
The violation field is populated by violated statute codes.
Rows represent citations, not stops, so we remove duplicates by grouping by the other fields.
contraband_found could potentially be derived from violation codes (drug/alcohol/weapons), but it would be less reliable and not necessarily comparable to how we defined contraband_found for other states, so we do not.
department_id was mapped directly from emdivision in the raw data.
violation was mapped directly from charge in the raw data.
location is a concatenation of raw_streetnbr, raw_street, and city (and note that city is actually county, and is mapped to county_name with light standardization).
Additional columns in raw data that may be of interest: statute, is_acciden.
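
One possible reading of grouping citation rows into stops by the other fields is sketched below: rows that agree on every column except the violation are treated as one stop, and their violations are collected. The function name, the dict-per-row format, and the output key violations are assumptions; this is not necessarily how the released data was produced.

```python
def citations_to_stops(citations, violation_col="violation"):
    """Group citation rows that match on every column except the violation."""
    stops = {}
    for row in citations:
        # The key is every (column, value) pair other than the violation.
        key = tuple(sorted(
            (col, val) for col, val in row.items() if col != violation_col
        ))
        if key not in stops:
            stop = {col: val for col, val in row.items() if col != violation_col}
            stop["violations"] = []
            stops[key] = stop
        stops[key]["violations"].append(row.get(violation_col))
    return list(stops.values())
```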

Changelog

December 16th, 2019:
- More stringent deduping logic
- Contraband found set to FALSE when NA and search conducted is true
- Predication correction added to metadata
- Six more cities
April 3rd, 2020:
- Pulled out raw_HA_RACE_SEX in Texas Statewide data
- Seattle, WA and South Carolina Statewide data had non-unique officer IDs, so officer_id_hash now hashes those original IDs with other personal information to create a unique hash; see officer_id_hash in table description for more information
