Just when you think it’s safe outside. Here are some statistics relating to traffic collisions in New York City.
Background
NYC publishes vehicle collision data which anyone can access using their API. You can also download this information in standard CSV (Comma Separated Values) file format. The file is fairly large, 420 MB, with almost 2 Million lines.
-rw-rw-r-- 1 austin austin 402M Mar 4 20:38 all_motor_vehicle_collision_data.csv
…
bash > wc -l all_motor_vehicle_collision_data.csv
1972886 all_motor_vehicle_collision_data.csv
Display the first five records of the dataset using head
bash > head -n5 all_motor_vehicle_collision_data.csv
CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
09/11/2021,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,2,0,0,0,0,0,2,0,Aggressive Driving/Road Rage,Unspecified,,,,4455765,Sedan,Sedan,,,
03/26/2022,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,1,0,0,0,0,0,1,0,Pavement Slippery,,,,,4513547,Sedan,,,,
06/29/2022,6:55,,,,,,THROGS NECK BRIDGE,,,0,0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4541903,Sedan,Pick-up Truck,,,
09/11/2021,9:35,BROOKLYN,11208,40.667202,-73.8665,"(40.667202, -73.8665)",,,1211 LORING AVENUE,0,0,0,0,0,0,0,0,Unspecified,,,,,4456314,Sedan,,,,
Using head to display the first record only
bash > head -n1 all_motor_vehicle_collision_data.csv
CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
Use Perl to list the column names in numerical order
bash > perl -F, -an -E '$. == 1 && say $i++ . "\t$_" for @F' all_motor_vehicle_collision_data.csv
0 CRASH DATE
1 CRASH TIME
2 BOROUGH
3 ZIP CODE
4 LATITUDE
5 LONGITUDE
6 LOCATION
7 ON STREET NAME
8 CROSS STREET NAME
9 OFF STREET NAME
10 NUMBER OF PERSONS INJURED
11 NUMBER OF PERSONS KILLED
12 NUMBER OF PEDESTRIANS INJURED
13 NUMBER OF PEDESTRIANS KILLED
14 NUMBER OF CYCLIST INJURED
15 NUMBER OF CYCLIST KILLED
16 NUMBER OF MOTORIST INJURED
17 NUMBER OF MOTORIST KILLED
18 CONTRIBUTING FACTOR VEHICLE 1
19 CONTRIBUTING FACTOR VEHICLE 2
20 CONTRIBUTING FACTOR VEHICLE 3
21 CONTRIBUTING FACTOR VEHICLE 4
22 CONTRIBUTING FACTOR VEHICLE 5
23 COLLISION_ID
24 VEHICLE TYPE CODE 1
25 VEHICLE TYPE CODE 2
26 VEHICLE TYPE CODE 3
27 VEHICLE TYPE CODE 4
28 VEHICLE TYPE CODE 5
Explanation
'perl -an -E'
- Split up the column values into array
'@F'
- Split up the column values into array
'-F,'
- Specifies a comma field separator.
'$. == 1'
- The Perl special variable
'$.'
contains the current line number. - Display the first line only.
- The Perl special variable
'say $i++ . "\t$_" for @F'
- Prints a tab separated counter variable
'$i'
, and the corresponding column name, stored in the Perl default variable'$_'
.
- Prints a tab separated counter variable
Create a work-file containing the zip-code, injury count, and fatality count. Only get records that include a zip-code and at least one injury or fatality.
3 ZIP CODE
10 NUMBER OF PERSONS INJURED
11 NUMBER OF PERSONS KILLED
Method
- Unfortunately the previous method for splitting a comma delimited file has limitations. It cannot handle fields with embedded commas. The Street Name fields often have embedded commas which will throw off our column numbering.
- To get around this we can use Text::CSV, which has both functional and OO interfaces. For one-liners, it exports a handy csv function. From the Text::CSV documentation
'my $aoa = csv (in => "test.csv") or die Text::CSV_XS->error_diag;''
, it’ll convert the CSV file into an array of arrays. - I’ll modify this example slightly to
'csv( in => $ARGV[0], headers => qq/skip/ )'
. The @ARGV array contains any input arguments. The first element $ARGV[0] will contain the input CSV file. We don’t need the header row, so it’ll be skipped.
perl -MText::CSV=csv -E '$aofa = csv( in => $ARGV[0], headers => qq/skip/ ); ( $_->[3] =~ /^\S+$/ ) && say qq/$_->[3],$_->[10],$_->[11]/ for @{$aofa}' all_motor_vehicle_collision_data.csv | sort -t, -k 1 -r > sorted_injured_killed_by_zip.csv
Explanation
- Input file
'all_motor_vehicle_collision_data.csv'
'perl -MText::CSV=csv'
- Run the perl command with
'-M'
switch to load a Perl module.
- Run the perl command with
'Text::CSV=csv'
- Export the ‘csv’ function from the
'Text::CSV'
module.
- Export the ‘csv’ function from the
'( $_->[3] =~ /^\S+$/ )'
- Use a Regular expression to only process rows that have non-blank data in the ZIP CODE field.
'say qq/$_->[3],$_->[10],$_->[11]/ for @{$aofa}'
- Loop through the Array of Arrays
'$aofa'
- Print the contents of columns 3,10,11 followed by a line break.
- Loop through the Array of Arrays
- The output is piped
'|'
into the Linux sort command.- Sorting on the first field, ZIP CODE and redirecting,
'>'
into a new file,'sorted_injured_killed_by_zip.csv'
. - See the ss64.com site for more details on the Linux sort command.
- Sorting on the first field, ZIP CODE and redirecting,
- The new file has about 1.36 Million lines.
Get a line count with wc. Display the first 10 records using head
bash > wc -l sorted_injured_killed_by_zip.csv
1359291 sorted_injured_killed_by_zip.csv
bash > head -n10 sorted_injured_killed_by_zip.csv | column -t -s, --table-columns=ZipCode,#Injured,#Killed
ZipCode #Injured #Killed
11697 4 0
11697 3 0
11697 2 0
11697 2 0
11697 2 0
11697 1 0
11697 1 0
11697 1 0
11697 1 0
11697 1 0
Explanation
'wc -l'
- Counts the number of lines in our new file
'head -n 10'
- Prints out the first 10 lines of the file
'column -t -s, --table-columns=ZipCode,#Injured,#Killed'
- column
'-t'
switch will tell'column'
to print in table format.'-s'
switch specifies an input delimiter of ‘,’.- The output is tabbed.
List the 10 worst zip codes for injuries
We can use the output file, sorted_injured_killed_by_zip.csv, from the previous example,
perl -n -E '@a=split(q/,/,$_);$h{$a[0]} += $a[1]; END{say qq/$_,$h{$_}/ for keys %h}' sorted_injured_killed_by_zip.csv | sort -nr -t, -k 2 | head -n10 | column -t -s, --table-columns=ZipCode,#Injured
ZipCode #Injured
11207 10089
11236 7472
11203 7426
11212 6676
11226 6103
11208 6027
11234 5505
11434 5403
11233 5159
11385 4440
Explanation
'@a=split(q/,/,$_);'
- As there are no embedded commas in this file we use the Perl ‘split’ function to break up the 3 CSV fields in each row into array
'@a'
.
- As there are no embedded commas in this file we use the Perl ‘split’ function to break up the 3 CSV fields in each row into array
'$h{$a[0]} += $a[1];'
- The first element of each row, ZIP CODE is used as a key for Hash’
%h'
. - The value is the accumulated number of injuries for that ZIP CODE.
- The first element of each row, ZIP CODE is used as a key for Hash’
'$h{$a[0]} += $a[1]'
- We accumulate the second element, $[1], which contains
'NUMBER OF PERSONS INJURED'
- We can set a value for a Hash key without checking if it exists already.
- This is called Autovivification which is explained nicely by The Perl Maven.
- We accumulate the second element, $[1], which contains
'END{say qq/$_,$h{$_}/ for keys %h}'
- The
'END{}
‘block runs after all the rows are processed. - The keys(Zip Codes) are read and printed along with their corresponding values.
- We could have used Perl to sort the output by the keys, or values.
- I used the Linux sort.
- The
'sort -nr -t, -k 2'
- Will perform a numeric sort, descending on the # of people injured.
'head -n10'
- Will get the first 10 records printed.
'column -t -s, --table-columns=ZipCode,#Injured'
The
‘columns’` command will produce a prettier output.'-t'
for table format.'-s'
to specify that the fields are comma separated'--table-columns''
to add column header names.
Observation
Zip code 11207, which encompasses East New York, Brooklyn, as well as a small portion of Southern Queens, has a lot of issues with traffic safety.
Display the 10 worst zip codes for traffic fatalities
bash > perl -n -E '@a=split(q/,/,$_);$h{$a[0]} += $a[2]; END{say qq/$_,$h{$_}/ for keys %h}' sorted_injured_killed_by_zip.csv | sort -nr -t, -k 2 | head -n10 | column -t -s, --table-columns=ZipCode,#Killed
ZipCode #Killed
11236 44
11207 34
11234 29
11434 25
11354 25
11229 24
11208 24
11206 23
11233 22
11235 21
Explanation
- With a few minor adjustments, we got the worst zip codes for traffic collision fatalities
'$h{$a[0]} += $a[2]'
- Accumulate the third element, $[2], which contains
'NUMBER OF PERSONS KILLED'
- Accumulate the third element, $[2], which contains
Observation
- Zip code 11236, which includes Canarsie Brooklyn is the worst for traffic fatalities according to this data.
- Zip code 11207 is also very bad for traffic fatalities, as well as being the worst for collision injuries
- These stats are not 100 percent correct, as out of 1,972,886 collision records, 1,359,291 contained Zip codes.
- We have 613,595 records with no zip code, which were not included in the calculations.
Some NYC Borough Stats
Similar to how we created the 'sorted_injured_killed_by_zip.csv'
, we can run the following command sequence to create a new file 'sorted_injured_killed_by_borough.csv'
perl -MText::CSV=csv -E '$aofa = csv( in => $ARGV[0], headers => qq/skip/ ) ; ( $_->[2] =~ /^\S+/ ) && say qq/$_->[2],$_->[10],$_->[11]/ for @{$aofa}' all_motor_vehicle_collision_data.csv | sort -t, -k 3rn -k 2rn -k 1 >| sorted_injured_killed_by_borough.csv
Explanation
- The Borough field is the third column,
'2 BOROUGH'
, starting from 0, in the'all_motor_vehicle_collision_data.csv'
file. '( $_->[2] =~ /^\S+/ )'
- Only get rows which have non blank data in the BOROUGH field.
'sort -t, -k 3rn -k 2rn -k 1'
- I added some more precise sorting, which is unnecessary except to satisfy my curiosity.
'-k 3rn
”- Sort by column 3(starting @ 1), which is the fatality count field.
- This is sorted numerically in descending order.
'-k 2rn
”- When equal, the injury count is also sorted numerically, descending.
'-k 1'
- The Borough is sorted in ascending order as a tiebreaker.
Display the first 10 rows of this file.
bash > head -n10 sorted_injured_killed_by_borough.csv | column -t -s, --table-columns=Borough,#Injured,#Killed
Borough #Injured #Killed
MANHATTAN 12 8
QUEENS 3 5
QUEENS 15 4
QUEENS 1 4
STATEN ISLAND 6 3
BROOKLYN 4 3
BROOKLYN 3 3
QUEENS 3 3
BROOKLYN 1 3
QUEENS 1 3
Sanity check if we got all five boroughs
cut -d, -f 1 sorted_injured_killed_by_borough.csv | sort -u
BRONX
BROOKLYN
MANHATTAN
QUEENS
STATEN ISLAND
Explanation
'cut -d, -f 1'
- cut to split the comma delimited file records.
'-d,'
- Specifies that the cut will comma delimited
'-f 1'
- Get the first field from the
cut
, which is the Borough Name.
- Get the first field from the
'sort -u'
- Sorts and prints only the unique values to STDOUT
- We got all 5 New York City boroughs in this file.
Display collision injuries for each borough
bash > perl -n -E '@a=split(q/,/,$_);$h{$a[0]} += $a[1]; END{say qq/$_,$h{$_}/ for keys %h}' sorted_injured_killed_by_borough.csv | sort -nr -t, -k 2 | column -t -s,
BROOKLYN 137042
QUEENS 105045
BRONX 62880
MANHATTAN 61400
STATEN ISLAND 15659
Observation
- Brooklyn emerges as the Borough with the most traffic injuries.
Display collision fatalities by Borough
bash > perl -n -E '@a=split(q/,/,$_);$h{$a[0]} += $a[2]; END{say qq/$_,$h{$_}/ for keys %h}' sorted_injured_killed_by_borough.csv | sort -nr -t, -k 2 | column -J -s, --table-columns Borough,#Killed
{
"table": [
{
"borough": "BROOKLYN",
"#killed": "564"
},{
"borough": "QUEENS",
"#killed": "482"
},{
"borough": "MANHATTAN",
"#killed": "300"
},{
"borough": "BRONX",
"#killed": "241"
},{
"borough": "STATEN ISLAND",
"#killed": "88"
}
]
}
Explanation
- Similar to the Injury count by Borough, this counts all fatalities by borough and prints the output in JSON format.
'column -J -s, --table-columns Borough,#Killed'
Use the'column'
command with the'-J'
switch, for JSON, instead of'-t'
for Table.
I forgot to mention what date range is involved with this dataset. We can check this with the cut command.
cut -d, -f1 all_motor_vehicle_collision_data.csv | cut -d/ -f3 | sort -u
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
Explanation
- Get the date field,
'0 CRASH DATE'
which is in'mm/dd/yyyy'
format. 'cut -d, -f '
'all_motor_vehicle_collision_data.csv'
- Get` the first column/field of data for every row of this CSV file.
'-d,'
specifies that we are cutting on the comma delimiters.'-f 1'
specifies that we want the first column/field only- This is the date in
'mm/dd/yyyy'
format.
- This is the date in
'cut -d/ -f3'
- Will cut the date using
'/
’ as the delimiter. - Grab the third field from this, which is the four digit year.
- Will cut the date using
'sort -u'
- The years are then sorted with duplicates removed.
Observation
- The dataset started sometime in 2012 and continues until now, March 2023.
Display the 20 worst days for collisions in NYC
bash > cut -d, -f1 all_motor_vehicle_collision_data.csv | awk -F '/' '{print $3 "-" $1 "-" $2}' | sort | uniq -c | sort -k 1nr | head -n20 | column -t --table-columns=#Collisions,Date
#Collisions Date
1161 2014-01-21
1065 2018-11-15
999 2017-12-15
974 2017-05-19
961 2015-01-18
960 2014-02-03
939 2015-03-06
911 2017-05-18
896 2017-01-07
884 2018-03-02
883 2017-12-14
872 2016-09-30
867 2013-11-26
867 2018-11-09
857 2017-04-28
851 2013-03-08
851 2016-10-21
845 2017-06-22
845 2018-06-29
841 2018-12-14
Explanation
- Get a count for all collisions for each date on record
- Display the first 20 with the highest collision count
- cut
- Get the first column from the dataset.
- Pipe this date into the awk command.
- AWK is a very useful one-liner tool as well as being a full scripting language.
'awk -F '/' '{print $3 "-" $1 "-" $2}'
'-F '/' '
- Split the date into separate fields using the
’/’
as a delimiter. - $1 contains the month value, $2 contains the day of month and $3 contains the four digit year value.
- These will be printed in the format
'”yyyy-mm-dd”
.
- Split the date into separate fields using the
- Dates are then sorted and piped into the uniq command.
'uniq -c'
- Will create a unique output.
'-c'
switch gets a count of all the occurrences for each value.
- The output is piped into another sort command, which sorts by the number of occurrences descending.
Observation
- I’m not sure if there is any explanation for why some days have a lot more collisions than others. Weatherwise, January 21 2014 was a cold day, but otherwise uneventful. November 15 2018 had some snow, but not a horrific snowfall. The clocks went back on November 4, so that wouldn’t be a factor.
- 2014-01-21 weather
- 2018-11-15 weather
Display the twenty worst times during the day for collisions
bash > cut -d, -f2 all_motor_vehicle_collision_data.csv | sort | uniq -c | sort -k 1nr | head -n20 |column -t --table-columns=#Collisions,Time
#Collisions Time
27506 16:00
26940 17:00
26879 15:00
24928 18:00
24667 14:00
22914 13:00
20687 9:00
20641 12:00
20636 19:00
19865 16:30
19264 8:00
19107 10:00
19106 14:30
19010 0:00
18691 11:00
18688 17:30
16646 18:30
16602 20:00
16144 8:30
16008 13:30
Explanation
- We use the time field,
'1 CRASH TIME'
, which is in 24 hour format, ‘HH:MM’
Observation
- Using the ‘actual time’ may be a bit too specific. I’ll use the ‘hour of day’ instead to give a 60 minute time frame.
bash > cut -d, -f2 all_motor_vehicle_collision_data.csv | cut -d : -f1 | sort | uniq -c | sort -k 1nr | head -n10 | column -t --table-columns=#Collisions,Hour
#Collisions Hour
143012 16
139818 17
132443 14
123761 15
122971 18
114555 13
108925 12
108593 8
105206 9
102541 11
Explanation
- Similar to the previous example, except this time the cut command is used to split the time HH:MM, delimited by
':'
'cut -d : -f 1'
'-d'
- The ‘cut’ delimiter is ‘:’
'-f 1'
- Grab the first field, ‘HH’ of the ‘HH:MM’.
- Use something like the printf command to append
':00'
to those hours.
Observation
As you would expect, most collisions happen during rush hour.
Display the worst years for collisions
bash > cut -d, -f1 all_motor_vehicle_collision_data.csv | cut -d '/' -f3 | sort | uniq -c | sort -k 1nr | head -n10 | column -t --table-columns=#Collisions,Year
#Collisions Year
231564 2018
231007 2017
229831 2016
217694 2015
211486 2019
206033 2014
203734 2013
112915 2020
110546 2021
103745 2022
Explanation
- We use the first column,
'0 CRASH DATE'
again 'cut -d '/' -f3'
- Extracts the
'yyyy'
from the'mm/dd/yyyy'
- Extracts the
Observation
- Some improvement seen in 2020, 2021 and 2022, if you can believe the data.
- One unscientific observation here is that the NYPD may have been much less vigilant in the past few years than they were prior to Commissioner Raymond Kelly and his successors.
- Also, by only printing out the worst 10 years, partial years 2012 and 2023 were excluded.
For a comparison. See how many people were injured or killed in traffic collisions.
First create a work file, 'sorted_injured_killed_by_year.csv'
, with three columns, Year, Injured count and Fatality count
We need the Text::CSV Perl module here due to those embedded commas in earlier fields. Below are the three fields needed.
0 CRASH DATE
10 NUMBER OF PERSONS INJURED
11 NUMBER OF PERSONS KILLED
bash > perl -MText::CSV=csv -E '$aofa = csv( in => $ARGV[0], headers => qq/skip/ ); ($_->[10] || $_->[11]) && say substr($_->[0],6,4) . qq/,$_->[10],$_->[11]/ for @{$aofa}' all_motor_vehicle_collision_data.csv | sort > sorted_injured_killed_by_year.csv
Check out the work file we just created
bash > wc -l sorted_injured_killed_by_year.csv
433081 sorted_injured_killed_by_year.csv
bash > head -n10 sorted_injured_killed_by_year.csv | column -t -s, --table-columns=Year,#Injured,#Killed
Year #Injured #Killed
2012 0 1
2012 0 1
2012 0 1
2012 0 1
2012 0 1
2012 0 1
2012 0 1
2012 0 1
2012 0 1
2012 0 1
Worst years for collision injuries
perl -n -E '@a=split(q/,/,$_);$h{$a[0]} += $a[1]; END{say qq/$_, $h{$_}/ for sort {$h{$b} <=> $h{$a} } keys %h}' sorted_injured_killed_by_year.csv | head -n10 | column -t -s, --table-columns=Year,#Injured
Year #Injured
2018 61941
2019 61389
2017 60656
2016 60317
2013 55124
2022 51883
2021 51780
2015 51358
2014 51223
2020 44615
Explanation
- This is similar to how we got the Zip Code and Borough data previously.
- This time the Perl sort is used instead of the Linux sort.
'END{say qq/$_, $h{$_}/ for sort {$h{$b} <=> $h{$a} } keys %h}'
'for'
statement loops through the'%h'
hash keys(years).- The corresponding Hash values(Injured count), are sorted in descending order.
'sort {$h{$b} <=> $h{$a} }'
.- $a and $b are default Perl sort variables.
- Rearranged it to
'sort {$h{$a} <=> $h{$b} }'
, to sort the injury count in ascending order.
Observation
While the collision count may have gone down, there isn’t any real corresponding downward trend in injuries.
The worst years for collision fatalities.
bash > perl -n -E '@a=split(q/,/,$_);$h{$a[0]} += $a[2]; END{say qq/$_, $h{$_}/ for sort {$h{$b} <=> $h{$a} } keys %h}' sorted_injured_killed_by_year.csv | head -n10 | column -t -s, --table-columns=Year,#Killed
Year #Killed
2013 297
2021 294
2022 285
2020 268
2014 262
2017 256
2016 246
2019 244
2015 243
2018 231
Explanation
- Slightly modified version of the injury by year count.
Observation
Same as with the injuries count. There isn’t any real corresponding downward trend in traffic collision fatalities.
Conclusion
There’s lots more work that can be done to extract meaningful information from this dataset.
What’s clear to me, is that all the political rhetoric and money poured into Vision Zero has yielded little in terms of results.
Most of the solutions are obvious from a logical point of view, but not a political point of view. I walk and cycle these streets and know how dangerous it is to cross at the “designated” crosswalks when cars and trucks are turning in on top of you. Cycling in NYC is even worse.
Some sugggested solutions
- Delayed green lights co cars cars don’t turn in on pedestrians at crosswalks.
- Much higher tax and registraton fees for giant SUV’s and pickup trucks. The don’t belong in the city.
- Better bike lanes, instead of meaningless lines painted on the road.
- Many bike lanes are used as convenient parking for NYPD and delivery vehicles.
- Basic enforcement of traffic laws, which isn’t being done now.
- Drivers ignore red lights, speed limits, noise restrictions etc. when they know they aren’t being enforced.
- Driving while texting or yapping on the phone is the norm, not the exception.
- Drastically improve public transit, especially in areas not served by the subway system.
Some Perl CLI Resources
Peteris Krumins has some great e-books
Dave Cross – From one of his older posts on perl.com