Analyzing NYC traffic data using q

In my previous post, I analyzed some NYC traffic data using Pandas. In this post, I would like to perform the same analysis using q. As far as graphing is concerned, I won’t be showing that. q is not really used for graphing. You can use GUIs like qPad or qStudio to chart the data on your own.

This post will help you see how one can achieve same results using different methods. It’s nice to have these kind of options. Earlier a major disadvantage for not using q used to be its cost but now you can just use the 32 bit version that is now available for free!

One thing to note is that the analysis done in this post is very straight foreword. Both Pandas and q are capable of handing much more complex analysis. The point of these posts is to show you the different tools available to you.

For my analysis, I will be using the same data set that I used for the earlier post.

Let’s begin!

First, we need to prepare our data before we can start analyzing it.

/ Load the file and define the formats for the columns especially date and time.
a:("DTSIFF****IIIIIIII*****I*****";enlist ",")0::NYPD_Motor_Vehicle_Collisions.csv

/ Replace spaces in column names with underscores
a:(\${ssr[x;" ";"_"]} each string cols a) xcol a

/ Rolf has kindly informed me that there is an existing function called .Q.id that removes spaces.
/ i.e. ZIP CODE will become ZIPCODE
/ Here is how to do it
/ .Q.id a
/ cols .Q.id a will return
/ q)cols .Q.id a
/DATETIMEBOROUGHZIPCODELATITUDELONGITUDELOCATIONONSTREETNAMECROSSSTREETNAME...

/ Delete unwanted columns
delete LATITUDE,LONGITUDE,LOCATION,UNIQUE_KEY,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME from a

With Pandas, I grouped the data by year and summed all the columns. For the sake of better presentation, I will only sum two columns this time:

q)select sum NUMBER_OF_PERSONS_INJURED,sum NUMBER_OF_PERSONS_KILLED by DATE.year from a
year| NUMBER_OF_PERSONS_INJURED NUMBER_OF_PERSONS_KILLED
----| --------------------------------------------------
2012| 27445                     137
2013| 55112                     297
2014| 51198                     262
2015| 20931                     103

Summing the data by month but only for the year 2014 gives:

q)select sum NUMBER_OF_PERSONS_INJURED,sum NUMBER_OF_PERSONS_KILLED by DATE.month from a where DATE.year = 2014
month  | NUMBER_OF_PERSONS_INJURED NUMBER_OF_PERSONS_KILLED
-------| --------------------------------------------------
2014.01| 3920                      23
2014.02| 3106                      13
2014.03| 3853                      18
2014.04| 4048                      22
2014.05| 4658                      26
2014.06| 4985                      22
2014.07| 4603                      29
2014.08| 4476                      22
2014.09| 4464                      28
2014.10| 4718                      20
2014.11| 4242                      26
2014.12| 4125                      13

Again, for the sake of presentation, I am not going to sum all the columns. Here is the data sorted in descending order by the sum of NUMBER_OF_PERSONS_KILLED and grouped by ZIP_CODE

q)NUMBER_OF_PERSONS_KILLED xdesc select sum NUMBER_OF_PERSONS_KILLED,sum NUMBER_OF_PERSONS_INJURED by ZIP_CODE from a
ZIP_CODE| NUMBER_OF_PERSONS_KILLED NUMBER_OF_PERSONS_INJURED
--------| --------------------------------------------------
| 215                      37707
11236   | 18                       1945
11434   | 12                       1448
11203   | 11                       2311
10025   | 10                       602
11206   | 10                       1432
10022   | 9                        834
11207   | 9                        3265
10029   | 8                        919
11101   | 8                        1414
11216   | 8                        1231
11223   | 8                        1034
11230   | 8                        1198
11234   | 8                        1359
11373   | 8                        900
10456   | 7                        900
10460   | 7                        891
10466   | 7                        904
10468   | 7                        1008
11208   | 7                        1724
11365   | 7                        324
11418   | 7                        431
10001   | 6                        856
10461   | 6                        819
11105   | 6                        271
..

Now, we get to graphing.

q)select sum NUMBER_OF_PERSONS_KILLED,sum NUMBER_OF_MOTORIST_KILLED by DATE.year from a
year| NUMBER_OF_PERSONS_KILLED NUMBER_OF_MOTORIST_KILLED
----| --------------------------------------------------
2012| 137                      59
2013| 297                      110
2014| 262                      109
2015| 103                      46

Here is a graph of the above data in qPad.

Grouping the data by BOROUGH for the year 2014

q)select sum NUMBER_OF_PERSONS_KILLED,sum NUMBER_OF_MOTORIST_KILLED by BOROUGH from a where DATE.year=2014
BOROUGH      | NUMBER_OF_PERSONS_KILLED NUMBER_OF_MOTORIST_KILLED
-------------| --------------------------------------------------
| 79                       50
BRONX        | 19                       8
BROOKLYN     | 67                       21
MANHATTAN    | 35                       5
QUEENS       | 52                       20
STATEN ISLAND| 10                       5

Lets look at the total NUMBER_OF_PERSONS_KILLED in Manhattan over the different years

q)select sum NUMBER_OF_PERSONS_KILLED by DATE.year from a where BOROUGH=MANHATTAN
year| NUMBER_OF_PERSONS_KILLED
----| ------------------------
2012| 21
2013| 39
2014| 35
2015| 14

Finally, we look the sum of NUMBER_OF_PERSONS_KILLED in different boroughs where the VEHICLE_TYPE_CODE_1 was “Passenger Vehicle”

q)select sum NUMBER_OF_PERSONS_KILLED by BOROUGH from a where VEHICLE_TYPE_CODE_1 like "PASSENGER VEHICLE"
BOROUGH      | NUMBER_OF_PERSONS_KILLED
-------------| ------------------------
| 92
BRONX        | 35
BROOKLYN     | 79
MANHATTAN    | 32
QUEENS       | 79
STATEN ISLAND| 13

If you are interested, you can find some more advanced qsql queries here.