Analyzing NYC traffic data using Pandas

I wrote a post earlier about how to analyze data using Pandas. In that post, I introduced some simple Pandas functions on randomly generated data. In this post, I am using real-world traffic data from NYC. NYC has recently made a lot of data available to the public on this website. There is plenty of quality data for you to play around with. I chose to look at the traffic data, which you can download from here.

I performed the analysis using an IPython Notebook, which I have embedded below.

In [226]:
import pandas as pd
import numpy as np

# Show matplotlib plots inline in the notebook
%matplotlib inline

pd.set_option('display.max_rows', 12)
In [227]:
# Load the csv directly into a DataFrame.
# The parse_dates parameter combines the DATE and TIME columns into a single
# datetime column called DATE_TIME.
df = pd.read_csv(r"c:\NYPD_Motor_Vehicle_Collisions.csv", parse_dates=[['DATE', 'TIME']])

# Remove spaces from column names
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, str) else x)
df.columns = cols
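
To confirm that parse_dates did its job, a quick check like the one below (a sketch, not a cell from the original notebook) should show DATE_TIME as a datetime64 column and give you the date range of the data.

# Sketch: verify the combined DATE_TIME column parsed correctly
print(df.dtypes.head())
print(df.DATE_TIME.min(), df.DATE_TIME.max())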
In [228]:
# Delete the columns we don't care about
df = df.drop(['LATITUDE', 'LONGITUDE', 'LOCATION', 'UNIQUE_KEY','ON_STREET_NAME','CROSS_STREET_NAME','OFF_STREET_NAME'], axis=1)
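
Alternatively, if you know up front which columns you need, you can skip loading the unwanted ones entirely with usecols. A sketch, assuming the raw file uses the space-separated versions of the column names above:

# Sketch: load only the columns we actually analyze, instead of dropping afterwards
wanted = ['DATE', 'TIME', 'BOROUGH', 'ZIP CODE', 'VEHICLE TYPE CODE 1',
          'NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED',
          'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED',
          'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED',
          'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED']
df = pd.read_csv(r"c:\NYPD_Motor_Vehicle_Collisions.csv",
                 parse_dates=[['DATE', 'TIME']], usecols=wanted)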
In [229]:
# Let's do some aggregation
# Find totals by year
df.groupby(df.DATE_TIME.dt.year).sum()
Out[229]:
ZIP_CODE NUMBER_OF_PERSONS_INJURED NUMBER_OF_PERSONS_KILLED NUMBER_OF_PEDESTRIANS_INJURED NUMBER_OF_PEDESTRIANS_KILLED NUMBER_OF_CYCLIST_INJURED NUMBER_OF_CYCLIST_KILLED NUMBER_OF_MOTORIST_INJURED NUMBER_OF_MOTORIST_KILLED
DATE_TIME
2012 836355612 27445 137 5907 72 2210 6 19328 59
2013 1682824973 55112 297 11984 176 4074 11 39054 110
2014 1688700297 51198 262 11027 133 3997 20 36171 109
2015 759746044 20931 103 4284 53 1319 4 15328 46
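
Two things worth noting about this output: the ZIP_CODE column is a meaningless sum of zip codes, and 2012 and 2015 look low most likely because the dataset covers only part of those years. A small variation (a sketch, not a cell from the original notebook) drops ZIP_CODE so every summed column is meaningful:

# Sketch: exclude ZIP_CODE before aggregating
df.drop('ZIP_CODE', axis=1).groupby(df.DATE_TIME.dt.year).sum()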
In [230]:
# Now we will look at data for a specific year
# Find total number of persons injured and killed by month in 2014
df[df.DATE_TIME.dt.year == 2014].groupby(df.DATE_TIME.dt.month).sum()[['NUMBER_OF_PERSONS_INJURED','NUMBER_OF_PERSONS_KILLED']]
Out[230]:
NUMBER_OF_PERSONS_INJURED NUMBER_OF_PERSONS_KILLED
DATE_TIME
1 3920 23
2 3106 13
3 3853 18
4 4048 22
5 4658 26
6 4985 22
7 4603 29
8 4476 22
9 4464 28
10 4718 20
11 4242 26
12 4125 13
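
The same monthly rollup can also be written with resample, which reads naturally once DATE_TIME is the index. A sketch, not a cell from the original notebook:

# Sketch: monthly injury/death totals for 2014 via resample
df2014 = df[df.DATE_TIME.dt.year == 2014]
df2014.set_index('DATE_TIME').resample('M').sum()[['NUMBER_OF_PERSONS_INJURED', 'NUMBER_OF_PERSONS_KILLED']]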
In [231]:
# Sort the data in descending order on NUMBER_OF_PERSONS_KILLED by zip code
df.groupby(df.ZIP_CODE).sum().sort_values('NUMBER_OF_PERSONS_KILLED', ascending=False)

# Looks like Zip Code 11236 is the most dangerous
Out[231]:
ZIP_CODE NUMBER_OF_PERSONS_INJURED NUMBER_OF_PERSONS_KILLED NUMBER_OF_PEDESTRIANS_INJURED NUMBER_OF_PEDESTRIANS_KILLED NUMBER_OF_CYCLIST_INJURED NUMBER_OF_CYCLIST_KILLED NUMBER_OF_MOTORIST_INJURED NUMBER_OF_MOTORIST_KILLED
ZIP_CODE
11236 56123820 1945 18 304 7 67 0 1574 11
11434 52722174 1448 12 180 7 43 0 1225 5
11203 61818154 2311 11 404 4 123 0 1784 7
10025 30445925 602 10 227 9 91 1 284 0
11206 49127104 1432 10 363 6 255 1 814 3
10022 66987048 834 9 351 9 125 0 358 0
... ... ... ... ... ... ... ... ... ...
10307 4225870 90 0 11 0 0 0 79 0
10282 400998 6 0 6 0 0 0 0 0
10281 236463 3 0 1 0 2 0 0 0
10280 894360 14 0 3 0 6 0 5 0
10111 111221 0 0 0 0 0 0 0 0
11697 713517 19 0 1 0 6 0 12 0

190 rows × 9 columns
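
If you only want the top few zip codes, nlargest is a handy shortcut that avoids sorting the whole table. A sketch:

# Sketch: top 5 zip codes by total persons killed
df.groupby(df.ZIP_CODE).sum().nlargest(5, 'NUMBER_OF_PERSONS_KILLED')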

In [232]:
# Graph number of pedestrians/motorist killed by year
df.groupby(df.DATE_TIME.dt.year).sum()[['NUMBER_OF_PEDESTRIANS_KILLED','NUMBER_OF_MOTORIST_KILLED']].plot(kind='bar')

# 2013 was the worst year for pedestrian deaths. 
Out[232]:
<matplotlib.axes._subplots.AxesSubplot at 0x9fa47160>
In [233]:
# Let's see how the deaths varied by borough
df.groupby(df.BOROUGH).sum()[['NUMBER_OF_PEDESTRIANS_KILLED','NUMBER_OF_MOTORIST_KILLED']].plot(kind='bar')

# As we can see, more pedestrians died in Brooklyn than in any other borough.
# Staten Island had the fewest deaths for both pedestrians and motorists. Surprised?
Out[233]:
<matplotlib.axes._subplots.AxesSubplot at 0x94532940>
In [240]:
# How did the deaths of pedestrians vary by year in Manhattan?
df[df.BOROUGH == 'MANHATTAN'].groupby(df.DATE_TIME.dt.year).sum()[['NUMBER_OF_PEDESTRIANS_KILLED']].plot(kind='bar')
Out[240]:
<matplotlib.axes._subplots.AxesSubplot at 0x941e0668>
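
To see the borough-by-year breakdown in one table instead of filtering one borough at a time, a pivot table works nicely. A sketch, not a cell from the original notebook:

# Sketch: pedestrian deaths by year (rows) and borough (columns)
pd.pivot_table(df, values='NUMBER_OF_PEDESTRIANS_KILLED',
               index=df.DATE_TIME.dt.year, columns='BOROUGH', aggfunc='sum')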
In [237]:
# Distribution of deaths by borough for collisions where VEHICLE_TYPE_CODE_1 is 'PASSENGER VEHICLE'
df[df.VEHICLE_TYPE_CODE_1 == 'PASSENGER VEHICLE'].groupby(df.BOROUGH).sum()[['NUMBER_OF_PERSONS_KILLED']].plot(kind='bar')
Out[237]:
<matplotlib.axes._subplots.AxesSubplot at 0x86b6f2b0>
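
If you are curious what other values VEHICLE_TYPE_CODE_1 takes before filtering on one of them, value_counts is the quickest way to find out. A sketch:

# Sketch: see which vehicle type codes appear most often
df.VEHICLE_TYPE_CODE_1.value_counts().head(10)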

I will try to do a similar analysis with q at some point in the future (minus the graphs, of course).

Join the Conversation

2 Comments

  1. I need help, please. I tried to run this code in my Jupyter notebook with Python 3 but it did not work. Please help.

    1. Hi Adam – In order for me to help you, I first need to know what part of the code didn’t work and what error you are getting.
