Studying crime in San Francisco using Maps
We will be looking at the crime data in the city of San Francisco.
• 5 min read
In this blog, we will be looking at crime data for the city of San Francisco. The dataset contains all reported crimes in San Francisco from 2018 to 2020. You can download the data here. Since the dataset is very large (more than 330,000 incidents), we will work with only a small subset of the data for this post.
Note: To view the interactive maps, open the notebook in Google Colab.
import numpy as np
import pandas as pd
import folium
df = pd.read_csv('SF_Crime_data.csv')
df.head()
Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Report Datetime | Row ID | Incident ID | Incident Number | CAD Number | ... | Current Supervisor Districts | Analysis Neighborhoods | HSOC Zones as of 2018-06-05 | OWED Public Spaces | Central Market/Tenderloin Boundary Polygon - Updated | Parks Alliance CPSI (27+TL sites) | ESNCAG - Boundary File | Areas of Vulnerability, 2016 | Unnamed: 36 | Unnamed: 37 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/3/2020 14:45 | 2/3/2020 | 14:45 | 2020 | Monday | 2/3/2020 17:50 | 89881675000 | 898816 | 200085557 | 200342870.0 | ... | 8.0 | 16.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
1 | 2/3/2020 3:45 | 2/3/2020 | 3:45 | 2020 | Monday | 2/3/2020 3:45 | 89860711012 | 898607 | 200083749 | 200340316.0 | ... | 2.0 | 20.0 | 3.0 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
2 | 2/3/2020 10:00 | 2/3/2020 | 10:00 | 2020 | Monday | 2/3/2020 10:06 | 89867264015 | 898672 | 200084060 | 200340808.0 | ... | 3.0 | 8.0 | NaN | 35.0 | NaN | NaN | NaN | 2.0 | NaN | NaN |
3 | 1/19/2020 17:12 | 1/19/2020 | 17:12 | 2020 | Sunday | 2/1/2020 13:01 | 89863571000 | 898635 | 206024187 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1/5/2020 0:00 | 1/5/2020 | 0:00 | 2020 | Sunday | 2/3/2020 16:09 | 89877368020 | 898773 | 200085193 | 200342341.0 | ... | 6.0 | 30.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
5 rows × 38 columns
df.shape
(330054, 38)
More than 330,000 crimes were reported in San Francisco between 2018 and 2020.
df.columns
Index(['Incident Datetime', 'Incident Date', 'Incident Time', 'Incident Year', 'Incident Day of Week', 'Report Datetime', 'Row ID', 'Incident ID', 'Incident Number', 'CAD Number', 'Report Type Code', 'Report Type Description', 'Filed Online', 'Incident Code', 'Incident Category', 'Incident Subcategory', 'Incident Description', 'Resolution', 'Intersection', 'CNN', 'Police District', 'Analysis Neighborhood', 'Supervisor District', 'Latitude', 'Longitude', 'point', 'SF Find Neighborhoods', 'Current Police Districts', 'Current Supervisor Districts', 'Analysis Neighborhoods', 'HSOC Zones as of 2018-06-05', 'OWED Public Spaces', 'Central Market/Tenderloin Boundary Polygon - Updated', 'Parks Alliance CPSI (27+TL sites)', 'ESNCAG - Boundary File', 'Areas of Vulnerability, 2016', 'Unnamed: 36', 'Unnamed: 37'], dtype='object')
We do not need all of these columns, so we will keep only the ones relevant to our analysis.
df = df[['Incident Datetime', 'Incident Day of Week', 'Incident Number', 'Incident Category', 'Incident Description',
'Police District', 'Analysis Neighborhood', 'Resolution', 'Latitude', 'Longitude', 'point']]
df.head()
Incident Datetime | Incident Day of Week | Incident Number | Incident Category | Incident Description | Police District | Analysis Neighborhood | Resolution | Latitude | Longitude | point | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/3/2020 14:45 | Monday | 200085557 | Missing Person | Found Person | Taraval | Lakeshore | Open or Active | 37.726950 | -122.476039 | (37.72694991292525, -122.47603947349434) |
1 | 2/3/2020 3:45 | Monday | 200083749 | Stolen Property | Stolen Property, Possession with Knowledge, Re... | Mission | Mission | Cite or Arrest Adult | 37.752440 | -122.415172 | (37.752439644389675, -122.41517229045435) |
2 | 2/3/2020 10:00 | Monday | 200084060 | Non-Criminal | Aided Case, Injured or Sick Person | Tenderloin | Financial District/South Beach | Open or Active | 37.784560 | -122.407337 | (37.784560141211806, -122.40733704162238) |
3 | 1/19/2020 17:12 | Sunday | 206024187 | Lost Property | Lost Property | Taraval | NaN | Open or Active | NaN | NaN | NaN |
4 | 1/5/2020 0:00 | Sunday | 200085193 | Miscellaneous Investigation | Miscellaneous Investigation | Richmond | Pacific Heights | Open or Active | 37.787112 | -122.440250 | (37.78711245591735, -122.44024995765258) |
Now, each row consists of the following 11 features:
- Incident Datetime: The date and time when the incident occurred
- Incident Day of Week: The day of week on which the incident occurred
- Incident Number: The incident or crime number
- Incident Category: The category of the incident or crime
- Incident Description: The description of the incident or crime
- Police District: The police district in which the incident was reported
- Resolution: The resolution of the crime, i.e., whether the perpetrator was arrested
- Analysis Neighborhood: The neighborhood where the incident took place
- Latitude: The latitude of the crime location
- Longitude: The longitude of the crime location
- point: A tuple of the latitude and longitude values
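For time-based analysis (not covered in this post), the Incident Datetime strings can be parsed into proper timestamps with pandas. A minimal sketch on a hypothetical two-row sample mirroring the dataset's format:

```python
import pandas as pd

# hypothetical sample rows in the same format as the dataset
sample = pd.DataFrame({'Incident Datetime': ['2/3/2020 14:45', '1/19/2020 17:12']})

# parse the string timestamps so hours, months, etc. can be extracted
sample['Incident Datetime'] = pd.to_datetime(sample['Incident Datetime'],
                                             format='%m/%d/%Y %H:%M')
print(sample['Incident Datetime'].dt.hour.tolist())  # hours of the incidents
```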
Let's drop the rows with missing Latitude and Longitude values, as they would cause an error when plotting the map.
df.dropna(subset=['Latitude', 'Longitude'], inplace=True)
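To see how many rows the drop actually removes, you can count the missing coordinates first. A sketch on a tiny hypothetical frame (the real dataframe would be used in the post's workflow):

```python
import numpy as np
import pandas as pd

# hypothetical frame with one row missing its coordinates
demo = pd.DataFrame({'Latitude': [37.7, np.nan, 37.8],
                     'Longitude': [-122.4, np.nan, -122.5]})

print(demo[['Latitude', 'Longitude']].isna().sum())  # missing values per column
demo = demo.dropna(subset=['Latitude', 'Longitude'])
print(demo.shape)  # only rows with valid coordinates remain
```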
Rename the Incident Category column for the sake of simplicity.
df.rename(columns={'Incident Category':'Category'}, inplace=True)
df.head()
Incident Datetime | Incident Day of Week | Incident Number | Category | Incident Description | Police District | Analysis Neighborhood | Resolution | Latitude | Longitude | point | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/3/2020 14:45 | Monday | 200085557 | Missing Person | Found Person | Taraval | Lakeshore | Open or Active | 37.726950 | -122.476039 | (37.72694991292525, -122.47603947349434) |
1 | 2/3/2020 3:45 | Monday | 200083749 | Stolen Property | Stolen Property, Possession with Knowledge, Re... | Mission | Mission | Cite or Arrest Adult | 37.752440 | -122.415172 | (37.752439644389675, -122.41517229045435) |
2 | 2/3/2020 10:00 | Monday | 200084060 | Non-Criminal | Aided Case, Injured or Sick Person | Tenderloin | Financial District/South Beach | Open or Active | 37.784560 | -122.407337 | (37.784560141211806, -122.40733704162238) |
4 | 1/5/2020 0:00 | Sunday | 200085193 | Miscellaneous Investigation | Miscellaneous Investigation | Richmond | Pacific Heights | Open or Active | 37.787112 | -122.440250 | (37.78711245591735, -122.44024995765258) |
5 | 2/3/2020 8:36 | Monday | 200083909 | Miscellaneous Investigation | Miscellaneous Investigation | Central | Financial District/South Beach | Open or Active | 37.796926 | -122.399507 | (37.796926429317054, -122.39950750040278) |
# work with only the first 100 incidents to keep the maps responsive
limit = 100
df = df.iloc[0:limit, :]
df.shape
(100, 11)
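Note that iloc simply takes the first 100 rows, which may bias the subset toward the most recent incidents in the file. If you prefer an unbiased subset, DataFrame.sample draws a reproducible random sample instead. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# hypothetical frame standing in for the full crime dataframe
demo = pd.DataFrame({'Incident Number': range(1000)})

# random_state makes the random draw reproducible across runs
subset = demo.sample(n=100, random_state=42)
print(subset.shape)  # (100, 1)
```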
# approximate coordinates of the center of San Francisco
latitude = 37.7749
longitude = -122.4194
sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)
sanfran_map
Let's cluster the incident markers by neighborhood. The number of crimes in each cluster is shown on the cluster circle. In a Jupyter notebook, you can interact with the map: click on a cluster to zoom in, and click on a marker to see the category of the crime.
from folium import plugins
# start again with a clean copy of the map of San Francisco
sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)
# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(sanfran_map)
# loop over the dataframe and add each incident to the marker cluster
for lat, lng, label in zip(df.Latitude, df.Longitude, df.Category):
    folium.Marker(
        location=[lat, lng],
        popup=label,
    ).add_to(incidents)
# display map
sanfran_map