This week, we continued working on scraping our selected data sources, with the goal of scraping Spot Crime's data. Spot Crime's site is structured differently from Great Schools, so a different approach is needed to get everything we need. What made this a bit more difficult was how they linked to the specific crime maps we needed. Writing the script to obtain all of the links to the state pages was simple, because the URL structure for each state is consistent: spotcrime.com/<state abbreviation>. The difficulty lies in each of the state pages. Each state page consists of 3 tables of links:
After studying the structure of the site, we figured that our best approach would be to pull the first <a> tag after every <td> tag. We should be able to do this because each state page uses the same class name for its tables, so knowing that will allow us to grab all of the tables from each state page. Once we do that, we can use Python together with the Beautiful Soup library to locate the links for each crime map.
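To make the idea concrete, here is a minimal sketch of that approach; the state URL and the table class name below are placeholders I made up, not values verified against the live site:

import requests
from bs4 import BeautifulSoup

STATE_URL = "https://spotcrime.com/wi"   # placeholder state page
TABLE_CLASS = "state-table"              # assumed shared class name on the tables

soup = BeautifulSoup(requests.get(STATE_URL).text, "html.parser")

crime_map_links = []
for table in soup.find_all("table", class_=TABLE_CLASS):
    for td in table.find_all("td"):
        a = td.find("a")                 # first <a> tag within the <td>
        if a and a.get("href"):
            crime_map_links.append(a["href"])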
Each county's crime map page contains a list of crimes and provides the type, date, and location. When you select a crime instance, you are brought to a detailed description of the crime that lists the time, case number (sometimes), and a summary of the crime. Saul wanted to take on the task of pulling the crime maps and data.
For next week, I plan to figure out a way to gather the lat/long of all of our addresses, or see if there is another approach we can take with only addresses. I would also like to move on to scraping our next data source.
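One option I am considering is a geocoding library. Below is a rough sketch using geopy's Nominatim geocoder; this is just an idea I am exploring, not code we have adopted, and the sample address is made up:

from time import sleep
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="real-estate-advisor")

def to_lat_long(address):
    # Returns (latitude, longitude) for an address, or None if no match is found.
    location = geolocator.geocode(address)
    return (location.latitude, location.longitude) if location else None

for addr in ["123 Main St, Madison, WI"]:   # made-up address for illustration
    print(addr, to_lat_long(addr))
    sleep(1)                                # stay polite with the free service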
This was my first time reading and writing Python code, so I took my time making sure I understood each line. The script consists of three methods: write_to_csv, get_school_urls, and get_school_info. Each method performs its respective task (except get_school_urls, which actually gets all of the district URLs): the write_to_csv method is called by the other methods to save the collected data, get_school_urls gathers the URLs for every district in Wisconsin, and get_school_info reads in the district URLs gathered by get_school_urls and parses each link to scrape the school-specific data.
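To check my understanding, I sketched the overall shape of the script below. This is my simplified reading of it, with placeholder bodies and a hypothetical Wisconsin URL, not the actual project code:

import csv
import requests
from bs4 import BeautifulSoup

def write_to_csv(rows, filename):
    # Called by the other methods to append whatever data they collect.
    with open(filename, "a", newline="") as f:
        csv.writer(f).writerows(rows)

def get_school_urls():
    # Despite its name, this gathers the URL of every district in Wisconsin.
    page = requests.get("https://www.greatschools.org/wisconsin/").text  # hypothetical URL
    soup = BeautifulSoup(page, "html.parser")
    district_urls = [a["href"] for a in soup.find_all("a", href=True)]
    write_to_csv([[u] for u in district_urls], "districts.csv")
    return district_urls

def get_school_info(district_urls):
    # Visits each district URL and scrapes the school-specific data.
    for url in district_urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # ...parse each school's name, address, etc. and pass it to write_to_csv...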
Once I felt comfortable with the code, I began brainstorming ways I could adjust it for the needs of our application. After reviewing the write_to_csv method, I found that I would not need to adjust it. The method that required a lot of adjusting was get_school_urls. The issue with this method was that it only searched Wisconsin (code shown in image 1 below), and it did so by hard-coding the URL to Wisconsin's homepage:
What I needed to do was make this method gather every district in the US. I approached this by studying the URL structure for Great Schools, then storing all of the state URLs in a separate .txt file in that format. The text file is then read in so the script searches all states instead of only Wisconsin (code shown in image 2 above).
Also, the original code searched all of the href attributes on each district's homepage for a specific format that would indicate a link to a school's homepage. Example:
<pre> / <cur link> / <post>
If a link on the state's page matched the above format, it was stored in a list, because it was a link to a school district in the state (code shown in image 1 below). To adjust the code so it would work for all states, I again chose to create another list of states in the specific format required for a Great Schools district URL, and made sure to check all of the required conditions to ensure each link was valid (code shown in image 2 below):
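As a rough sketch of that adjustment (the file name and the link check below are illustrative stand-ins, not the exact code from the images):

import requests
from bs4 import BeautifulSoup

# One Great Schools state URL per line, built by hand from the studied URL structure.
with open("state_urls.txt") as f:
    state_urls = [line.strip() for line in f if line.strip()]

def looks_like_district_link(href):
    # Illustrative check for the "<pre> / <cur link> / <post>" shape described above.
    parts = [p for p in href.split("/") if p]
    return len(parts) == 3 and "school-district" in parts[-1].lower()

district_urls = []
for state_url in state_urls:
    soup = BeautifulSoup(requests.get(state_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if looks_like_district_link(a["href"]):
            district_urls.append(a["href"])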
After making these adjustments, I was able to run the script, gather the link to every school district in the US, and store them in a CSV file. The plan for next week is to run the script to obtain the specific school data for our application, and to finish writing our poster proposal for the Richard Tapia Celebration of Diversity in Computing Conference.
My goal for this week was to learn how to use the Beautiful Soup library in order to intelligently web scrape data from our selected data sources. According to its documentation, Beautiful Soup "is a Python library designed for quick turnaround projects like screen-scraping." Using the methods it provides, we are able to extract data from any website.
Dr. Bansal provided us with a tutorial to get started with Beautiful Soup. I ran into a lot of issues installing Python and the Beautiful Soup library, because I was trying to complete tutorials that were written for older versions. Some were using old Python syntax, and others were using old Beautiful Soup syntax. It took a lot of researching to update the code in the tutorials to the newer versions. One example: many of the tutorials used the Python 2.7 print statement, and since I am using 3.5, I had to update the print calls to the following format: print(<string>). Once I worked through all of the syntactical issues, I was able to comfortably complete the online tutorials.
I then moved on to the problem of pulling the data from our desired sources for our real estate advisor application. There is a project on GitHub that performs a similar task, and I felt it would be a useful introduction to a real-world scraper. The script pulls the names and addresses of all of the schools in Wisconsin. I was able to successfully run the script, so all I need to do now is adjust the code to pull data from all of the states and gather all of the data we need, such as:
My goals for the coming week are to complete the poster proposal and, if I have time, to begin adjusting the code to pull the data from the Great Schools site.
The goal of this week was to improve our ontology by familiarizing ourselves with the data we will be receiving from our data sources. We met with Dr. Bansal, and I was able to get feedback on the questions that arose while building the ontology. She clarified how we will use the sameAs property to link our instances to sources like schema.org or DBpedia. She also explained that what we decide to name our classes does not have to align with prior vocabularies.
Since we decided to revisit building the ontology once we finalize our data set, we went back to locating our data. We obtained API keys for our various sources and had thought that would be sufficient. The issue we now face is the limitations of the APIs. For example, the Great Schools API has a limit of 3,000 calls per day. This seems very restrictive, but since we are just looking to obtain a pool of data, it can still work. The reason it is sufficient is the methods the API provides. There is a call that searches for nearby schools: we simply provide an address and a radius to obtain a list of schools (a rough sketch of what such a call might look like appears after this paragraph). If we enter a large enough radius, we should expect to receive a list of all schools in the US, which is exactly what we need. From there, we can review all of the properties that are included for each school and begin to organize them within our created schema.

Not all of the APIs we obtained access to provide the same level of data access. I looked into Zillow's API in order to obtain lists of properties that are either for sale or for rent. Once I obtained the API key, I looked into what types of methods it provides. Zillow's API usage is very limited: we are restricted to viewing one property per call. With the number of properties for sale or for rent, we would quickly hit our call limit if we used this in a live application. There were similar issues with the APIs my group members were researching, so we knew another approach to obtaining the data would need to be taken.
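The nearby-schools sketch mentioned above might look roughly like this; the endpoint path and parameter names are my assumptions and should be checked against the official documentation before relying on them:

import requests

API_KEY = "YOUR_KEY"   # the issued Great Schools API key

# Assumed endpoint and parameters; treat these as placeholders, not the documented call.
resp = requests.get(
    "https://api.greatschools.org/schools/nearby",
    params={"key": API_KEY, "address": "Madison, WI", "radius": 50},
)
print(resp.status_code, resp.text[:200])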
Dr. Bansal had provided us with a research paper she was involved with that covered building a linked data application. In the paper, they resorted to screen scraping websites to obtain their desired data. I am familiar with the process and understand the basic concepts of how it works, but I have never had any hands-on experience. There are many techniques that can be used to screen scrape data, and it can be done in a number of different programming languages. Instead of diving into screen scraping without any guidance or best practices, I will ask Dr. Bansal for suggestions and resources. I hope to familiarize myself with screen scraping and be comfortable writing scripts by the end of this year.
The goal for this week was to begin designing the ontology for our Real Estate Advisor application. Our hierarchy relies on our data sources and the data we will be able to obtain from them. The sources we decided on were ones that provide open web APIs. We have yet to receive keys for all of our sources, so our ontology is still a work in progress. I was able to get it started by creating a class hierarchy for our application (see attached). We will wait to add properties and restrictions until we find out exactly what types of data we will be able to utilize from our sources. A few questions arose while building the ontology.
One question was about the naming conventions for properties. Schema.org has predefined property values for the classes listed in its ontology, but what if we want to create a new property? How would that translate for others who may end up referencing our data sets in the future?
The main question was how and where in the implementation process will we actually use our ontology? This led me to research online how linked data applications have been created. I came across a video series titled the EUCLID Project. I learned a lot through this series; it answered many of my questions and filled in the gaps in my understanding of how linked data applications are built. I learned that the application we are developing can be referred to as a domain-specific web application, because the data is geared toward a specific use.
The talk in the series goes into detail about the different types of software architectures for linked data applications. These architectures are designed to prevent a tightly coupled design, which allows applications to be expanded in the future if needed and promotes re-usability. One architecture type the speaker discusses is the general three-tiered architecture (which is something I believe we should implement). It consists of the following layers: the Presentation Tier, the Logic Tier, and the Data Tier. The Presentation Tier houses the GUI code for the application. The Logic Tier is where data is intelligently managed between the Presentation and Data Tiers; this is also referred to as business logic. Lastly, the Data Tier houses our triple store. The idea of a triple store is what sets linked data applications apart from other applications. The triple store is populated by obtaining data from external sources and mapping it to our predefined ontology, which forms our triples; the data is then stored in our database in triple form (hence the name "triple store"). The scripts written in our Logic Tier will query the triple store in a way that takes advantage of the properties used to describe our data.
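To make the tiers concrete for myself, here is a hedged sketch of what a Logic Tier query against the triple store might look like; the endpoint URL, prefix, and property names are invented placeholders rather than our actual schema:

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local SPARQL endpoint exposed by the Data Tier (for example, a Fuseki instance).
sparql = SPARQLWrapper("http://localhost:3030/realestate/query")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rea: <http://example.org/realestate#>
    SELECT ?listing ?price
    WHERE {
        ?listing a rea:Listing ;
                 rea:price ?price .
    }
    LIMIT 10
""")

# Each binding row holds the values the Presentation Tier would display.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["listing"]["value"], row["price"]["value"])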
There was a gap in my understanding of how publicly available linked data is used and maintained in an application, especially if the data being used needs to be dynamic. I understand that there are public APIs that can be utilized to perform such tasks, but I did not have a full understanding of how. It was also unclear to me how to keep data obtained via APIs linked by applying what we have learned so far about linked data. Would this require more manual labor to tag the entire data set? I decided to look for examples or tutorials that would help me better understand these concepts. Through my research I learned that linked data has been underused in the past because it relied on manual tagging, and the tagging was not consistent. This made it time-consuming to use and required screen scraping of individual sites to obtain or manually link the data. There are now services that act as repositories for linked data sets. That brings up more questions: why do we not have one source that all data is linked to? Does this duplication of data cause gaps in the connections between linked data? These are questions I intend to ask Dr. Bansal about.
JSON-LD is a method to serialize and transfer linked data. This talk really piqued my interest in why someone would use JSON-LD versus other methods of transferring data. I found a site dedicated to explaining what JSON-LD is and how to use it. The site contains many useful tutorials covering everything from linked data basics to the issues we face with linked data and how JSON-LD aims to resolve some of them. JSON-LD's main purpose is to resolve the ambiguity among naming schemes from our data sources by giving the data a context, mainly when obtaining data via JSON.
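As a small illustration of that idea (the vocabulary terms and values below are ones I picked for the example, not taken from any of our sources), the @context maps plain JSON keys onto shared IRIs so the same key means the same thing no matter which source a record came from:

import json

listing = {
    "@context": {
        "name": "http://schema.org/name",
        "address": "http://schema.org/address",
        "price": "http://schema.org/price",
    },
    "@type": "http://schema.org/SingleFamilyResidence",
    "name": "Example Listing",
    "address": "123 Main St, Madison, WI",
    "price": 250000,
}

print(json.dumps(listing, indent=2))  # a small, valid JSON-LD document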
My goal for next week is to try creating an API-based web application.
We were assigned a tutorial on building an ontology for pizza in Protege. Protege is an ontology development environment and is currently on version 5.0. The Pizza Ontology tutorial referenced version 4, so a lot of the steps did not apply to the new version. Instead of downgrading to an older version of Protege, I decided to figure out how to complete the tutorial with the new layout. The tutorial covered creating a class hierarchy, disjointing classes, creating relationships between classes through object properties, creating an object property hierarchy, inverse properties, functional properties, transitive properties, symmetric properties, property domain/range, and property restrictions. Below is a table I put together that provides an overview of these properties:
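As a companion to that overview, here is a rough sketch of how a few of those property characteristics can be asserted in code with rdflib; the pizza class and property names are just the tutorial's familiar examples written out by hand, not an exact reproduction of the ontology:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

PIZZA = Namespace("http://example.org/pizza#")
g = Graph()

g.add((PIZZA.hasTopping, RDF.type, OWL.ObjectProperty))        # relationship between classes
g.add((PIZZA.isToppingOf, OWL.inverseOf, PIZZA.hasTopping))    # inverse property
g.add((PIZZA.hasBase, RDF.type, OWL.FunctionalProperty))       # at most one base per pizza
g.add((PIZZA.hasIngredient, RDF.type, OWL.TransitiveProperty)) # ingredients of ingredients count
g.add((PIZZA.isAdjacentTo, RDF.type, OWL.SymmetricProperty))   # holds in both directions
g.add((PIZZA.hasTopping, RDFS.domain, PIZZA.Pizza))            # property domain
g.add((PIZZA.hasTopping, RDFS.range, PIZZA.PizzaTopping))      # property range

print(g.serialize(format="turtle"))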
Dr. Bansal also asked us to begin researching public data sources that would be useful for the real estate application we will be developing. Before researching, I put together a list of data sources I would like our application to utilize:
Now, locating those data sets online is the real challenge. I'm familiar with websites that are known to provide open data, such as data.gov, which provides many data sets with publicly available APIs. Another good open data site is freebase.com.
Dr. Bansal provided us with a research article by her and Sebastian Kagemann titled MOOCLink: Building and Utilizing Linked Data from Massive Open Online Courses. The article describes a research project to utilize the data from multiple MOOC websites such as Coursera, Udacity, and edX. The document begins by providing background on the guidelines and resources used to build the project.
The section on web crawling explained how Coursera's data was gathered through their course catalog API. I have heard the term API used a lot, both at school and in the web industry, but I had never understood exactly what it was or how it was used. I did a Google search for Coursera's catalog API and found the base URLs. Coursera uses JSON as its data exchange format and provides links to what looks to be a file containing many JSON objects with information about the courses on their site. This data is made public, so anyone can query it.
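For my own reference, a query of such a catalog could look roughly like the sketch below; the URL and response fields are assumptions on my part, not verified against Coursera's actual API:

import requests

CATALOG_URL = "https://api.coursera.org/api/catalog.v1/courses"  # placeholder endpoint

resp = requests.get(CATALOG_URL, params={"limit": 5})
resp.raise_for_status()

# Assumed response layout: a JSON object whose "elements" list holds course records.
for course in resp.json().get("elements", []):
    print(course.get("name"))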
I was also not familiar with web crawlers and had never heard the term screen scraper, so I began researching these topics. The amount of information on web crawlers is a bit overwhelming. Web crawlers can be written in many languages, and there are also numerous web scraping applications; this site contains a good list of apps, each with a description of its use.
This paper gave me a better understanding of how we will approach this research project. I was interested in their use of schema.org's CreativeWork vocabulary and wanted to learn how to use schema.org's vocabularies, so I visited their site. There, I found a really good Getting Started tutorial that explains how to use their vocabularies. The tutorial covers how to mark up a webpage using HTML5 attributes referred to as Microdata. Along with going through the tutorial on schema.org's website, I also found a Google project on their codelabs site that involves an Android application and shows how to use linked data to integrate with the voice commands of an Android device. A video series released on YouTube explains the concept of linked data and how to apply it to the project. The series also covers an open source graph database called Cayley that was inspired by Freebase and Google's Knowledge Graph. I learned that Google's Knowledge Graph is the technology that powers the boxes in Google search results that provide direct answers to your search instead of just a list of links. Cayley provides an interface that displays a graph representation of data that has been structured as triples. Seeing these different ways of using linked data made me aware that this technology applies to large systems and has virtually endless uses.
I plan to write some test scripts, use the data I was able to gather, and integrate it with the Cayley tool to create a basic knowledge graph.
OWL (Web Ontology Language) and RDF (Resource Description Framework) are the two standards that govern the construction and processing of ontologies. Ontologies are used to provide an understanding of the structure of information through modeling; they are an alternative to viewing source code. Ontologies are made up of two main components: classes and relationships. These two components form triples by connecting two classes through a relationship. Below is an example of a triple:
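As a rough illustration (the resource names below are made up, and rdflib is just one convenient way to write the triple down):

from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/realestate#")
g = Graph()

# Subject, predicate, object: "House123 is located in Madison".
g.add((EX.House123, EX.locatedIn, EX.Madison))
g.add((EX.Madison, EX.name, Literal("Madison, WI")))

print(g.serialize(format="turtle"))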
RDF was built on top of XML in order to give meaning to the content of XML tags, which XML alone did not do. RDF uses the idea of triples to create vocabularies. These vocabularies can be referenced via URIs that identify the desired properties. OWL is an extension of RDF that has three sub-languages: OWL Lite, OWL DL, and OWL Full. In that order, each language is more expressive than the last. Both OWL and RDF are written in XML, but what makes OWL different is that it is better interpreted by computers through its larger vocabulary and stronger syntax. The purpose of OWL is to create web ontologies that build a "web of data" by making the web's textual content readable by machines. Dr. Bansal has advised us of an open source application that is used to create web ontologies using the OWL language.