This week was very productive. I spent hours researching why my Redfin scraper appeared to be skipping statements and crashing. I finally learned the cause: when a Selenium web driver interacts with a webpage in a way that makes the page load or reload, you have to explicitly stop the script from executing the next statement until the page has finished loading. Selenium does not automatically wait for a page to load, so if the page loads too slowly, the next statement runs too soon, and that is exactly what was happening to me. Harry J.W. Percival, the author of Test-Driven Development with Python, covers how to resolve this issue on his blog, and I took what I needed from that post and incorporated it into my script (a rough sketch is below). To test the new wait method, I added a condition to only scrape Arizona properties. After a few hours, the script completed without any errors, so I felt confident enough to let it scrape the remaining states. That run took 5 days to complete (slowness is the downside of using Selenium, because it drives an actual browser to make the calls).
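For context, here is a minimal sketch of the kind of explicit wait I ended up using. The helper name, timeout, and usage are my own illustration rather than a copy of Percival's post, but the idea is the same: keep a reference to the current page's `<html>` element, trigger the action that reloads the page, then block until Selenium reports that the old element has gone stale, which means the new page has replaced it.

```python
from contextlib import contextmanager

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


@contextmanager
def wait_for_page_load(driver, timeout=30):
    """Block after the wrapped action until the page has actually reloaded."""
    # Keep a handle on the current page's <html> element.
    old_page = driver.find_element(By.TAG_NAME, "html")
    yield
    # Once the old <html> element goes stale, the new page has loaded.
    WebDriverWait(driver, timeout).until(EC.staleness_of(old_page))


# Hypothetical usage: wrap any interaction that triggers a page load/reload.
# with wait_for_page_load(driver):
#     download_link.click()
```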
Because the script visits every city and downloads a unique CSV file to a designated folder, I ended up with a folder of almost 9,000 CSV files. The next step is to convert all of the data to JSON format, so I needed a way to combine all of the files into one. I did not want to do this manually, so I looked into ways to automate the process. After researching Python's libraries, I found a way to search for all CSV files in a given directory and write each one into a single combined file. This worked out pretty well, and I now have all of the property data I scraped from every state in one file. I have yet to convert the data to JSON.
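As a rough sketch of that combination step (the folder and file names here are placeholders, not the ones I actually used), the standard glob and csv modules are enough: find every CSV in the directory, write the header row once, and append every data row to the combined file.

```python
import csv
import glob
import os

source_dir = "redfin_csvs"          # hypothetical folder of downloaded city files
combined_path = "all_properties.csv"  # hypothetical output file

csv_files = sorted(glob.glob(os.path.join(source_dir, "*.csv")))

with open(combined_path, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    header_written = False
    for path in csv_files:
        with open(path, newline="") as infile:
            reader = csv.reader(infile)
            header = next(reader, None)  # each downloaded file repeats the header row
            if header is None:
                continue  # skip empty files
            if not header_written:
                writer.writerow(header)
                header_written = True
            writer.writerows(reader)  # append the remaining data rows
```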
While the script was running, I took the time to organize all of my scrapers and data, then pushed them to our GitHub repository for this project.