This week's goal was to obtain real estate data. I began by researching the most common and well-known real estate websites: Zillow, Trulia, and realtor.com, hoping one of them would have an API that was not too restricted. Zillow's API only returns property data for a specified address, so it cannot be used to obtain all of the properties that are on the market. Trulia's API only exposes general data about cities and states, which also will not help us locate properties that are for sale. I then looked into realtor.com, but was not able to find an API on their website at all. I learned that getting this data would not be simple, because of how valuable these leads are in the world of real estate. On sites like Quora I found posts from other people trying to obtain the same data, and learned that MLS data is usually only provided to realtors and brokerages that are members of a specific MLS. After learning this, I had to resort to web scraping.
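To illustrate the Zillow limitation: its web service (since retired) took a single address per request, which is why it could never enumerate everything on the market. Below is a rough sketch of building such a request; the endpoint and parameter names are my best recollection of Zillow's documentation at the time, so treat them as assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint for Zillow's (since retired) GetSearchResults web
# service; the key point is that each call covers exactly ONE address.
ZWS_ENDPOINT = "http://www.zillow.com/webservice/GetSearchResults.htm"

def zillow_search_url(zws_id, address, citystatezip):
    """Build a request URL for a single-property lookup."""
    query = urlencode({
        "zws-id": zws_id,              # API key issued by Zillow
        "address": address,            # street address of one property
        "citystatezip": citystatezip,  # e.g. "Tempe, AZ"
    })
    return f"{ZWS_ENDPOINT}?{query}"
```

Because the address is a required input, you would already need a list of every property to query, which defeats the purpose.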
I then tried PhantomJS. PhantomJS is a standalone headless browser, so I can use Selenium like I did before, but instead of driving a normal browser like Firefox to test the script, I can drive the PhantomJS headless browser. I wrote a quick script to run against the Redfin site: open the site, enter "Tempe, AZ" in the search box, then click on the download link. The script failed when it got to locating the search box by its class name. I could not find a mistake in the code I wrote, so I tested it using Firefox instead of PhantomJS. With Firefox, the script was able to find the search box, enter Tempe, and load the Tempe page with the listings and download link, but it failed when trying to locate the download link.

Researching why PhantomJS would not work, I found that many people hitting similar issues traced them to a known bug, and that a common workaround was to set an explicit window size on the PhantomJS browser. I tried this, but the same failure persisted. Since I could not find a solution to the PhantomJS issue, I decided to stick with Selenium and Firefox, which left the problem of the download link. After some research and testing, I found that I had made a mistake writing the click statement; after fixing it, I confirmed that my code was in fact locating the download link. The next issue was getting the click to actually download the file. I read online that I could adjust the browser's download settings so that it skips the prompt and automatically stores downloads in a specific folder. I tried this, but the file still did not download, so I will need to continue researching this issue. Once I am able to get the download link to work, I will be able to gather all properties for sale in the US along with the following data:
SALE TYPE, HOME TYPE, ADDRESS, CITY, STATE, ZIP, LIST PRICE, BEDS, BATHS, LOCATION, SQFT, LOT SIZE, YEAR BUILT, PARKING SPOTS, PARKING TYPE, DAYS ON MARKET, STATUS, NEXT OPEN HOUSE DATE, NEXT OPEN HOUSE START TIME, NEXT OPEN HOUSE END TIME, RECENT REDUCTION DATE, ORIGINAL LIST PRICE, LAST SALE DATE, LAST SALE PRICE, URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING), SOURCE, LISTING ID, ORIGINAL SOURCE, FAVORITE, INTERESTED, LATITUDE, LONGITUDE, IS SHORT SALE
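For reference, here is a minimal sketch of the Selenium-plus-Firefox flow described above. The Firefox preference keys are the standard about:config names for suppressing the download prompt; the element class names and download directory are placeholders for illustration, not the actual values from my script.

```python
def firefox_download_prefs(download_dir):
    """Firefox preferences that save CSV files without a prompt."""
    return {
        # 2 = use the custom directory given in browser.download.dir
        "browser.download.folderList": 2,
        "browser.download.dir": download_dir,
        # skip the "open or save?" dialog for these MIME types
        "browser.helperApps.neverAsk.saveToDisk": "text/csv,application/octet-stream",
    }

def run_scrape(download_dir="/tmp/redfin"):  # placeholder path
    """Open redfin.com, search for Tempe, AZ, and click the download link."""
    from selenium import webdriver  # imported here so the prefs stay testable alone

    profile = webdriver.FirefoxProfile()
    for key, value in firefox_download_prefs(download_dir).items():
        profile.set_preference(key, value)

    driver = webdriver.Firefox(firefox_profile=profile)
    driver.set_window_size(1280, 1024)  # the workaround suggested for PhantomJS
    driver.get("https://www.redfin.com")

    box = driver.find_element_by_class_name("search-input-box")  # placeholder class
    box.send_keys("Tempe, AZ\n")
    driver.find_element_by_class_name("download-link").click()   # placeholder class
    driver.quit()
```

Setting the preferences up front is the piece I am still debugging; if the `saveToDisk` MIME type does not exactly match what the server sends, Firefox will still prompt instead of saving.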