This week's goal was to obtain real estate data. I began by researching the most common and well-known real estate websites: Zillow, Trulia, and realtor.com, hoping one of them would have an API that was not too restricted. Zillow's API only returns property data for a specified address, so it cannot be used to obtain all of the properties on the market. Trulia's API only exposes general data about cities and states, which also does not help with locating properties that are for sale. I then looked into realtor.com, but I was not able to find an API on their website at all. I learned that getting this data would not be simple, because of how valuable these leads are in the world of real estate. On sites like Quora I found posts from other people trying to obtain the same data, and learned that MLS data is usually only provided to realtors and brokerages that are members of a specific MLS. After learning this, I had to resort to web scraping.
My next goal was to find a website with a scraper-friendly format. Many of the sites listed only a few properties per page and were paginated, so writing a script for them could get complicated. Many were also JavaScript-rendered pages, which we are not able to scrape using Beautiful Soup and the standard Python libraries. I did find one website with a feature I could not find on any other real estate site: redfin.com allows you to search by city and returns a list of all properties for sale in that city. Not only does it provide the list, it provides a link to download the entire list in CSV format! I was really excited to find this, because I thought it would make writing a scraper much simpler.

After studying the site's URL structure to see how I would fetch every city's list, I found that each state is given a unique ID, which made writing the scraper a bit more complicated. Since that unique ID is embedded in the URLs, I would instead need to write a script that automates the site's main search function: pass every city in the US into the search field, then locate the download link on each results page. This, of course, requires reading JavaScript-rendered pages.

Since redfin.com relies on JavaScript-rendered pages, I knew I would need to research methods for scraping them. Saul ran into the same issue with the data he was gathering and found some tools that are said to scrape JavaScript-rendered content. The two he introduced me to were Ghost.py and PhantomJS. Ghost.py seemed similar to Selenium, a tool I had used before with a JavaScript tool that provided the latitude and longitude for a given address. I got that to work, but the performance was poor, because it actually opened a browser and automated the steps in the script; the tool is mainly intended for acceptance testing. What made Ghost.py different was that it performed the same tasks with a headless browser, which improves performance. So I went through the installation process for Ghost.py, then learned it required an additional library: either PySide or PyQt. The problem was that I was running Python 3.5, and neither of those libraries supported it. I did not want to downgrade, because many of my other scripts relied on the 3.5 version.

I then tried PhantomJS. PhantomJS is a standalone headless browser, so I could use Selenium as I had before, but instead of driving a normal browser like Firefox, I could drive the PhantomJS headless browser. I wrote a quick script against the Redfin site that opens the page, enters "Tempe, AZ" in the search box, then clicks the download link (a sketch follows below). The script failed when it tried to locate the search box by its class name. I could not find a mistake in the code I wrote, so I tested it using Firefox instead of PhantomJS. With Firefox, it found the search box, entered Tempe, and loaded the Tempe page with the list and download link, but it failed when trying to locate the download link. Researching why PhantomJS would not work, I found that many people with similar issues pointed to a bug whose workaround is to set an explicit window size on the PhantomJS browser. I tried this, but continued to have the same issue.
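For reference, here is a minimal sketch of the Selenium-plus-PhantomJS approach described above, including the window-size workaround. The element locators are hypothetical: the real class name and link identifier would have to be pulled from Redfin's page source with the browser's developer tools.

# Minimal sketch: drive PhantomJS through Selenium to search Redfin
# and click the CSV download link. Locators are placeholders.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.PhantomJS()
# Workaround for the reported PhantomJS bug: give the headless
# browser an explicit window size so elements render and can be found.
driver.set_window_size(1280, 1024)

driver.get("http://www.redfin.com")

# Hypothetical class name for the search input; the real one must be
# taken from the page source.
search_box = driver.find_element_by_class_name("search-input-box")
search_box.send_keys("Tempe, AZ")
search_box.send_keys(Keys.RETURN)

# Hypothetical identifier for the CSV download link on the results page.
download_link = driver.find_element_by_id("download-and-save")
download_link.click()

driver.quit()

Swapping webdriver.PhantomJS() for webdriver.Firefox() is the only change needed to rerun the same script against a visible browser, which is how I compared the two.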
I was not able to find a solution to the PhantomJS issue, so I decided to stick with Selenium and Firefox, but I still needed to figure out why I could not click the download link. After spending some time researching and testing, I found that I had made a mistake writing the click statement; after adjusting it, I confirmed that my code is in fact locating the download link. The next issue was getting the click method to actually download the file. I read online that I could try adjusting the browser's download settings so it does not prompt and instead automatically stores downloads in a specific folder (sketched below). After trying this, the file still did not download, so I will need to continue researching this issue. Once I am able to get the download link to work, I will be able to gather all properties for sale in the US along with the following fields: SALE TYPE, HOME TYPE, ADDRESS, CITY, STATE, ZIP, LIST PRICE, BEDS, BATHS, LOCATION, SQFT, LOT SIZE, YEAR BUILT, PARKING SPOTS, PARKING TYPE, DAYS ON MARKET, STATUS, NEXT OPEN HOUSE DATE, NEXT OPEN HOUSE START TIME, NEXT OPEN HOUSE END TIME, RECENT REDUCTION DATE, ORIGINAL LIST PRICE, LAST SALE DATE, LAST SALE PRICE, URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING), SOURCE, LISTING ID, ORIGINAL SOURCE, FAVORITE, INTERESTED, LATITUDE, LONGITUDE, IS SHORT SALE
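Here is a minimal sketch of the Firefox download-preference approach mentioned above, assuming the CSV should land in a /tmp/redfin folder without a save prompt. The MIME types listed are an assumption and would need to match the Content-Type Redfin actually serves for the download.

# Minimal sketch: configure a Firefox profile so Selenium clicks
# download files silently into a fixed folder.
from selenium import webdriver

profile = webdriver.FirefoxProfile()
# folderList 2 = save downloads to the custom directory below.
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.dir", "/tmp/redfin")
# Suppress the save dialog for these MIME types (assumed; must match
# what the server actually sends for the CSV).
profile.set_preference(
    "browser.helperApps.neverAsk.saveToDisk", "text/csv,application/csv"
)

driver = webdriver.Firefox(firefox_profile=profile)

If the download still does not start, one thing to check is whether the server sends the file with a different Content-Type than the ones listed in neverAsk.saveToDisk, since the preference only applies to exact MIME-type matches.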