My goal for this week was to learn how to use the Beautiful Soup library to intelligently web-scrape data from our selected data sources. According to the Beautiful Soup documentation, it "is a Python library designed for quick turnaround projects like screen-scraping." Using the methods it provides, we are able to extract data from nearly any website.
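As a quick illustration of what the library does, here is a minimal sketch that parses a small, made-up HTML snippet (the page content is invented for the example) and pulls out a heading and a list of items:

```python
from bs4 import BeautifulSoup

# A made-up page standing in for a real site we would scrape.
html = """
<html><body>
  <h1>Example Page</h1>
  <ul>
    <li class="item">Alpha</li>
    <li class="item">Beta</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the first <h1> and every <li> with class="item".
title = soup.find("h1").get_text()
items = [li.get_text() for li in soup.find_all("li", class_="item")]

print(title)  # Example Page
print(items)  # ['Alpha', 'Beta']
```

The same `find` / `find_all` calls work on any page once it has been downloaded, which is all a scraper really is: fetch the HTML, then pick out the tags you care about.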
Dr. Bansal provided us with a tutorial to get started with Beautiful Soup. I ran into a lot of issues installing Python and the Beautiful Soup library, because I was trying to complete tutorials that were written for older versions. Some were using old Python syntax, and others were using old Beautiful Soup syntax. It took a lot of research to update the code in the tutorials to the newer versions. For example, many of the tutorials used the Python 2.7 print statement, and since I am using 3.5, I had to update those calls to the function form: print(<string>). Once I worked through all of the syntactical issues, I was able to comfortably complete the online tutorials.
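The two version gaps I kept hitting can be summed up in a few lines. This runs under Python 3 with Beautiful Soup 4 installed; the commented-out lines show the older syntax the tutorials used:

```python
# Python 2 tutorials use the print statement:
#   print "hello"
# In Python 3, print is a function:
print("hello")

# Older tutorials import Beautiful Soup 3 directly:
#   from BeautifulSoup import BeautifulSoup
# Beautiful Soup 4 lives in the bs4 package instead:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>updated</p>", "html.parser")

# BS3-style findAll() still works in BS4, but find_all() is the
# modern spelling used throughout the current documentation.
text = soup.find_all("p")[0].get_text()
print(text)  # updated
```

Knowing these two substitutions (print statement to print(), BeautifulSoup to bs4) was enough to get most of the older tutorial code running.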
I then moved on to the problem of pulling data from our desired sources for our real estate advisor application. There is a project on GitHub that performs a similar task, and I felt it would be a useful introduction to a real-world scraper. The script pulls the names and addresses of all of the schools in Wisconsin. I was able to successfully run the script, so all I need to do now is adapt the code to pull that data, such as school names and addresses, for all of the states.
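A sketch of how that adaptation might look: the function below parses (name, address) pairs out of a school listing page. The markup here is invented for the example; the real GitHub script targets a different page structure, so the table id, class names, and the state-loop comment are all assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; the real script parses a different
# page layout, so these tags and classes are placeholders.
SAMPLE_PAGE = """
<table id="schools">
  <tr><td class="name">North High School</td>
      <td class="addr">1 Main St, Madison, WI</td></tr>
  <tr><td class="name">East Elementary</td>
      <td class="addr">22 Oak Ave, Milwaukee, WI</td></tr>
</table>
"""

def parse_schools(html):
    """Return (name, address) pairs from one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", id="schools").find_all("tr")
    return [(row.find("td", class_="name").get_text(),
             row.find("td", class_="addr").get_text())
            for row in rows]

# Generalizing from Wisconsin to every state would mean wrapping this
# in a loop that fetches one listing page per state, then calling
# parse_schools() on each response.
schools = parse_schools(SAMPLE_PAGE)
print(schools)
```

The parsing logic stays the same per page; the state-by-state generalization is just a loop around the fetch step.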
My goals for the coming week are to complete the poster proposal and, if I have time, to begin adjusting the code to pull data from the Great Schools site.