This was my first time reading and writing Python code, so I took my time making sure I understood each line. The script consists of three methods: write_to_csv, get_school_urls, and get_school_info. The write_to_csv method is called by the other two in order to save the collected data; get_school_urls gathers the URLs for every district in Wisconsin (despite its name, it actually collects district URLs, not school URLs); and get_school_info reads in the district URLs gathered by get_school_urls and parses each link to scrape the school-specific data.
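In rough outline, the script's structure looks something like this (a minimal sketch; the signatures are simplified and not the script's exact code):

```python
import csv

def write_to_csv(filename, rows):
    # Shared helper: the other methods call this to save the collected data.
    with open(filename, "a", newline="") as f:
        csv.writer(f).writerows(rows)

def get_school_urls():
    # Despite the name, this gathers the URL of every *district* in Wisconsin.
    ...

def get_school_info(district_urls):
    # Visits each district URL and scrapes the school-specific data,
    # handing the results to write_to_csv.
    ...
```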
Once I felt comfortable with the code, I began brainstorming ways I could adjust it for the needs of our application. After reviewing the write_to_csv method, I found that it would not need any changes. The method that required a lot of adjusting was get_school_urls. The issue with this method was that it only searched Wisconsin (code shown in image 1 below), and it did so by hard-coding the URL of Wisconsin's homepage:
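In rough form, the hard-coded approach looked something like this (a minimal sketch, assuming the state pages live at greatschools.org/&lt;state&gt;/; the exact URL is not taken from the original script):

```python
import requests
from bs4 import BeautifulSoup

# The starting page is hard-coded to a single state, so the scraper
# can never discover districts outside Wisconsin.
WISCONSIN_URL = "https://www.greatschools.org/wisconsin/"

soup = BeautifulSoup(requests.get(WISCONSIN_URL).text, "html.parser")
```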
What I needed to do was make this method gather every district in the US. I approached this by studying the URL structure for Great Schools, then storing all of the state URLs, in that format, in a separate .txt file. The text file is then read in so the script searches all states instead of only Wisconsin (code shown in image 2 below).
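A minimal sketch of that change, assuming a file named states.txt with one state URL per line (both the file name and the line format are simplifications):

```python
def get_state_urls(path="states.txt"):
    # Read one Great Schools state URL per line, skipping blank lines.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Search every state instead of only Wisconsin.
for state_url in get_state_urls():
    ...  # gather the district links for this state
```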
Also, the original code searched all of the href properties on the state's homepage for a specific format that would indicate a link to a district's homepage. Example:
<pre> / <cur link> / <post>
If a link on the state's page matched the above format, it was stored in a list, because it was a link to a school district in the state (code shown in image 1 below). In order to adjust the code so it would work for all states, I again chose to create another list of states in the specific format required for a Great Schools district URL, and made sure to check all of the required conditions to ensure each link was valid (code shown in image 2 below):
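A minimal sketch of that check, assuming district URLs begin with a state slug such as /wisconsin/ (the slug list and path shape are assumptions about the Great Schools URL format, not the script's exact logic):

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

STATE_SLUGS = {"wisconsin", "illinois", "minnesota"}  # ...and the rest of the states

def is_district_link(href):
    # Keep only hrefs whose first path segment is a known state slug
    # and that have at least one more segment (the district part).
    parts = [p for p in urlparse(href).path.split("/") if p]
    return len(parts) >= 2 and parts[0] in STATE_SLUGS

def collect_district_urls(state_url):
    soup = BeautifulSoup(requests.get(state_url).text, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        if is_district_link(a["href"]) and a["href"] not in urls:
            urls.append(a["href"])
    return urls
```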
After making these adjustments, I was able to run the script, gather the link to every school district in the US, and store them in a CSV file. The plan for next week is to run the script to obtain the specific school data for our application, and to finish writing our poster proposal for the Richard Tapia Celebration of Diversity in Computing Conference.
My goal for this week was to learn how to use the Beautiful Soup library in order to intelligently web-scrape data from our selected data sources. According to its documentation, Beautiful Soup "is a Python library designed for quick turnaround projects like screen-scraping." Using the methods it provides, we are able to extract data from any website.
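As a minimal example of the kind of extraction it makes possible (the URL here is just a placeholder):

```python
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")
soup = BeautifulSoup(page.text, "html.parser")

# find_all() returns every matching tag on the page.
for link in soup.find_all("a"):
    print(link.get("href"))  # the href attribute, or None if missing
```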
Dr. Bansal provided us with a tutorial to get started with Beautiful Soup. I ran into a lot of issues installing Python and the Beautiful Soup library, because I was trying to complete tutorials that were written for older versions. Some were using old Python syntax, and others were using old Beautiful Soup syntax. It took a lot of research to update the code in those tutorials to the newer versions. For example, many of the tutorials used the Python 2.7 print statement, and since I am using 3.5, I had to update every print to the function form: print(&lt;string&gt;). Once I worked through all of the syntactical issues, I was able to comfortably complete the online tutorials.
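The change is small, but it appears on nearly every line of the older tutorials (the variable here is just a placeholder):

```python
# Python 2.7 tutorials use the print statement:
#     print "scraped:", school_name
# Under Python 3.5, print is a function, so every call needs parentheses:
school_name = "Example Elementary"  # placeholder value
print("scraped:", school_name)
```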
I then moved on to the problem of pulling the data from our desired sources for our real estate advisor application. There is a project on GitHub that performs a similar task, and I felt it would be a useful introduction to a real-world scraper. The script pulls the names and addresses of all of the schools in Wisconsin. I was able to successfully compile and run the script, so all I need to do now is configure the code to pull data from all of the states and gather the rest of the data we need.
My goals for this week are to complete the poster proposal and, if I have time, to begin adjusting the code to pull the data from the Great Schools site.