This was my first time reading and writing Python code, so I took my time making sure I understood each line. The script consists of three methods: write_to_csv, get_school_urls, and get_school_info. Each performs its respective task (although get_school_urls, despite its name, actually gathers district URLs): write_to_csv is called by the other methods to save the collected data, get_school_urls gathers the URL for every district in Wisconsin, and get_school_info reads in the district URLs gathered by get_school_urls, then parses each link to scrape the school-specific data.
Once I felt comfortable with the code, I began brainstorming ways to adjust it for the needs of our application. After reviewing the write_to_csv method, I found that it would not need any changes. The method that required the most adjusting was get_school_urls. The issue with this method was that it only searched Wisconsin (code shown in image 1 below), and it did so by hard-coding the URL of Wisconsin's homepage:
What I needed to do was make this method gather every district in the US. I approached this by studying the URL structure of GreatSchools, then storing all of the state URLs in a separate .txt file in the respective format. The text file is then read in so the script searches all states instead of only Wisconsin (code shown in image 2 above).
Also, the original code searched all of the href properties on each district's homepage for a specific format that would indicate a link to a school's homepage. Example:
<pre> / <cur link> / <post>
If a link on the state's page matched the above format, it was stored in a list, because it was a link to a school district in that state (code shown in image 1 below). To adjust the code so it would work for all states, I again created a list of states in the specific format required for a GreatSchools district URL, and made sure to check all of the required conditions to ensure each link was valid (code shown in image 2 below):
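The href check can be sketched roughly like this. The `/<state>/<district>/schools/` pattern and the state-slug set are illustrative assumptions; the real format on the site, and the exact conditions the script checks, may differ:

```python
import re

# Illustrative subset of state slugs; the full list would cover every state.
STATE_SLUGS = {"wisconsin", "illinois", "minnesota"}

# Assumed district-link shape: "/<state-slug>/<district-slug>/schools/".
DISTRICT_RE = re.compile(r"^/([a-z-]+)/([a-z0-9-]+)/schools/?$")

def is_district_link(href):
    # A valid district link must match the pattern AND name a known state.
    match = DISTRICT_RE.match(href)
    return bool(match) and match.group(1) in STATE_SLUGS

def collect_district_links(hrefs):
    # Keep only the hrefs that pass all of the validity checks.
    return [h for h in hrefs if is_district_link(h)]
```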
After making these adjustments, I was able to run the script, gather the link to every school district in the US, and store them in a CSV file. The plan for next week is to run the script to obtain the specific school data for our application and finish writing our poster proposal for the Richard Tapia Celebration of Diversity in Computing Conference.