This week, I continued work on trying to solve the issue reading the data I received from Great Schools in the desired format. After studying and testing the csv and python libraries I was finally able to figure out how to pull only the address in the format I needed.
As I had originally thought, the issue was due to how the data in the csv file was formatted. The way I was able to solve this issue was by reformatting it to a tuple structure. By reformatting it the following format, I was able to access the address without it splitting the city, state and zip:
'District_Name', 'School_Name', 'Address'
What I was not aware of before was how the data was formatted after being read by the csv reader. When I opened the csv file in excel, or in my text editor, it was in the following format: ['<District>','<school>','<Address>'], but after it was read by the csv reader, it was wrapped by brackets and quotes ["['<District>','<school>','<Address>']"], which were causing the issue of printing in the incorrect format. I realized this by printing each row in the csv as it was being read; before, I was trying to only print the address and was not seeing the extra characters in the output. What I needed to do was remove the extra brackets and quotes from both ends. I was able to find some methods in pythons library that allowed me to complete this task: join, and string manipulation. Below is a snippet of the code used to reformat the csv data:
with open('school_data.csv', "r") as f:
data = csv.reader(f)
for row in data:
school_data = ','.join(row)
school_data = str(school_data)[2:-2].split("', '")
I tested this by using a sample data set of school information (for more time efficient testing). By getting it to the tuple format, I was able to print/access the full address, because python was able to differentiate the delimiter from the string elements.
Since I was able to successfully obtain the address from the great schools data, I was able to look into writing the script to make use of Google's Geocoder API. This was a fairly simple task, because Google's geocoder html requests are straight forward. I needed to concurrently read in all of the addresses from the great schools data, make the html geocode request using python's requests library, then appending the found data to respective school's data in my Great Schools data-set.
As I was writing the script, I was thinking about how our triple store will be implemented/created later in the project. I thought it would be nice to start working on tagging our data (make it machine readable) now, vs. having to tackle the issue later. Earlier in my research, I learned about JSON. There are tools like JSON-LD that would make our job of creating our triple store much easier. Since what I currently have is just the data, and not the corresponding identifier for each piece of data, I looked into how to convert it to JSON's key value pair format. After some time of researching how this can be done, I was able to accomplish this task by constructing a dictionary as I was parsing the csv file. Once I finalized writing the csv content to a dictionary format, I was able to use the json.dump() method to create the json encoded file.
Finally, I was ready to replace the previous address information gathered from great schools to the geocoder location data. The benefit of having the geocoder location data for each of "places" is that it is constructed in a way that allows for more effective queries (street number, name, county, state, country, latitude, longitude are all individually tagged). Using their format we will be able to show data per state, county, city etc. which I feel would provide beneficial data for our users. In order to complete this task, I just needed to use python's requests library to make Google's geocoder html request by assigning a base address to a variable, then looping through my list of addresses produced earlier, store the json result into my created dictionary replacing the old location data from Great Schools:
r = requests.get(base_addr+content)
geocoder_result = r.json()
"District" : content,
"School" : content,
"Location" : geocoder_result