This week, we continued working on scraping our selected data sources, with the goal of scraping Spot Crime's data. Spot Crime's site is structured differently from Great Schools, so a different approach is needed to get everything we need. What made this a bit more difficult was how they linked to the specific crime maps we needed. Writing the script to obtain all of the links to the state pages was simple, because the URL structure for each state is consistent: spotcrime.com/<state abbreviation>. The difficulty lies in each of the state pages. Each state page consists of 3 tables of links:
After studying the structure of the site, we figured that our best approach would be to pull the first <a> tag after every <td> tag. We should be able to do this because each state page uses the same class name for its tables, so knowing that will allow us to grab all of the tables from each state page. Once we do that, we can use Python together with the Beautiful Soup library to locate the links for each crime map.
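To make the idea concrete, here is a minimal sketch of that approach; the state URL and the table class name below are placeholders I made up, not values verified against the live site:

import requests
from bs4 import BeautifulSoup

STATE_URL = "https://spotcrime.com/wi"   # placeholder state page
TABLE_CLASS = "state-table"              # assumed shared class name on the tables

soup = BeautifulSoup(requests.get(STATE_URL).text, "html.parser")

crime_map_links = []
for table in soup.find_all("table", class_=TABLE_CLASS):
    for td in table.find_all("td"):
        a = td.find("a")                 # first <a> tag within the <td>
        if a and a.get("href"):
            crime_map_links.append(a["href"])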
Each county's crime map page contains a list of crimes and provides the type, date, and location. When you select a crime instance, you are brought to a detailed description of the crime that lists the time, case number (sometimes), and a summary of the crime. Saul wanted to take on the task of pulling the crime maps and data.
For next week, I plan to figure out a way to gather the lat/long of all of our addresses, or see if there is another approach we can take with only addresses. I would also like to move on to scraping our next data source.
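One option I am considering is a geocoding library. Below is a rough sketch using geopy's Nominatim geocoder; this is just an idea I am exploring, not code we have adopted, and the sample address is made up:

from time import sleep
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="real-estate-advisor")

def to_lat_long(address):
    # Returns (latitude, longitude) for an address, or None if no match is found.
    location = geolocator.geocode(address)
    return (location.latitude, location.longitude) if location else None

for addr in ["123 Main St, Madison, WI"]:   # made-up address for illustration
    print(addr, to_lat_long(addr))
    sleep(1)                                # stay polite with the free service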
This was my first time reading and writing Python code, so I took my time making sure I understood each line. The script consists of three methods: write_to_csv, get_school_urls, and get_school_info. Each method performs its respective task (except get_school_urls, which actually gets all of the district URLs): the write_to_csv method is called by the other methods to save the collected data, get_school_urls gathers the URLs for every district in Wisconsin, and get_school_info reads in the district URLs gathered by get_school_urls and parses each link to scrape the school-specific data.
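To check my understanding, I sketched the overall shape of the script below. This is my simplified reading of it, with placeholder bodies and a hypothetical Wisconsin URL, not the actual project code:

import csv
import requests
from bs4 import BeautifulSoup

def write_to_csv(rows, filename):
    # Called by the other methods to append whatever data they collect.
    with open(filename, "a", newline="") as f:
        csv.writer(f).writerows(rows)

def get_school_urls():
    # Despite its name, this gathers the URL of every district in Wisconsin.
    page = requests.get("https://www.greatschools.org/wisconsin/").text  # hypothetical URL
    soup = BeautifulSoup(page, "html.parser")
    district_urls = [a["href"] for a in soup.find_all("a", href=True)]
    write_to_csv([[u] for u in district_urls], "districts.csv")
    return district_urls

def get_school_info(district_urls):
    # Visits each district URL and scrapes the school-specific data.
    for url in district_urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # ...parse each school's name, address, etc. and pass it to write_to_csv...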
Once I felt comfortable with the code, I began brainstorming ways I could adjust it for the needs of our application. After reviewing the write_to_csv method, I found that I would not need to adjust it. The method that required a lot of adjusting was get_school_urls. The issue with this method was that it only searched Wisconsin (code shown in image 1 below), and it did so by hard-coding the URL to Wisconsin's homepage:
What I needed to do was make this method gather every district in the US. I approached this by studying the URL structure for Great Schools, then storing all of the state URLs in a separate .txt file in that format. The text file is then read in so the script searches all states instead of only Wisconsin (code shown in image 2 above).
Also, the original code searched all of the href attributes on each district's homepage for a specific format that would indicate a link to a school's homepage. Example:
<pre> / <cur link> / <post>
If a link on the state's page matched the above format, it was stored in a list, because it was a link to a school district in the state (code shown in image 1 below). To adjust the code so it would work for all states, I again chose to create another list of states in the specific format required for a Great Schools district URL, and made sure to check all of the required conditions to ensure each link was valid (code shown in image 2 below):
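As a rough sketch of that adjustment (the file name and the link check below are illustrative stand-ins, not the exact code from the images):

import requests
from bs4 import BeautifulSoup

# One Great Schools state URL per line, built by hand from the studied URL structure.
with open("state_urls.txt") as f:
    state_urls = [line.strip() for line in f if line.strip()]

def looks_like_district_link(href):
    # Illustrative check for the "<pre> / <cur link> / <post>" shape described above.
    parts = [p for p in href.split("/") if p]
    return len(parts) == 3 and "school-district" in parts[-1].lower()

district_urls = []
for state_url in state_urls:
    soup = BeautifulSoup(requests.get(state_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if looks_like_district_link(a["href"]):
            district_urls.append(a["href"])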
After making these adjustments, I was able to run the script, gather the link to every school district in the US, and store them in a CSV file. The plan for next week is to run the script to obtain the specific school data for our application, and to finish writing our poster proposal for the Richard Tapia Celebration of Diversity in Computing Conference.
My goal for this week was to learn how to use the Beautiful Soup library in order to intelligently web scrape data from our selected data sources. According to its documentation, Beautiful Soup "is a Python library designed for quick turnaround projects like screen-scraping." Using the methods it provides, we are able to extract data from any website.
Dr. Bansal provided us with a tutorial to get started with Beautiful Soup. I ran into a lot of issues installing Python and the Beautiful Soup library, because I was trying to complete tutorials that were written for older versions. Some were using old Python syntax, and others were using old Beautiful Soup syntax. It took a lot of researching to update the code in the tutorials to the newer versions. One example: many of the tutorials used the Python 2.7 print statement, and since I am using 3.5, I had to update the print calls to the following format: print(<string>). Once I worked through all of the syntactical issues, I was able to comfortably complete the online tutorials.
I then moved on to the problem of pulling the data from our desired sources for our real estate advisor application. There is a project on GitHub that performs a similar task, and I felt it would be a useful introduction to a real-world scraper. The script pulls the names and addresses of all of the schools in Wisconsin. I was able to successfully run the script, so all I need to do now is adjust the code to pull data from all of the states and gather all of the data we need, such as:
My goals for the coming week are to complete the poster proposal and, if I have time, to begin adjusting the code to pull the data from the Great Schools site.
The goal of this week was to improve our ontology by familiarizing ourselves with the data we will be receiving from our data sources. We met with Dr. Bansal, and I was able to get feedback on the questions that arose while building the ontology. She clarified how we will use the sameAs property to link our instances to sources like schema.org or DBpedia. She also explained that what we decide to name our classes does not have to align with prior vocabularies.
Since we decided to revisit building the ontology once we finalize our data set, we went back to locating our data. We obtained API keys for our various sources and had thought that would be sufficient. The issue we now face is the limitations of the APIs. For example, the Great Schools API has a limit of 3,000 calls per day. This seems very restrictive, but since we are just looking to obtain a pool of data, it can still work. The reason it is sufficient is the methods the API provides. There is a call that searches for nearby schools: we simply provide an address and a radius to obtain a list of schools (a rough sketch of what such a call might look like appears after this paragraph). If we enter a large enough radius, we should expect to receive a list of all schools in the US, which is exactly what we need. From there, we can review all of the properties that are included for each school and begin to organize them within our created schema.

Not all of the APIs we obtained access to provide the same level of data access. I looked into Zillow's API in order to obtain lists of properties that are either for sale or for rent. Once I obtained the API key, I looked into what types of methods it provides. Zillow's API usage is very limited: we are restricted to viewing one property per call. With the number of properties for sale or for rent, we would quickly hit our call limit if we used this in a live application. There were similar issues with the APIs my group members were researching, so we knew another approach to obtaining the data would need to be taken.
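The nearby-schools sketch mentioned above might look roughly like this; the endpoint path and parameter names are my assumptions and should be checked against the official documentation before relying on them:

import requests

API_KEY = "YOUR_KEY"   # the issued Great Schools API key

# Assumed endpoint and parameters; treat these as placeholders, not the documented call.
resp = requests.get(
    "https://api.greatschools.org/schools/nearby",
    params={"key": API_KEY, "address": "Madison, WI", "radius": 50},
)
print(resp.status_code, resp.text[:200])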
Dr. Bansal had provided us with a research paper she was involved with that covered building a linked data application. In the paper, they resorted to screen scraping websites to obtain their desired data. I am familiar with the process and understand the basic concepts of how it works, but I have never had any hands-on experience. There are many techniques that can be used to screen scrape data, and it can be done in a number of different programming languages. Instead of diving into screen scraping without any guidance or best practices, I will ask Dr. Bansal for suggestions and resources. I hope to familiarize myself with screen scraping and be comfortable writing scripts by the end of this year.
The goal for this week was to begin designing the ontology for our Real Estate Advisor application. Our hierarchy relies on our data sources and the data we will be able to obtain from them. The sources we decided on were ones that provide open web APIs. We have yet to receive keys for all of our sources, so our ontology is still a work in progress. I was able to get it started by creating a class hierarchy for our application (see attached). We will wait to add properties and restrictions until we find out exactly what types of data we will be able to utilize from our sources. A few questions arose while building the ontology.
One question was about the naming conventions for properties. Schema.org has predefined property values for the classes listed in its ontology, but what if we want to create a new property? How would that translate for others who may end up referencing our data sets in the future?
The main question was how and where in the implementation process will we actually use our ontology? This led me to research online how linked data applications have been created. I came across a video series titled the EUCLID Project. I learned a lot through this series; it answered many of my questions and filled in the gaps in my understanding of how linked data applications are built. I learned that the application we are developing can be referred to as a domain-specific web application, because the data is geared toward a specific use.
The talk in the series goes into detail about the different types of software architectures for linked data applications. These architectures are designed to prevent a tightly coupled design, which allows applications to be expanded in the future if needed and promotes re-usability. One architecture type the speaker discusses is the general three-tiered architecture (which is something I believe we should implement). It consists of the following layers: the Presentation Tier, the Logic Tier, and the Data Tier. The Presentation Tier houses the GUI code for the application. The Logic Tier is where data is intelligently managed between the Presentation and Data Tiers; this is also referred to as business logic. Lastly, the Data Tier houses our triple store. The idea of a triple store is what sets linked data applications apart from other applications. The triple store is populated by obtaining data from external sources and mapping it to our predefined ontology, which forms our triples; the data is then stored in our database in triple form (hence the name "triple store"). The scripts written in our Logic Tier will query the triple store in a way that takes advantage of the properties used to describe our data.
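To make the tiers concrete for myself, here is a hedged sketch of what a Logic Tier query against the triple store might look like; the endpoint URL, prefix, and property names are invented placeholders rather than our actual schema:

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local SPARQL endpoint exposed by the Data Tier (for example, a Fuseki instance).
sparql = SPARQLWrapper("http://localhost:3030/realestate/query")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rea: <http://example.org/realestate#>
    SELECT ?listing ?price
    WHERE {
        ?listing a rea:Listing ;
                 rea:price ?price .
    }
    LIMIT 10
""")

# Each binding row holds the values the Presentation Tier would display.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["listing"]["value"], row["price"]["value"])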
There was a gap in my understanding of how publicly available linked data is used and maintained in an application, especially if the data being used needs to be dynamic. I understand that there are public APIs that can be utilized to perform such tasks, but I did not have a full understanding of how. It was also unclear to me how to keep data obtained via APIs linked by applying what we have learned so far about linked data. Would this require more manual labor to tag the entire data set? I decided to look for examples or tutorials that would help me better understand these concepts. Through my research I learned that linked data has been underused in the past because it relied on manual tagging, and the tagging was not consistent. This made it time-consuming to use and required screen scraping of individual sites to obtain or manually link the data. There are now services that act as repositories for linked data sets. That brings up more questions: why do we not have one source that all data is linked to? Does this duplication of data cause gaps in the connections between linked data? These are questions I intend to ask Dr. Bansal about.
JSON-LD is a method to serialize and transfer linked data. This talk really piqued my interest in why someone would use JSON-LD versus other methods of transferring data. I found a site dedicated to explaining what JSON-LD is and how to use it. The site contains many useful tutorials covering everything from linked data basics to the issues we face with linked data and how JSON-LD aims to resolve some of them. JSON-LD's main purpose is to resolve the ambiguity among naming schemes from our data sources by giving the data a context, mainly when obtaining data via JSON.
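As a small illustration of that idea (the vocabulary terms and values below are ones I picked for the example, not taken from any of our sources), the @context maps plain JSON keys onto shared IRIs so the same key means the same thing no matter which source a record came from:

import json

listing = {
    "@context": {
        "name": "http://schema.org/name",
        "address": "http://schema.org/address",
        "price": "http://schema.org/price",
    },
    "@type": "http://schema.org/SingleFamilyResidence",
    "name": "Example Listing",
    "address": "123 Main St, Madison, WI",
    "price": 250000,
}

print(json.dumps(listing, indent=2))  # a small, valid JSON-LD document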
My goal for next week is to try creating an API-based web application.
We were assigned a tutorial on building an ontology for pizza in Protege. Protege is an ontology development environment and is currently on version 5.0. The Pizza Ontology tutorial referenced version 4, so a lot of the steps did not apply to the new version. Instead of downgrading to an older version of Protege, I decided to figure out how to complete the tutorial with the new layout. The tutorial covered creating a class hierarchy, disjointing classes, creating relationships between classes through object properties, creating an object property hierarchy, inverse properties, functional properties, transitive properties, symmetric properties, property domain/range, and property restrictions. Below is a table I put together that provides an overview of these properties:
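As a companion to that overview, here is a rough sketch of how a few of those property characteristics can be asserted in code with rdflib; the pizza class and property names are just the tutorial's familiar examples written out by hand, not an exact reproduction of the ontology:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

PIZZA = Namespace("http://example.org/pizza#")
g = Graph()

g.add((PIZZA.hasTopping, RDF.type, OWL.ObjectProperty))        # relationship between classes
g.add((PIZZA.isToppingOf, OWL.inverseOf, PIZZA.hasTopping))    # inverse property
g.add((PIZZA.hasBase, RDF.type, OWL.FunctionalProperty))       # at most one base per pizza
g.add((PIZZA.hasIngredient, RDF.type, OWL.TransitiveProperty)) # ingredients of ingredients count
g.add((PIZZA.isAdjacentTo, RDF.type, OWL.SymmetricProperty))   # holds in both directions
g.add((PIZZA.hasTopping, RDFS.domain, PIZZA.Pizza))            # property domain
g.add((PIZZA.hasTopping, RDFS.range, PIZZA.PizzaTopping))      # property range

print(g.serialize(format="turtle"))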
Dr. Bansal also asked us to begin researching public data sources that would be useful for the real estate application we will be developing. Before researching, I put together a list of data sources I would like our application to utilize:
Now, locating those data sets online is the real challenge. I'm familiar with websites that are known to provide open data, such as data.gov, which provides many data sets with publicly available APIs. Another good open data site is freebase.com.
Dr. Bansal provided us with a research article by her and Sebastian Kagemann titled MOOCLink: Building and Utilizing Linked Data from Massive Open Online Courses. The article describes a research project to utilize the data from multiple MOOC websites such as Coursera, Udacity, and edX. The document begins by providing background on the guidelines and resources used to build the project.
The section on web crawling explained how Coursera's data was gathered through their course catalog API. I have heard the term API used a lot, both at school and in the web industry, but I had never understood exactly what it was or how it was used. I did a Google search for Coursera's catalog API and found the base URLs. Coursera uses JSON as its data exchange format and provides links to what looks to be a file containing many JSON objects with information about the courses on their site. This data is made public, so anyone can query it.
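For my own reference, a query of such a catalog could look roughly like the sketch below; the URL and response fields are assumptions on my part, not verified against Coursera's actual API:

import requests

CATALOG_URL = "https://api.coursera.org/api/catalog.v1/courses"  # placeholder endpoint

resp = requests.get(CATALOG_URL, params={"limit": 5})
resp.raise_for_status()

# Assumed response layout: a JSON object whose "elements" list holds course records.
for course in resp.json().get("elements", []):
    print(course.get("name"))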
I was also not familiar with web crawlers and had never heard the term screen scraper, so I began researching these topics. The amount of information on web crawlers is a bit overwhelming. Web crawlers can be written in many languages, and there are also numerous web scraping applications; this site contains a good list of apps, each with a description of its use.
This paper gave me a better understanding of how we will approach this research project. I was interested in their use of schema.org's CreativeWork vocabulary and wanted to learn how to use schema.org's vocabularies, so I visited their site. There, I found a really good Getting Started tutorial that explains how to use their vocabularies. The tutorial covers how to mark up a webpage using HTML5 attributes referred to as Microdata. Along with going through the tutorial on schema.org's website, I also found a Google project on their codelabs site that involves an Android application and shows how to use linked data to integrate with the voice commands of an Android device. A video series released on YouTube explains the concept of linked data and how to apply it to the project. The series also covers an open source graph database called Cayley that was inspired by Freebase and Google's Knowledge Graph. I learned that Google's Knowledge Graph is the technology that powers the boxes in Google search results that provide direct answers to your search instead of just a list of links. Cayley provides an interface that displays a graph representation of data that has been structured as triples. Seeing these different ways of using linked data made me aware that this technology applies to large systems and has virtually endless uses.
I plan to write some test scripts, use the data I was able to gather, and integrate it with the Cayley tool to create a basic knowledge graph.
OWL (Web Ontology Language) and RDF (Resource Description Framework) are the two standards that govern the construction and processing of ontologies. Ontologies are used to provide an understanding of the structure of information through modeling; they are an alternative to viewing source code. Ontologies are made up of two main components: classes and relationships. These two components form triples by connecting two classes through a relationship. Below is an example of a triple:
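As a rough illustration (the resource names below are made up, and rdflib is just one convenient way to write the triple down):

from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/realestate#")
g = Graph()

# Subject, predicate, object: "House123 is located in Madison".
g.add((EX.House123, EX.locatedIn, EX.Madison))
g.add((EX.Madison, EX.name, Literal("Madison, WI")))

print(g.serialize(format="turtle"))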
RDF was built on top of XML in order to give meaning to the content of XML tags, which XML alone did not do. RDF uses the idea of triples to create vocabularies. These vocabularies can be referenced via URIs that identify the desired properties. OWL is an extension of RDF that has three sub-languages: OWL Lite, OWL DL, and OWL Full. In that order, each language is more expressive than the last. Both OWL and RDF are written in XML, but what makes OWL different is that it is better interpreted by computers through its larger vocabulary and stronger syntax. The purpose of OWL is to create web ontologies that build a "web of data" by making the web's textual content readable by machines. Dr. Bansal has advised us of an open source application that is used to create web ontologies using the OWL language.