Crawler of Deeds Part 1
The Problem
I met with a local community group to discuss an issue they were having traversing a county-run website. It’s a story familiar to most anyone who has interacted with a deeds- or records-related government site: it’s time consuming, it’s a terrible user experience, and it’s difficult to answer anything but the simplest questions with the information it provides. What started as a simple scraping program turned into something a lot more interesting, so I decided to keep track of the obstacles I hit and the technologies I’m using to solve this issue.
Iterative Development
To start things off, I got a handful of use cases and went through the site myself. It was extremely time consuming – each ID I needed to look up and gather information on took about 2 to 5 minutes end-to-end. Sifting through the site, a few things were clear:
- The URL never changed, which was an extra twist of the knife because it was a damn IP URL: http://162.217.184.82/i2/default.aspx?AspxAutoDetectCookieSupport=1
- Tons of one-off JS scripts were central to this ASP.NET 2.0 application (an old version, deprecated as of a decade ago)
- Huge POST payloads thick with stateful information, and sometimes callbacks that required a redirect
- A simple series of GET/POST scripts would not do the trick
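To make the "stateful information" point concrete: ASP.NET WebForms pages round-trip their state through hidden form fields such as __VIEWSTATE and __EVENTVALIDATION, and each POST has to echo the latest values back. Here is a minimal sketch of what a plain GET/POST scraper would have to do before every request. The sample HTML, the field values, and the SearchFormEx1$PINTextBox0 field name are made up for illustration and are not taken from the actual site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return the hidden form fields that must be echoed back in the next POST."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up page fragment; real WebForms pages carry far larger values.
SAMPLE_PAGE = """
<form method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTVALIDATION" value="AAAA" />
  <input type="text" name="SearchFormEx1$PINTextBox0" />
</form>
"""

# Start from the server-issued state, then add the user's own input on top.
payload = extract_hidden_fields(SAMPLE_PAGE)
payload["SearchFormEx1$PINTextBox0"] = "12"  # hypothetical field name
```

Doing this once is easy. Doing it after every step, while also reproducing whatever the page's JS callbacks and redirects change along the way, is where a hand-rolled approach gets painful.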
I wanted to get a feel for what automating some of these searches would look like. Enter the main tool I was familiar with for automating searches through a site: Selenium. I wrote a quick and dirty Python script to get my feet wet. It was pretty rough: a gaggle of IDs and XPaths, clicks, and waiting for the JS to populate some box on the page. Eventually I trudged through to a solution with a Selenium webdriver that could only use Chrome.
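The "waiting for the JS to populate some box" part boils down to polling until an element's text matches what you expect. The following is a minimal sketch of that idea, not the helper from the actual script: wait_until, FakePanel, and the document number are all hypothetical, and the fake panel stands in for a browser so the example runs on its own.

```python
import time


def wait_until(read_text, expected, timeout=10.0, poll_interval=0.25):
    """Poll read_text() until it returns `expected` or `timeout` seconds pass.

    Returns True on a match, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if read_text() == expected:
                return True
        except Exception:
            # The element may not exist yet; treat that as "not ready".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


class FakePanel:
    """Stand-in for a detail panel that fills in after a few polls."""

    def __init__(self, final_text, ready_after):
        self.final_text = final_text
        self.ready_after = ready_after
        self.calls = 0

    def text(self):
        self.calls += 1
        return self.final_text if self.calls >= self.ready_after else ""


# "2005-0012345" is a made-up document number.
panel = FakePanel("2005-0012345", ready_after=3)
matched = wait_until(panel.text, "2005-0012345", timeout=2.0, poll_interval=0.01)
timed_out = wait_until(lambda: "", "never", timeout=0.05, poll_interval=0.01)
```

With a real driver, read_text would be something like a lambda that looks the element up by ID and returns its .text. Selenium also ships WebDriverWait for this kind of explicit wait, which is probably the better choice over a hand-rolled loop.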
The rough Selenium script in Python looked like this:
def search_record_of_deeds_pin(self, rawPin, fileLock):
    driver = self.driver
    driver.delete_all_cookies()
    driver.get(self.base_url + "/i2/default.aspx?AspxAutoDetectCookieSupport=1")
    pin = rawPin.split("-")
    log("Collecting data for PIN {}".format(rawPin))
    # Enter pin and search
    for i in range(5):
        elemName = "SearchFormEx1_PINTextBox" + str(i)
        driver.find_element_by_id(elemName).send_keys(pin[i])
    driver.find_element_by_id("SearchFormEx1_btnSearch").click()
    # Get all result rows
    searchResults = driver.find_elements_by_class_name("DataGridRow") + driver.find_elements_by_class_name("DataGridAlternatingRow")
    jsDocLinks = []
    # Iterate each row, and extract the necessary javascript to run to get each document's details
    for element in searchResults:
        docTypeChild = element.find_element_by_xpath('.//td[4]/a')
        docType = docTypeChild.text
        # For now, just grab MORTGAGEs and WARRANTY DEEDs
        if ("MORTGAGE" in docType) or ("WARRANTY DEED" in docType):
            attr = docTypeChild.get_attribute('href').replace('javascript:', '') + ';'
            docNumber = element.find_element_by_xpath('.//td[5]/a').text
            result = {}
            result['link'] = attr
            result['docNumber'] = docNumber
            result['docType'] = docType
            jsDocLinks.append(result)
    deeds = []
    # For each relevant row, extract the rest of the details
    for document in jsDocLinks:
        result = driver.execute_script(str(document['link']))
        self.waitForIdTextToMatch('DocDetails1_GridView_Details_ctl02_ctl00', document['docNumber'])
        newRecord = DeedRecord("-".join(pin), document['docNumber'], document['docType'])
        newRecord.executedDate = parse(self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl01', ''))
        newRecord.recordedDate = parse(self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl02', ''))
        newRecord.amount = self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl05', '')
        # Grantors and grantees take a little more finesse
        grantElement = driver.find_element_by_id('DocDetails1_GrantorGrantee_Table')
        numGrantors = grantElement.find_element_by_xpath('.//tbody/tr[1]/td/span').text
        numGrantees = grantElement.find_element_by_xpath('.//tbody/tr[3]/td/span').text
        for i in range(int(numGrantors[len(numGrantors)-1])):
            newRecord.grantors.append(self.getTextFromId('DocDetails1_GridView_Grantor_ctl0{}_ctl00'.format(str(2 + i)), ''))
        for i in range(int(numGrantees[len(numGrantees)-1])):
            newRecord.grantees.append(self.getTextFromId('DocDetails1_GridView_Grantee_ctl0{}_ctl00'.format(str(2 + i)), ''))
        deeds.append(newRecord)
    # Sort and save to a csv file
    deeds.sort(key=lambda x: x.executedDate)
    self.outputToCsv(deeds, fileLock)
Special call out to line 35, where I had to
Realizing that no one wanted to run a Python (or any) script to search this info, and that the results needed to be displayed in some user-friendly fashion, I started thinking about how to get some basic web hosting for a program like this. I wanted to experiment with a simple API crawler service, but what a nightmare that would be with my current stack: requiring Chrome to be installed, kicking off a Selenium process, and returning the results. I needed something that could run headless with a smaller footprint and a simpler setup.
Enter PhantomJS. PhantomJS provides a headless, scriptable browser with a lot of the goodies you expect from Selenium: screen capture, a DOM API, and many selector options. I started down this path, but it became a pain because the API is still pretty low-level.