Crawler of Deeds Part 1

Posted August 8, 2016 by Ryan

The Problem

I met with a local community group to discuss an issue they were having in traversing a county-run website. It’s a story familiar to almost anyone who has interacted with a deeds or records related government site: it’s time consuming, it’s a terrible user experience, and it’s difficult to answer anything but the simplest questions with the information it provides. What started as a simple scraping program turned into something a lot more interesting, so I decided to keep track of the obstacles I run into and the technologies I’m using to solve this issue.

Iterative Development

To start things off, I got a handful of use cases and went through the site myself. It was extremely time consuming – each ID I needed to look up and gather information on took about 2 to 5 minutes end-to-end. Sifting through the site, a few things were clear:

  • The URL never changed, which was an extra twist of the knife because it was a damn IP URL.
  • Tons of one-off JS scripts were central to this ASP.NET 2.0 application (an old version, deprecated as of a decade ago)
  • Huge POST payloads thick with stateful information, and sometimes callbacks that required a redirect
  • A simple series of GET/POST requests would not do the trick
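To make the "stateful POST payload" point concrete, here is a minimal sketch of what a plain HTTP client would have to do before every request: pull the hidden form fields out of the previous response and echo them back. It assumes the site uses the standard ASP.NET Web Forms hidden inputs (`__VIEWSTATE`, `__EVENTVALIDATION`); the sample HTML and its values are made up for illustration and are not taken from the county site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up page fragment; real view state values run to many kilobytes.
SAMPLE_PAGE = """
<form method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="made-up-view-state" />
  <input type="hidden" name="__EVENTVALIDATION" value="made-up-validation" />
  <input type="text" name="SearchFormEx1_PINTextBox0" />
</form>
"""

hidden = extract_hidden_fields(SAMPLE_PAGE)
```

Every POST would need to carry these fields forward from the previous response, and that still would not cover the values the page's own JavaScript computes, which is why driving a real browser looked more attractive.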

I wanted to get a feel for what automating some of these searches would look like. Enter the main tool I was familiar with for automating searches through a site: Selenium. I wrote a quick and dirty Python script to get my feet wet. It was pretty rough: a gaggle of IDs and XPaths, clicks, and waiting for the JS to populate some box on the page. Eventually I trudged through to a solution with a Selenium webdriver that could only use Chrome.
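The "waiting for the JS to populate some box" part boils down to polling. The helper below is a minimal sketch of that idea, not the code from my script: `wait_for_text` and its parameters are names invented for this example, and it takes any callable that returns the current text so it can be tried without a browser.

```python
import time


def wait_for_text(get_text, expected, timeout=10.0, poll_interval=0.25):
    """Poll get_text() until its result contains `expected`.

    Returns True on a match, False if `timeout` seconds pass first.
    Errors raised by get_text() (for example, the element not existing
    yet) are treated as "not ready" and retried.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if expected in get_text():
                return True
        except Exception:
            pass  # the page is probably still rendering; try again
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With the Selenium bindings used in this post, the callable could be something like `lambda: driver.find_element_by_id(some_id).text`. Selenium also ships its own `WebDriverWait` and `expected_conditions` helpers, which are likely the better choice in real code.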

The rough Selenium script in Python looked like this:

# parse is presumably dateutil.parser.parse; log and DeedRecord are helpers defined elsewhere
def search_record_of_deeds_pin(self, rawPin, fileLock):
    driver = self.driver
    driver.get(self.base_url + "/i2/default.aspx?AspxAutoDetectCookieSupport=1")
    pin = rawPin.split("-")
    log("Collecting data for PIN {}".format(rawPin))
    # Enter pin and search: each of the five PIN segments has its own text box
    # (the calls that fill each box and submit the search are not shown in this listing)
    for i in range(5):
        elemName = "SearchFormEx1_PINTextBox" + str(i)
    # Get all result rows
    searchResults = (driver.find_elements_by_class_name("DataGridRow")
                     + driver.find_elements_by_class_name("DataGridAlternatingRow"))
    jsDocLinks = []
    # Iterate each row, and extract the necessary javascript to run to get each document's details
    for element in searchResults:
        docTypeChild = element.find_element_by_xpath('.//td[4]/a')
        docType = docTypeChild.text
        # For now, just grab MORTGAGEs and WARRANTY DEEDs
        if ("MORTGAGE" in docType) or ("WARRANTY DEED" in docType):
            attr = docTypeChild.get_attribute('href').replace('javascript:', '') + ';'
            docNumber = element.find_element_by_xpath('.//td[5]/a').text
            result = {}
            result['link'] = attr
            result['docNumber'] = docNumber
            result['docType'] = docType
            # Queue the row so the loop below can open its details
            jsDocLinks.append(result)
    deeds = []
    # For each relevant row, extract the rest of the details
    for document in jsDocLinks:
        result = driver.execute_script(str(document['link']))
        self.waitForIdTextToMatch('DocDetails1_GridView_Details_ctl02_ctl00', document['docNumber'])
        newRecord = DeedRecord("-".join(pin), document['docNumber'], document['docType'])
        newRecord.executedDate = parse(self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl01', ''))
        newRecord.recordedDate = parse(self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl02', ''))
        newRecord.amount = self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl05', '')
        # Grantors and grantees take a little more finesse:
        # the last character of each header span is read as the count
        grantElement = driver.find_element_by_id('DocDetails1_GrantorGrantee_Table')
        numGrantors = grantElement.find_element_by_xpath('.//tbody/tr[1]/td/span').text
        numGrantees = grantElement.find_element_by_xpath('.//tbody/tr[3]/td/span').text
        for i in range(int(numGrantors[-1])):
            newRecord.grantors.append(self.getTextFromId('DocDetails1_GridView_Grantor_ctl0{}_ctl00'.format(2 + i), ''))
        for i in range(int(numGrantees[-1])):
            newRecord.grantees.append(self.getTextFromId('DocDetails1_GridView_Grantee_ctl0{}_ctl00'.format(2 + i), ''))
        # Keep the finished record so it is sorted and written out below
        deeds.append(newRecord)
    # Sort and save to a csv file
    deeds.sort(key=lambda x: x.executedDate)
    self.outputToCsv(deeds, fileLock)
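One fragile spot in the script above is the grantor/grantee handling: it reads only the last character of the count label and hard-codes a single-digit `ctl0{}` row ID, so a document with ten or more parties would be misread. A slightly sturdier sketch follows. The helper names are invented for this example, and two things are assumptions rather than facts about the site: that the label ends with the count, and that the row IDs keep the usual two-digit, zero-padded `ctlNN` pattern.

```python
import re


def parse_party_count(label):
    """Return the integer at the end of a label such as 'Grantors: 12'."""
    match = re.search(r"(\d+)\s*$", label)
    if match is None:
        raise ValueError("no trailing count in label: {!r}".format(label))
    return int(match.group(1))


def grid_cell_id(grid, row, column=0):
    """Build an element ID like DocDetails1_GridView_Grantor_ctl02_ctl00."""
    return "DocDetails1_GridView_{}_ctl{:02d}_ctl{:02d}".format(grid, row, column)


def party_cell_ids(grid, label):
    """List the element IDs for every party row; data rows start at ctl02."""
    count = parse_party_count(label)
    return [grid_cell_id(grid, 2 + i) for i in range(count)]
```

Each ID returned by `party_cell_ids` could then be passed to the same `getTextFromId` call the script already uses.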

Special call out to line 35, where I had to

Realizing that no one wanted to run a Python (or any) script to search this info, and that the results needed to be displayed in some user-friendly fashion, I started thinking about how to get some basic web hosting for a program like this. I wanted to experiment with a simple API crawler service, but that would be a nightmare with my current stack: it meant requiring Chrome to be installed, kicking off a Selenium process, and returning the results. I needed something that could run headless with a smaller footprint and a simpler setup.
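For a rough idea of what a "simple API crawler service" could look like, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: the `/deeds?pin=...` route, the `lookup_deeds` stand-in (which would wrap the crawler and here just returns an empty list), and the host and port.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def lookup_deeds(pin):
    """Hypothetical stand-in for the crawler; a real version would scrape the site."""
    return []


def handle_request(path):
    """Map a request path such as /deeds?pin=... to (status, body)."""
    parsed = urlparse(path)
    if parsed.path != "/deeds":
        return 404, {"error": "not found"}
    pins = parse_qs(parsed.query).get("pin")
    if not pins:
        return 400, {"error": "missing pin parameter"}
    return 200, {"pin": pins[0], "deeds": lookup_deeds(pins[0])}


class DeedsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_request(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Run the service until interrupted."""
    HTTPServer((host, port), DeedsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The routing itself is trivial; the hard part is that each call to `lookup_deeds` would have to spin up a full browser, which is exactly the footprint problem described above.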

Enter PhantomJS. PhantomJS provides a headless, scriptable browser with a lot of the goodies you expect from Selenium: screen capture, DOM API, and many selector options. I started down this path, but it was becoming a pain because the API is still pretty low level.
