Blackfall - Crawler of Deeds #1

The Problem

I met with a local community group to discuss an issue they were having in traversing a county-run website. It’s a story familiar to most anyone who has interacted with a deeds or records-related government site: it’s time-consuming, it’s a terrible user experience, and it’s difficult to answer anything but the simplest questions with the information it provides. What started as a simple scraping program turned into something a lot more interesting, so I decided to keep track of the obstacles I’m running into and the technologies I’m using to solve them.

Iterative Development

To start things off, I got a handful of use cases and went through the site myself. It was extremely time-consuming – each ID I needed to look up and gather information on took about 2 to 5 minutes end-to-end. Sifting through the site, a few things were clear:

The URL never changed. Which was an extra twist of the knife because it was a damn IP URL: http://162.217.184.82/i2/default.aspx?AspxAutoDetectCookieSupport=1
Tons of one-off JS scripts were central to this ASP.NET 2.0 application (an old version, deprecated as of a decade ago)
Huge POST payloads thick with stateful information, and sometimes callbacks that required a redirect
A simple series of GET/POST scripts would not do the trick
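
To make the "stateful POST payload" point concrete: ASP.NET Web Forms pages typically keep their state in hidden form fields such as `__VIEWSTATE` and `__EVENTVALIDATION`, and expect them echoed back on every POST. The sketch below uses only the Python standard library to collect the hidden inputs from a page. It assumes the county site follows that usual Web Forms pattern; the sample markup, the field values, and the `HiddenFieldCollector` name are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def collect_hidden_fields(html):
    """Return a dict of every hidden input found in the given HTML."""
    parser = HiddenFieldCollector()
    parser.feed(html)
    return parser.fields


# Made-up sample markup, shaped like a typical Web Forms page.
SAMPLE_PAGE = """
<form method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf" />
  <input type="text" name="pin0" value="" />
</form>
"""

hidden_fields = collect_hidden_fields(SAMPLE_PAGE)
print(sorted(hidden_fields))
```

Every one of those values would have to be captured and replayed on each request, and some only appear after the page's JavaScript runs, which is why a plain request script gets painful quickly.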

I wanted to get a feel for what automating some of these searches would look like. Enter the main tool I was familiar with for automating searches through a site: Selenium. I wrote a quick and dirty Python script to get my feet wet. It was pretty rough: a gaggle of IDs and XPaths, clicks, and waiting for the JS to populate some box on the page. Eventually I trudged through to a solution with a Selenium webdriver that could only use Chrome.

The rough Selenium script in Python looked like this:

                      
                        def search_record_of_deeds_pin(self, rawPin, fileLock):
                      
                          driver = self.driver
                      
                          driver.delete_all_cookies()
                      
                          driver.get(self.base_url + "/i2/default.aspx?AspxAutoDetectCookieSupport=1")
                      
                          pin = rawPin.split("-")
                      
                          log("Collecting data for PIN {}".format(rawPin))
                      
                          # Enter pin and search
                      
                          for i in range(5):
                      
                              elemName = "SearchFormEx1_PINTextBox" + str(i)
                      
                              driver.find_element_by_id(elemName).send_keys(pin[i])
                      
                          driver.find_element_by_id("SearchFormEx1_btnSearch").click()
                      
                          # Get all result rows
                      
                          searchResults = driver.find_elements_by_class_name("DataGridRow") + driver.find_elements_by_class_name("DataGridAlternatingRow")
                      
                          jsDocLinks = []
                      
                          # Iterate each row, and extract the necessary javascript to run to get each document's details
                      
                          for element in searchResults:
                      
                              docTypeChild = element.find_element_by_xpath('.//td[4]/a')
                      
                              docType = docTypeChild.text
                      
                              # For now, just grab MORTGAGEs and WARRENTY DEEDs
                      
                              if ("MORTGAGE" in docType) or ("WARRANTY DEED" in docType):
                      
                                  attr = docTypeChild.get_attribute('href').replace('javascript:', '') + ';'
                      
                                  docNumber = element.find_element_by_xpath('.//td[5]/a').text
                      
                                  result = {}
                      
                                  result['link'] = attr
                      
                                  result['docNumber'] = docNumber
                      
                                  result['docType'] = docType
                      
                                  jsDocLinks.append(result)
                      
                          deeds = []
                      
                          # For each relevant row, extract the rest of the details
                      
                          for document in jsDocLinks:
                      
                              result = driver.execute_script(str(document['link']))
                      
                              self.waitForIdTextToMatch('DocDetails1_GridView_Details_ctl02_ctl00', document['docNumber'])
                      
                              newRecord = DeedRecord("-".join(pin), document['docNumber'], document['docType'])
                      
                              newRecord.executedDate = parse(self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl01', ''))
                      
                              newRecord.recordedDate = parse(self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl02', ''))
                      
                              newRecord.amount = self.getTextFromId('DocDetails1_GridView_Details_ctl02_ctl05', '')
                      
                              # Grantors and grantees take a little more finesse
                      
                              grantElement = driver.find_element_by_id('DocDetails1_GrantorGrantee_Table')
                      
                              numGrantors = grantElement.find_element_by_xpath('.//tbody/tr[1]/td/span').text
                      
                              numGrantees = grantElement.find_element_by_xpath('.//tbody/tr[3]/td/span').text
                      
                              for i in range(int(numGrantors[len(numGrantors)-1])):
                      
                                  newRecord.grantors.append(self.getTextFromId('DocDetails1_GridView_Grantor_ctl0{}_ctl00'.format(str(2 + i)), ''))
                      
                              for i in range(int(numGrantees[len(numGrantees)-1])):
                      
                                  newRecord.grantees.append(self.getTextFromId('DocDetails1_GridView_Grantee_ctl0{}_ctl00'.format(str(2 + i)), ''))
                      
                              deeds.append(newRecord)
                      
                          # Sort and save to a csv file
                      
                          deeds.sort(key=lambda x: x.executedDate)
                      
                          self.outputToCsv(deeds, fileLock)
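
The script leans on a `waitForIdTextToMatch` helper that isn't shown here. As a rough sketch of the polling idea behind a helper like that, assuming it simply re-reads a value until it matches or a timeout expires: `wait_for_text_to_match`, its parameters, and the fake page reads below are hypothetical stand-ins, not the original implementation.

```python
import time


def wait_for_text_to_match(read_text, expected, timeout=10.0, interval=0.25):
    """Poll read_text() until it returns expected; raise TimeoutError otherwise.

    read_text is any zero-argument callable, for example a lambda that looks
    up an element by ID with a Selenium driver and returns its text.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if read_text() == expected:
                return
        except Exception:
            # The element may not exist yet while the page's JS is still running.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "text did not become {!r} within {}s".format(expected, timeout)
            )
        time.sleep(interval)


# Stand-in for a page whose JS fills in a (made-up) document number on the third read.
reads = iter(["", "", "2015-000123"])
wait_for_text_to_match(lambda: next(reads), "2015-000123", timeout=2.0, interval=0.01)
print("matched")
```

Taking a callable instead of a driver keeps the waiting logic separate from Selenium, so it can be exercised without a browser.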

Special call out to line 35, where I had to

Realizing that no one wanted to run a Python (or any) script to search this info, and that the results needed to be displayed in some user-friendly fashion, I started thinking about how to get some basic web hosting for a program like this. I wanted to experiment with a simple API crawler service, but what a nightmare with my current stack: requiring Chrome to be installed, kicking off a Selenium process, and returning the results. I needed something that could run headless with a smaller footprint and a simpler setup.
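
For a sense of what a "simple API crawler service" could look like, here is a minimal sketch using only Python's standard library. The `/deeds?pin=...` route, the `lookup_deeds` stub, and the JSON shape are all assumptions for illustration; a real service would run the crawler where the stub is.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def lookup_deeds(pin):
    """Hypothetical stub: a real service would run the crawler for this PIN."""
    return [{"pin": pin, "docType": "WARRANTY DEED", "docNumber": "placeholder"}]


class DeedsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        pins = parse_qs(url.query).get("pin", [])
        if url.path != "/deeds" or not pins:
            self.send_error(404, "expected /deeds?pin=...")
            return
        body = json.dumps(lookup_deeds(pins[0])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the sketch quiet; the default logs each request to stderr.


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), DeedsHandler)
```

Calling `make_server(8000).serve_forever()` would serve lookups on port 8000. The hard part is everything behind `lookup_deeds`: with the Selenium stack, each request would mean a Chrome install on the host and a browser process to manage.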

Enter PhantomJS. PhantomJS provides a headless, scriptable browser with a lot of the goodies you expect from Selenium: screen capture, a DOM API, and many selector options. I started down this path, but it became a pain because the API is still pretty low-level.
