Crawler of Deeds Part 3

Posted August 16, 2016 by Ryan

Actually code!

This is a follow-up to Part 1 and Part 2 of the Crawler series.

If you are interested in setting something like this up, here are some initial steps for crawling a website using a headless browser running in Node. Note that I am using Windows, but everything should be more or less the same if you are on macOS or Linux.

Ensure npm and node are installed.

Create a new directory for your project, move into it, and initialize your package.json:

mkdir crawler
cd crawler
npm init

Install the three libraries we will be using and save them as dependencies of this project:

npm i phantomjs-prebuilt -S
npm i casperjs -S
npm i spooky -S

Create your core logic in crawler-sample.js. For now I’m just going to copy Spooky’s example logic:

function run() {
    var Spooky;
    try {
        Spooky = require('spooky');
    } catch (e) {
        console.log(e.message);
        // Wrap the original error instead of overwriting it.
        var loadError = new Error('Failed to initialize SpookyJS');
        loadError.details = e;
        throw loadError;
    }

    var spooky = new Spooky({
        child: {
            transport: 'http'
        },
        casper: {
            logLevel: 'debug',
            verbose: true
        }
    }, function (err) {
        if (err) {
            console.log(err.message);
            var initError = new Error('Failed to initialize SpookyJS');
            initError.details = err;
            throw initError;
        }

        spooky.start(
            'http://en.wikipedia.org/wiki/Spooky_the_Tuff_Little_Ghost');
        // This function runs in the Casper context, not in Node.
        spooky.then(function () {
            this.emit('hello', 'Hello, from ' + this.evaluate(function () {
                return document.title;
            }));
        });
        spooky.run();
    });

    spooky.on('remote.message', function (msg) {
        console.log('remote message caught: ' + msg);
    });

    // Uncomment this block to see all of the things Casper has to say.
    // There are a lot.
    // He has opinions.
    // spooky.on('console', function (line) {
    //     console.log(line);
    // });

    spooky.on('hello', function (greeting) {
        console.log(greeting);
    });

    spooky.on('log', function (log) {
        if (log.space === 'remote') {
            console.log(log.message.replace(/ \- .*/, ''));
        }
    });
}

// Factory that exposes run() to the entry-point script.
function create() {
    var result = {};
    result.run = run;
    return result;
}

module.exports = create;

Create a local entry-point script, crawler-sample-local.js:

var crawler = require('./crawler-sample.js')();
crawler.run();

Run the sample using node:

node ./crawler-sample-local.js

The end result should be the following output (perhaps with a lot of extra logging if you uncommented the Casper console logging block in the script):

Hello, from Spooky the Tuff Little Ghost - Wikipedia, the free encyclopedia
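If the path from this.emit('hello', ...) to that printed line is not obvious, here is a minimal sketch of the same emit/listen pattern using only Node's built-in EventEmitter. The names fakeSpooky and fakePageStep are hypothetical stand-ins I made up for illustration; the real Spooky instance additionally relays events from the Casper process, which this sketch does not attempt to reproduce.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical stand-in for the Spooky instance (a plain EventEmitter).
var fakeSpooky = new EventEmitter();
var received = [];

// Node side: same shape as the spooky.on('hello', ...) handler in the sample.
fakeSpooky.on('hello', function (greeting) {
    received.push(greeting);
    console.log(greeting);
});

// Stand-in for the step inside spooky.then(); the title is passed in
// here because there is no real page to evaluate.
function fakePageStep(title) {
    fakeSpooky.emit('hello', 'Hello, from ' + title);
}

fakePageStep('Spooky the Tuff Little Ghost');
```

In the real sample the emitting function runs inside Casper rather than in your Node process, so treat this only as a picture of the event wiring.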

Troubleshooting

I have not fully tested this on Linux or macOS, but I had to do the following to get this example working locally:

Ensure Casper and Phantom are accessible from the command line

You should be able to run casperjs --version and phantomjs --version and get 1.1.0 and 2.0.0, respectively. If not, you can install them globally using:

npm i phantomjs-prebuilt --global
npm i casperjs --global

or optionally add the project’s node_modules/.bin folder to your PATH (less desirable).
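As a rough way to check this from Node, the sketch below prepends the project's node_modules/.bin folder to PATH for the current process only, then asks each tool for its version. The helper names withLocalBin and checkTool are my own, not part of Spooky or Casper, and I am assuming the script is run from the project root so that process.cwd() points at the folder containing node_modules.

```javascript
var path = require('path');
var spawnSync = require('child_process').spawnSync;

// Build a PATH value with <projectDir>/node_modules/.bin in front.
function withLocalBin(currentPath, projectDir) {
    var localBin = path.join(projectDir, 'node_modules', '.bin');
    return currentPath ? localBin + path.delimiter + currentPath : localBin;
}

// Run "<command> --version" through the shell and report whether it worked.
function checkTool(command) {
    var result = spawnSync(command + ' --version', {
        shell: true,
        encoding: 'utf8'
    });
    var ok = result.status === 0;
    return {
        command: command,
        ok: ok,
        version: ok ? String(result.stdout).trim() : null
    };
}

// Affects this process and the children it spawns, not your system PATH.
process.env.PATH = withLocalBin(process.env.PATH, process.cwd());

['casperjs', 'phantomjs'].forEach(function (tool) {
    var report = checkTool(tool);
    console.log(tool + ': ' + (report.ok ? report.version : 'not found on PATH'));
});
```

Putting the same process.env.PATH line at the top of the entry-point script might be enough for Spooky to find casperjs without a global install, but I have not verified that, so treat it as something to try rather than a known fix.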

Install tiny-jsonrpc manually

There are a handful of issues on Spooky’s GitHub that point to this, and I was forced to do it on my Windows machine:

npm i tiny-jsonrpc -S
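To see whether this step is needed on your machine, a small check along these lines can tell you which modules Node fails to resolve. This is a sketch: missingModules is a helper name I made up, and it assumes the script is saved inside the project folder, since require.resolve looks modules up relative to the script's own location.

```javascript
// Return the names that Node cannot resolve from this script's location.
function missingModules(names) {
    return names.filter(function (name) {
        try {
            require.resolve(name);
            return false;
        } catch (e) {
            return true;
        }
    });
}

var missing = missingModules(['spooky', 'tiny-jsonrpc']);
console.log(missing.length
    ? 'Missing: ' + missing.join(', ')
    : 'All modules resolved');
```

If tiny-jsonrpc shows up as missing, the install command above is the fix that worked for me.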