# spyder - Indexer and scraper runner
`spyder` provides basic architecture for running indexers and scrapers. It comes as a CLI tool to which you provide configuration using either the `-c` or `--config` parameter (`spyder -c ./config.js`).

It is also possible to point it at a directory containing a `spyder_config.js`. In that case `spyder` tries to load the configuration automatically. Example: `spyder demo`.

You can also pass additional parameters to both commands. They will override the default configuration.
## As a Module
`spyder` can also be used as a regular Node module. It expects configuration as its parameter like this:

```javascript
var spyder = require('spyder');

spyder(config);
```
## config.js

Consider the following `config.js` for basic configuration:
```javascript
module.exports = {
    // workers
    initializer: require('./initializer'), // optional
    indexer: require('./indexer'),
    scraper: require('./scraper'),

    // events
    onError: require('./error'),
    onResult: require('./result'),
    onFinish: require('./finish'),

    // other
    variance: 5000 // variance between scrape operations in ms
};
```
## Workers
`spyder` provides three workers into which you may attach actual functionality. `initializer` is executed once when the `spyder` process is started; you may set auth keys and such there. `indexer` is run once per scraping round. `scraper` is executed once per each url returned by the `indexer`.
### Initializer
`initializer` is optional. A basic implementation could look like this:
```javascript
module.exports = function(o, cb) {
    // do something with o now

    // ...

    cb(); // done
};
```
The first parameter will contain the arguments passed to the `spyder` process. This behavior is the same for all workers.
### Indexer
An `indexer` could look like this:
```javascript
module.exports = function(o, cb) {
    // index some page or pages here

    // once finished, invoke the callback with the urls to scrape
    cb(null, urls);
};
```
Remember to return the urls you want to scrape here. In case you run into an error, pass it as the first parameter to the callback.
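For instance, an indexer following that convention might look like this (the static url list is just an illustration to keep the example self-contained; a real indexer would fetch and parse an index page):

```javascript
// Hypothetical indexer sketch: returns a list of urls to scrape,
// or an error as the first callback parameter when it finds nothing.
var indexer = function (o, cb) {
    var urls = [
        'http://example.com/page-1',
        'http://example.com/page-2'
    ];

    if (urls.length === 0) {
        // ran into a problem -> the error goes first
        return cb(new Error('no urls found'));
    }

    // urls to scrape go to the second parameter
    cb(null, urls);
};

module.exports = indexer;
```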
### Scraper
A `scraper` could look like this:
```javascript
module.exports = function(o, url, cb) {
    // scrape the content from url now

    // once finished, invoke the callback with the result
    cb(null, result);
};
```
The same idea as earlier applies here. The function receives first the arguments passed to `spyder`, then the url to scrape, and finally a callback to call when finished.
## Events
When an error is received, the module defined at `onError` is invoked. When a scraping result is received, the `onResult` module is invoked. Once the whole process has finished, `onFinish` is invoked. Like the workers above, each handler receives the arguments object. You can for instance inject an object into it at `initializer` and then use that to perform some operation. To give you an idea of what these files should look like, consider the following.
`./error.js`:

```javascript
module.exports = function(o, err) {
    // let's just log errors for now
    // this is also the default behavior. if you don't provide a handler,
    // spyder defaults to this
    console.error(err);
};
```
`./result.js`:

```javascript
module.exports = function(o, result, cb) {
    // got some scraping result now, do something with it
    // spyder defaults to console.log (handy during development)
    console.log(result);

    // the callback is optional and allows you to communicate possible errors
    cb();
};
```
`./finish.js`:

```javascript
module.exports = function(o) {
    // spyder default
    console.log('Done');
};
```
## Other

* `variance` - Use `variance` to add an arbitrary, random delay between scrape operations to make traffic look more irregular.
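The exact scheduling is internal to `spyder`, but the effect of `variance` can be pictured roughly like this (a sketch, not `spyder`'s actual implementation):

```javascript
// Rough illustration only: each scrape gets an extra random delay
// somewhere in [0, variance) milliseconds, so requests are not
// spaced at a fixed, machine-like interval.
function randomDelay(variance) {
    return Math.random() * variance;
}

// a runner could then space out operations along the lines of:
// setTimeout(scrape, randomDelay(5000));
```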
## License

`spyder` is available under MIT. See LICENSE for more details.