spindel
A web crawler/spider
"spindel" is the Swedish word for spider.
Installation
Install spindel using npm:

    npm install --save spindel
Usage
Module usage
Start with single url
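    const spindel = require('spindel');

    // Start a crawler at http://example.com:
    const stream = spindel('http://example.com');
    stream.on('data', (response) => console.log(response.url));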
Start with multiple urls
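    // Start a crawler with an initial queue consisting of two urls:
    const stream = spindel(['http://example.com', 'http://example.org']);
    stream.on('data', (response) => console.log(response.url));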
Use a database as url queue
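    // Start a crawler with a custom queue:
    const redisQueue = {
      popUrl(lastUrl) { return /* next url from the database, or null to stop */; },
      pushUrl(href, referral) { return /* store href in the database */; }
    };
    const stream = spindel(redisQueue);
    stream.on('data', (response) => console.log(response.url));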
API
spindel(urlsOrQueue, options)
Name | Type | Description |
---|---|---|
urlsOrQueue | String, Array or Object | A single url, an array of urls or a queue implementation |
options | Object | The options object |
Returns: stream.Readable which emits response objects on the 'data' event.
Options
options.transformHtml
Type: Function
Default: noop
Params:
Name | Type | Description |
---|---|---|
body | String | The response body |
url | String | The url for the page being crawled |
res | Object | The full response object |
Return value: Any or Promise<Any>.
For responses containing HTML (i.e. with a content-type that begins with text/ and ends with html), this function is run and its return value is stored as transformedHtml in the response object.
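
As an illustration of the signature described above, here is a minimal sketch of a transformHtml function that reduces each HTML page to the text of its `<title>` element. The name `extractTitle` and the regex-based parsing are assumptions made for this example, not part of spindel; a real crawler would more likely use a proper HTML parser.

```javascript
// Hypothetical transformHtml implementation: returns the page title, or null
// when the body has no <title> element. A regex keeps the sketch dependency-free.
function extractTitle(body, url, res) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(body);
  return match ? match[1].trim() : null;
}

// Options object intended as the second argument to spindel,
// e.g. spindel('http://example.com', options).
const options = { transformHtml: extractTitle };
```

With these options, each HTML response object would be expected to carry the extracted title in transformedHtml.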
options.gotOptions
Type: Object
Default: {}
Options passed to got.
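
A small sketch of what such an options object might look like, assuming only that got accepts a `headers` option. The user-agent value is made up for illustration; which other got options are available depends on the got version spindel uses.

```javascript
// Options for spindel; gotOptions is handed on to got.
const options = {
  gotOptions: {
    // `headers` is a got option; the user-agent string is an example value.
    headers: { 'user-agent': 'my-crawler/1.0' }
  }
};
```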
Streamed response objects
A response object has the format:
    url: String               // the crawled url
    statusCode: Number        // the HTTP status code
    statusMessage: String     // the HTTP status message
    body: String              // the response body
    headers: Object           // the HTTP response headers
    hrefs: Array<String>      // found <a href /> urls in the body if content is HTML
    transformedHtml: String   // if content is HTML this contains the `body` after applying the `transformHtml` option function
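
To show how the fields above might be consumed, here is a minimal sketch of a handler that turns a response object into a one-line summary. The `summarize` helper and the sample values are hypothetical; only the field names come from the format above.

```javascript
// Builds a one-line summary from a response object in the format above.
// hrefs is treated as optional, since it is only described for HTML content.
function summarize(response) {
  const links = Array.isArray(response.hrefs) ? response.hrefs.length : 0;
  return `${response.statusCode} ${response.url} (${links} links)`;
}

// Sample object with made-up values, shaped like a streamed response.
const sample = {
  url: 'http://example.com/',
  statusCode: 200,
  statusMessage: 'OK',
  body: '<a href="http://example.com/about">About</a>',
  headers: { 'content-type': 'text/html' },
  hrefs: ['http://example.com/about']
};

console.log(summarize(sample)); // prints "200 http://example.com/ (1 links)"
```

Such a helper could be attached to the crawler's stream with `stream.on('data', (response) => console.log(summarize(response)))`.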
Queue implementation
A queue implementation consists of two functions, popUrl and pushUrl.
queue.popUrl
Type: function
Params:
Name | Type | Description |
---|---|---|
lastUrl | String | The last crawled url, or null for the first url |
Should return: String or Promise<String> to continue crawling, or null or Promise<null> to stop crawling.
queue.pushUrl
Type: function
Params:
Name | Type | Description |
---|---|---|
href | String | A found href in the currently crawled response body |
referral | String | The url for the current crawl |
Should return: nothing or Promise.
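
The popUrl/pushUrl contract described above can be sketched with a small in-memory queue that hands out each url at most once and stops after a fixed number of urls. The name `createUniqueQueue` and the `limit` parameter are assumptions for this example and are not part of spindel.

```javascript
// Hypothetical in-memory queue: each url is handed out at most once, and
// crawling stops after `limit` urls have been returned from popUrl.
function createUniqueQueue(initialUrls, limit) {
  const pending = [].concat(initialUrls); // accepts a single url or an array
  const seen = new Set(pending);
  let handedOut = 0;
  return {
    // Receives the last crawled url (null on the first call).
    popUrl(lastUrl) {
      if (handedOut >= limit || pending.length === 0) return null; // stop crawling
      handedOut += 1;
      return pending.shift();
    },
    // Receives each href found in the current response body.
    pushUrl(href, referral) {
      if (seen.has(href)) return; // skip urls that were already queued
      seen.add(href);
      pending.push(href);
    }
  };
}
```

An object created this way would be passed as the first argument, e.g. `spindel(createUniqueQueue('http://example.com', 100))`.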
Example of the internal ArrayQueue
    function ArrayQueue(initialUrls) {
      const urls = initialUrls;
      return {
        pushUrl(href) { urls.push(href); },
        popUrl() { return urls.shift(); }
      };
    }
The queue implementation above is used if spindel's urlsOrQueue parameter is a String or Array.
License
MIT © Joakim Carlstein