spindel

A web crawler/spider

"spindel" is the Swedish word for spider.

Installation

Install spindel using npm:

npm install --save spindel

Usage

Module usage

Start with a single url

const spindel = require('spindel');
 
// Start a crawler at http://example.com:
const stream = spindel('http://example.com');
 
stream.on('data', res => {
    // see response object format below
});

Start with multiple urls

// Start a crawler with an initial queue consisting of two urls:
const stream = spindel([
    'http://example.com',
    'http://another.com'
]);
 
stream.on('data', res => {
    // see response object format below
});

Use a database as the url queue

// Start a crawler with a custom queue:
const redisQueue = {
    popUrl() {
        return getNextUrlFromRedisAndReturnAPromise();
    },
    pushUrl(url) {
        return pushUrlToRedisAndReturnAPromise(url);
    }
};
const stream = spindel(redisQueue);
 
stream.on('data', res => {
    // see response object format below
});

API

spindel(urlsOrQueue, options)

Name        | Type                    | Description
urlsOrQueue | String, Array or Object | A single url, an array of urls or a queue implementation
options     | Object                  | The options object

Returns: stream.Readable which emits response objects on the 'data' event.

Options

options.transformHtml

Type: Function
Default: noop

Params:

Name | Type   | Description
body | String | The response body
url  | String | The url for the page being crawled
res  | Object | The full response object

Return value: Any or Promise<Any>.

For responses containing HTML (i.e. with a content-type that begins with text/ and ends with html), this function is run and its return value is stored as transformedHtml in the response object.
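As an illustration of a transformHtml function, here is a minimal sketch that extracts the page title. The name extractTitle and the shape of its return value are assumptions made for this example, not part of spindel, and the regular expression is a shortcut rather than a full HTML parser.

```javascript
// Hypothetical helper (not part of spindel): pull the <title> text out of an
// HTML body. A regular expression is enough for a sketch; a real HTML parser
// would be more robust.
function extractTitle(body, url, res) {
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(body);
    return {
        url,
        title: match ? match[1].trim() : null
    };
}

// Passed as the `transformHtml` option, the return value would be available
// as `res.transformedHtml` for HTML responses:
//
//   const stream = spindel('http://example.com', {transformHtml: extractTitle});
//   stream.on('data', res => console.log(res.transformedHtml));
```

Because the function may also return a Promise, the same shape should work for asynchronous transforms.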

options.gotOptions

Type: Object
Default: {}

Options passed to got.
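For instance, a custom user agent could be set through got's headers option. This is only a sketch: the user-agent string is a placeholder, and which other got options are accepted depends on the got version spindel is installed with.

```javascript
// Options object for spindel(urlsOrQueue, options).
// `headers` is a standard got option; the user-agent value is a placeholder.
const options = {
    gotOptions: {
        headers: {
            'user-agent': 'my-crawler/1.0 (+http://example.com/bot)'
        }
    }
};

// const stream = spindel('http://example.com', options);
```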

Streamed response objects

A response object has the format:

{
    url: String, // the crawled url
    statusCode: Number, // the HTTP status code
    statusMessage: String, // the HTTP status message
    body: String, // the response body
    headers: Object, // the HTTP response headers
    hrefs: Array(String), // found <a href /> urls in the body if content is HTML
    transformedHtml: Any // if content is HTML this contains the value returned by the `transformHtml` option function
}
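A 'data' handler might reduce each response object to a short log line. The sketch below uses a hypothetical summarize helper and a hand-written sample object in the format above. It assumes header names are lower-cased, as they are in Node's http module.

```javascript
// Hypothetical helper (not part of spindel): one-line summary of a response.
function summarize(res) {
    const contentType = (res.headers && res.headers['content-type']) || 'unknown';
    const links = Array.isArray(res.hrefs) ? res.hrefs.length : 0;
    return `${res.statusCode} ${res.url} (${contentType}, ${links} links)`;
}

// Hand-written sample in the documented format, not real crawler output.
const sample = {
    url: 'http://example.com',
    statusCode: 200,
    statusMessage: 'OK',
    body: '<a href="http://example.com/about">About</a>',
    headers: {'content-type': 'text/html'},
    hrefs: ['http://example.com/about'],
    transformedHtml: undefined
};

console.log(summarize(sample));

// With a real crawler: stream.on('data', res => console.log(summarize(res)));
```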

Queue implementation

A queue implementation is an object with two functions, popUrl and pushUrl.

queue.popUrl

Type: function

Params:

Name    | Type   | Description
lastUrl | String | The last crawled url, or null for the first url

Should return: a String or Promise<String> with the next url to crawl, or null or Promise<null> to stop crawling.
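Returning null from popUrl can be used to cap a crawl. The following sketch wraps any queue and stops after a fixed number of urls. Both limitQueue and maxPages are names invented for this example.

```javascript
// Hypothetical wrapper (not part of spindel): delegate to another queue,
// but stop the crawl after `maxPages` urls have been handed out.
function limitQueue(queue, maxPages) {
    let handedOut = 0;

    return {
        pushUrl(href, referral) {
            return queue.pushUrl(href, referral);
        },
        popUrl(lastUrl) {
            if (handedOut >= maxPages) {
                return null; // null tells the crawler to stop
            }
            handedOut++;
            return queue.popUrl(lastUrl);
        }
    };
}

// const stream = spindel(limitQueue(redisQueue, 100));
```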

queue.pushUrl

Type: function

Params:

Name     | Type   | Description
href     | String | A found href in the currently crawled response body
referral | String | The url for the current crawl

Should return: nothing or Promise.
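The referral parameter makes it possible to filter urls before queueing them. Below is a minimal sketch of a queue that stays on one host, skips urls it has already seen, and crawls breadth-first. The name sameHostQueue is hypothetical. Resolving each href against referral is a precaution, since it is not stated here whether hrefs arrive as absolute urls.

```javascript
// Hypothetical queue (not part of spindel): breadth-first, de-duplicated,
// and restricted to the host of the start url.
function sameHostQueue(startUrl) {
    const start = new URL(startUrl);
    const pending = [start.href];
    const seen = new Set(pending);

    return {
        pushUrl(href, referral) {
            let absolute;
            try {
                // Resolve relative hrefs against the page they were found on.
                absolute = new URL(href, referral);
            } catch (err) {
                return; // ignore hrefs that are not valid urls
            }
            absolute.hash = ''; // treat #fragments as the same page
            if (absolute.host !== start.host || seen.has(absolute.href)) {
                return;
            }
            seen.add(absolute.href);
            pending.push(absolute.href);
        },
        popUrl() {
            // shift() returns the oldest url first; null stops the crawl.
            return pending.length > 0 ? pending.shift() : null;
        }
    };
}

// const stream = spindel(sameHostQueue('http://example.com'));
```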

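Example of the internal ArrayQueue

function arrayQueue(initialUrls) {
    // Copy the array so the caller's list is not mutated.
    const urls = initialUrls.slice();

    return {
        pushUrl(url) {
            urls.push(url);
        },
        popUrl() {
            // pop() takes the most recently added url (last in, first out).
            return urls.pop();
        }
    };
}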

The queue implementation above is used if spindel's urlsOrQueue parameter is a String or Array.

License

MIT © Joakim Carlstein
