
This project will soon be superseded by node-web-crawler.

Flexible Web Crawler

Easily build flexible, scalable, and distributed web crawlers for node.

Simple Example

var flexible = require('flexible');

// Initiate a crawler. Chainable.
var crawler = flexible('http://www.example.com/')
    .use(flexible.pgQueue('postgres://postgres:1234@localhost:5432/'))

    .route('*/search?q=', function (req, res, body, doc, next) {
        console.log('Search results handled for query:', req.params.q);
    })
    .route('*/users/:name', function (req, res, body, doc, next) {
        crawler.navigate('http://www.example.com/search?q=' + req.params.name);
    })
    .route('*', function (req, res, body, doc, next) {
        console.log('Every other document is handled by this route.');
    })

    .on('complete', function () {
        console.log('All of the queued locations have been crawled.');
    })

    .on('error', function (error) {
        console.error('Error:', error.message);
    });

Features

  • An asynchronous, evented API for easily building flexible, scalable, and distributed web crawlers.
  • An array-based queue for small crawls, and a PostgreSQL-based queue for massive, efficient crawls.
  • Uses a fast, lightweight, and forgiving HTML parser to ensure broad document compatibility while crawling.
  • Component system: swap in different queues, a router (wildcards, placeholders, etc.), and other components.
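The router's wildcard and placeholder patterns are the ones used in the Simple Example above ('*', '*/users/:name'). As a rough, self-contained illustration of the idea (this is NOT the library's actual router, and its real matching rules, for example for query-string patterns like '*/search?q=', differ), a matcher might look like:

```javascript
// Illustrative sketch of wildcard ('*') and placeholder (':name') route
// matching, in the spirit of the router component; NOT the library's code.
function matchRoute(pattern, url) {
    var names = [];
    // Escape regex metacharacters, then translate '*' and ':name'.
    var source = pattern
        .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
        .replace(/\*/g, '.*')
        .replace(/:(\w+)/g, function (_, name) {
            names.push(name);
            return '([^/?&]+)';
        });
    var match = url.match(new RegExp('^' + source + '$'));
    if (!match) { return null; }
    var params = {};
    for (var i = 0; i < names.length; i++) {
        params[names[i]] = match[i + 1];
    }
    return params;
}
```

With this sketch, matchRoute('*/users/:name', 'http://www.example.com/users/alice') yields a params object with name set to 'alice', mirroring req.params.name in the routed handlers.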

Installation

npm install flexible

Or from source:

git clone git://github.com/eckardto/flexible.git 
cd flexible
npm link

Complex Example / Demo

flexible 

Crawl the web using Flexible for node.
Usage: node [...]/flexible.bin.js

Options:
  --url, --uri                  URL of web page to begin crawling on.                        [string]  [required]
  --domains, -d                 List of domains to allow crawling of.                        [string]
  --interval, -i                Request interval of each crawler.                          
  --encoding, -e                Encoding of response body for decoding.                      [string]
  --max-concurrency, -m         Maximum concurrency of each crawler.                       
  --max-crawl-queue-length, -M  Maximum length of the crawl queue.                         
  --user-agent, -A              User-agent to identify each crawler as.                      [string]
  --timeout, -t                 Maximum seconds a request can take.                        
  --follow-redirect             Follow HTTP redirection responses.                           [boolean]
  --max-redirects               Maximum amount of redirects.                               
  --proxy, -p                   An HTTP proxy to use for requests.                           [string]
  --controls, -c                Enable pause (ctrl-p), resume (ctrl-r), and abort (ctrl-a).  [boolean]  [default: true]
  --pg-uri, --pg-url            PostgreSQL URI to connect to for queue.                      [string]
  --pg-get-interval             PostgreSQL queue get request interval.                     
  --pg-max-get-attempts         PostgreSQL queue max get attempts.
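For instance, a small distributed crawl could be started with something like the following (a hypothetical invocation; the values are placeholders, and the flags are those listed above):

```shell
# Hypothetical invocation; assumes the package is installed globally
# (npm install -g flexible) so the `flexible` binary is on PATH.
flexible --url http://www.example.com/ \
    --domains example.com \
    --max-concurrency 4 \
    --interval 250 \
    --pg-uri postgres://postgres:1234@localhost:5432/
```

Running the same command on several machines pointed at one PostgreSQL queue is what makes the crawl distributed: each crawler pulls locations from the shared queue.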

API

flexible([options])

Returns a crawler instance that has been configured and, depending on the options, navigated to the given URL and started crawling.

new flexible.Crawler([options])

Returns a new Crawler object.

Crawler#use([component], [callback])

Configure the crawler to use a component.

Crawler#navigate(url, [callback])

Process a location, and have the crawler navigate (queue) to it.

Crawler#crawl([callback])

Have the crawler crawl; the crawl recurses until the queue is drained.

Crawler#pause()

Have the crawler pause crawling.

Crawler#resume()

Have the crawler resume crawling.

Crawler#abort()

Have the crawler abort crawling.

Events

  • navigated (url) Emitted when a location has been successfully navigated (queued) to.
  • document (doc) Emitted when a document is finished being processed by the crawler.
  • paused Emitted when the crawler has paused crawling.
  • resumed Emitted when the crawler has resumed crawling.
  • complete Emitted when all navigated (queued) to locations have been crawled.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
