robots.js
robots.js is a parser for robots.txt files for node.js.
Installation
It's recommended to install via npm:
$ npm install -g robots
Usage
Here's an example of using robots.js:
```js
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access) {
      if (access) {
        // here you can fetch the url
      }
    });
  }
});
```
Default crawler user-agent is:
Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0
Here's an example of using another user-agent and a more detailed callback:
```js
var robots = require('robots')
  , parser = new robots.RobotsParser(
      'http://nodeguide.ru/robots.txt',
      'Mozilla/5.0 (compatible; RobotTxtBot/1.0)',
      after_parse
    );

function after_parse(parser, success) {
  if (success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access, url, reason) {
      if (access) {
        console.log(' url: ' + url + ', access: ' + access);
        // here you can fetch the url
      }
    });
  }
}
```
Here's an example of getting the list of sitemaps:
```js
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    parser.getSitemaps(function(sitemaps) {
      // sitemaps — array of sitemap urls
    });
  }
});
```
Here's an example of getCrawlDelay usage:
```js
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    // for example:
    //
    //   $ curl -s http://nodeguide.ru/robots.txt
    //
    //   User-agent: Google-bot
    //   Disallow: /
    //   Crawl-delay: 2
    //
    //   User-agent: *
    //   Disallow: /
    //   Crawl-delay: 2
    var crawlDelay = parser.getCrawlDelay('Google-bot');
    // crawlDelay == 2
  }
});
```
An example of passing options to the HTTP request:
```js
var options = {
  headers: {
    Authorization: "Basic " + new Buffer("username:password").toString("base64")
  }
};

var robots = require('robots')
  , parser = new robots.RobotsParser(null, options);

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  // ...
});
```
API
RobotsParser — main class. This class provides a set of methods to read, parse and answer questions about a single robots.txt file.
- setUrl(url, read) — sets the URL referring to a robots.txt file. By default, it also invokes the read() method. If read is a function, it is called once the remote file has been downloaded and parsed; it receives two arguments: the parser itself and a boolean that is true if the remote file was successfully parsed.
- read(after_parse) — reads the robots.txt URL and feeds it to the parser
- parse(lines) — parses the given lines from a robots.txt file (see the sketch after this list)
- canFetch(userAgent, url, callback) — using the parsed robots.txt, decide if userAgent can fetch url. Callback function:

      function callback(access, url, reason) { ... }

  where:
  - access — can this url be fetched. true/false.
  - url — target url
  - reason — reason for access. Object:
    - type — valid values: 'statusCode', 'entry', 'defaultEntry', 'noRule'
    - entry — an instance of lib/Entry.js. Only for types 'entry' and 'defaultEntry'.
    - statusCode — HTTP response status code for url. Only for type 'statusCode'.
- canFetchSync(userAgent, url) — using the parsed robots.txt, decide if userAgent can fetch url. Returns true/false.
- getCrawlDelay(userAgent) — returns the Crawl-delay for the given userAgent
- getSitemaps(callback) — gets the Sitemaps from the parsed robots.txt; the callback receives an array of sitemap URLs
- getDisallowedPaths(userAgent) — gets the paths explicitly disallowed for the specified user agent AND * (see the sketch after this list)
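The synchronous methods can also be exercised without fetching a remote file by feeding raw robots.txt lines straight to parse(). The sketch below is not taken from the library's documentation: the bot name, rules, and paths are made-up, and it assumes parse() accepts an array of lines as described above.

```js
// Minimal sketch with made-up rules and bot name, assuming parse()
// accepts an array of raw robots.txt lines as described in the API list.
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.parse([
  'User-agent: HypotheticalBot',
  'Disallow: /private/',
  'Crawl-delay: 3',
  '',
  'User-agent: *',
  'Disallow: /tmp/'
]);

// synchronous check, no callback needed
parser.canFetchSync('HypotheticalBot', '/index.html');   // true
parser.canFetchSync('HypotheticalBot', '/private/page'); // false

// paths disallowed for 'HypotheticalBot' and for '*';
// should include '/private/' and '/tmp/'
parser.getDisallowedPaths('HypotheticalBot');

// Crawl-delay declared for this user agent: 3
parser.getCrawlDelay('HypotheticalBot');
```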
License
See LICENSE file.