# bolero
Web crawler for Node and browsers. It can run standalone inside a page, be remotely controlled from Node, or run purely in Node. When it runs inside a page, DOM operations are available for scraping data.
## Installation

```shell
npm install --save bolero
```
## Usage
- To run it in a browser, add `bolero.js` to your page, and install tentacle.user.js (a user-script file) in your browser:
```js
var Crawler = bolero.Crawler;
var linkExt = bolero.extractor.linkExtractor;

var crawler = new Crawler({
  url: 'http://example.com', // or an object. see detail below
  // this is a default callback for all urls. optional
  callback: function (response) {
    // extract and return your data.
  },
  // called after callback to do some common operations. optional
  afterCallback: function (response) {},
  // called if all urls are fetched. optional
  onEnd: function () {}
});

// finally, call
crawler.start();
// call crawler.pause() if needed
```
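The `linkExtractor` above is what pulls new urls out of each fetched page so the crawler can queue them. As a rough, self-contained illustration of what such an extractor does (the `extractLinks` name and the regex approach are mine, not bolero's implementation):

```javascript
// Minimal sketch of link extraction from an HTML string.
// Illustrative only -- bolero ships its own linkExtractor.
function extractLinks(html) {
  var links = [];
  var re = /<a\s[^>]*href=["']([^"']+)["']/gi;
  var m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]); // the captured href value
  }
  return links;
}

var html = '<a href="http://example.com/a">A</a> <a href="/b">B</a>';
console.log(extractLinks(html)); // [ 'http://example.com/a', '/b' ]
```

A real extractor would also resolve relative urls like `/b` against the page's base url before queuing them.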
- To handle the DOM in the browser, assign a function to `domCallback`. Its return value will be attached to `response.domResult`:
```js
domCallback: function () {
  // This function will be transformed into a string and passed to
  // another window to be evaluated. So it is a SPECIAL function whose
  // enclosing scope will not be available there.
  return document.body.innerHTML;
}
```
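Because `domCallback` travels between windows as source text, anything it closes over is lost on the way. A minimal sketch of that serialization round-trip in plain Node (no bolero required; the mechanism shown, `toString()` plus `eval`, is an assumption about how the hand-off works):

```javascript
var limit = 5; // captured variable -- would NOT survive serialization

function domCallback() {
  // OK: uses only its own locals (and, in a page, globals like document)
  var text = 'hello';
  return text.toUpperCase();
}

// What the crawler effectively does on the other side:
// re-create the function from its source and call it in a fresh scope.
var source = '(' + domCallback.toString() + ')()';
var result = eval(source);
console.log(result); // 'HELLO'
```

This is why the callback must be written as if it had no enclosing scope: only its own locals and the target page's globals exist when it runs.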
- To run it in Node with no browser involved, just `require` the constructor. Use it as above, except that `domCallback` is not available:
```js
var Crawler = require('bolero');
var crawler = new Crawler(/* ... */);
```
- To run it in Node and fetch data through a browser (`domCallback` is available):
```js
var Crawler = require('bolero');

// Node2browser will open Chrome to run a browser crawler in the page
// 'http://localhost:9998'. For now, it supports Chrome alone. But you
// could open another browser with tentacle.user.js installed and visit
// 'http://localhost:9998' to continue it manually.
var crawler = new Crawler({
  name: 'node2browser-adapter'
  // ...
});
```
## Licence
MIT