wscraper
wscraper.js is a web scraper agent written in node.js, based on cheerio.js, a fast, flexible, and lean implementation of core jQuery. It is built on top of request.js and inspired by http-agent.js.
Usage
There are two ways to use wscraper: HTTP agent mode and local mode.
HTTP Agent mode
In HTTP Agent mode, pass it a host, a list of URLs to visit and a scraping JS script. For each URL, the agent makes a request, gets the response, runs the scraping script and returns the result of the scraping. Valid usage is:
```js
// scrape a single page from a web site
var agent = wscraper.createAgent();
agent.start('google.com', '/finance', script);

// scrape multiple pages from a website
wscraper.createAgent().start('google.com', ['/', '/finance', '/news'], script);
```
The URLs should be passed as an array of strings. If only one page needs to be scraped, the URL can be passed as a single string. Null or empty URLs are treated as the root '/'. Suppose you want to scrape the stock prices of the following companies from the http://google.com/finance website: Apple, Cisco and Microsoft.
```js
// load node.js libraries
var util = require('util');
var wscraper = require('wscraper');
var fs = require('fs');

// load the scraping script from a file
var script = fs.readFileSync('./googlefinance.js', 'utf8');

var companies = ['/finance?q=apple', '/finance?q=cisco', '/finance?q=microsoft'];

// create a web scraper agent instance
var agent = wscraper.createAgent();

// register event listeners (the usual EventEmitter pattern)
agent.on('start', function (url) { util.log('started ' + url); });
agent.on('result', function (result) { util.log(util.inspect(result)); });
agent.on('error', function (err) { util.log('error: ' + err); });
agent.on('end', function () { util.log('done'); });

// run the web scraper agent
agent.start('www.google.com', companies, script);
```
The scraping script should be pure client-side JavaScript, using jQuery selectors (see cheerio.js for details). It should return a valid JavaScript object. The scraping script is passed as a string and is usually read from a file. You can scrape different websites without changing a line of the main code: just write different JavaScript scripts. The scraping script is executed in a sandbox, using a separate VM context, so script errors are caught without crashing the main code.
At the time of writing, the google.com/finance website reports financial data of public companies as in the following (simplified) HTML snippet:

```html
... <span class="pr">656.06</span> ...
```
By using jQuery selectors, we design the scraping script "googlefinance.js" to find the current value of a company's stock and return it as text:
```js
/* googlefinance.js
   $      -> is the DOM document to be parsed
   result -> is the object containing the result of parsing */

result = {};
var price = $('.pr').text();
result.price = price; // result.price is '656.06'
```
Local mode
Sometimes you need to scrape local HTML files without making a request to a remote server. wscraper can be used as an inline scraper: it takes an HTML string and a JS scraping script, runs the script, and returns the result of the scraping. Valid usage is:
```js
var scraper = wscraper.createScraper();
scraper.scrape(html, script);
```
As a trivial example, suppose you want to replace the class name of every image in an HTML page.
```js
// load node.js libraries
var util = require('util');
var fs = require('fs');
var wscraper = require('wscraper');

// load your html page
var html = fs.readFileSync('./page.html', 'utf8');

// load the scraping script from a file
var script = fs.readFileSync('./replace.js', 'utf8');

// create the scraper
var scraper = wscraper.createScraper();

// register event listeners (the usual EventEmitter pattern)
scraper.on('result', function (result) { util.log(util.inspect(result)); });
scraper.on('error', function (err) { util.log('error: ' + err); });

// run the scraper
scraper.scrape(html, script);
```
By using jQuery selectors, we design the scraping script "replace.js" to find the images and replace their class name:
```js
/* replace.js
   $      -> is the DOM document to be parsed
   result -> is the final JSON string containing the result of parsing
   use var jsObj = JSON.parse(result) to get a js object from the JSON string
   use JSON.stringify(jsObj) to get back a JSON string from the js object */

result = {};
var imgs = $('img');
imgs.attr('class', 'new-class');
result.replaced = $.html() || '';
```
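As the comment in replace.js notes, in local mode the result arrives as a JSON string rather than a JavaScript object. The round trip looks like this (the string value and variable names are just illustrative):

```javascript
// result, in local mode, is a JSON string such as:
var result = '{"replaced":"<img class=\\"new-class\\">"}';

var jsObj = JSON.parse(result);         // JSON string -> js object
var backToJson = JSON.stringify(jsObj); // js object -> JSON string
// jsObj.replaced is '<img class="new-class">'
```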
Happy scraping!