
Webpage Scraper

I created this project because I've often run into situations where being able to quickly and easily iterate through the pages of a website and pull out specific pieces of information is useful. I've written nearly identical tools in almost every language I've known, simply because each time I needed one, I was most proficient in a different language. It can also be a fun programming exercise.

Installing

npm install webpage-scraper

Usage

The base implementation of the scraper doesn't do anything with the documents it downloads.

To get it doing something useful, you'll have to either subclass the base scraper or override an instance's getContent method with a way of extracting whatever content you want from the document.

Example

This snippet pulls in this README from GitHub and writes it to /tmp/readme.html.

var WebpageScraper = require("webpage-scraper");

// Downloaded files are written under the directory passed to the constructor.
var scraper = new WebpageScraper("/tmp/");

// getContent receives each downloaded document; hand the output filename
// and the extracted content to next, using the standard error-first callback.
scraper.getContent = function(url, document, next) {
    var readme = document.getElementById("readme");
    if (!readme) {
        return next(new Error("No #readme element found at " + url));
    }
    return next(null, "readme.html", readme.toString());
};

scraper.runIndefinitely("https://github.com/Eagerod/webpage-scraper", function(err) {
    if (err) {
        throw err;
    }
});
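
If you'd rather subclass than patch an instance, something like the following should work. This is just a sketch: it assumes the constructor takes the output directory as shown above and that the document object exposes the usual DOM lookups; the TitleScraper name and the URL-derived filename are made up for the example.

var util = require("util");
var WebpageScraper = require("webpage-scraper");

// Hypothetical subclass that saves each page's <title> text, using the
// same getContent(url, document, next) hook as the example above.
function TitleScraper(outputDir) {
    WebpageScraper.call(this, outputDir);
}
util.inherits(TitleScraper, WebpageScraper);

TitleScraper.prototype.getContent = function(url, document, next) {
    var titles = document.getElementsByTagName("title");
    // Derive a filesystem-safe filename from the URL.
    var filename = encodeURIComponent(url) + ".txt";
    return next(null, filename, titles.length ? titles[0].toString() : "");
};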

Features

The scraper provides both a serial and a concurrent method of loading content. The concurrent method isn't as reliable as I'd like it to be yet, but it's reasonably functional.

It currently uses an extremely simple exponential backoff to slow down concurrent requests once they pass the limit you provide, but it would be nice if it used a proper queuing mechanism.
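
To make that concrete, here's a generic sketch of that kind of exponential backoff. It illustrates the technique rather than this package's actual internals; the limit, inFlight, and delay values are invented for the example.

var https = require("https");

// Hypothetical concurrency limit and in-flight counter for the sketch.
var limit = 4;
var inFlight = 0;
var baseDelayMs = 100;

// Plain HTTPS download used as the unit of work.
function download(url, callback) {
    https.get(url, function(res) {
        var body = "";
        res.on("data", function(chunk) { body += chunk; });
        res.on("end", function() { callback(null, body); });
    }).on("error", callback);
}

// If we're over the limit, wait base * 2^attempt ms and retry; the wait
// doubles each consecutive time no slot is free.
function requestWithBackoff(url, attempt, callback) {
    if (inFlight >= limit) {
        var delay = baseDelayMs * Math.pow(2, attempt);
        return setTimeout(function() {
            requestWithBackoff(url, attempt + 1, callback);
        }, delay);
    }
    inFlight++;
    download(url, function(err, body) {
        inFlight--;
        callback(err, body);
    });
}

A proper queue would instead hold pending URLs and start the next one as soon as a request finishes, rather than polling with growing delays.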
