web-tree-crawl

1.1.4 • Public • Published

Note to English speakers

Many comments, issues, etc. are partially written in German. If you want something translated, create an issue and I'll take care of it.

Introduction

web-tree-crawl is available on npm and on GitLab.

Idea

The crawling process is tree-shaped: you start with a single URL (the root), download a document (a node) and discover new URLs (child nodes), which in turn will be downloaded. So every crawled document is a node in the tree and every URL is an edge. The tree spans only new edges; edges to already-known URLs are stored, but not processed.

The end result is a tree representing the crawl process. All discovered information is stored in this tree.

The main difference between crawlers is which URLs and which data get scraped from discovered documents. So these two scrapers need to be supplied by the user, while the library web-tree-crawl takes care of everything else.

Example

note: everything here is ES6 (ECMAScript 2015)

So let's say you want the last couple of comics from xkcd.com. All you have to do is:

"use strict";
const crawler = require('web-tree-crawl');

// configure your crawler
let ts = new crawler("https://xkcd.com");
ts.config.maxRequests = 5;
ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;
ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');

// execute!
ts.buildTree(function (root) {
    // print discovered data to stdout
    console.log(JSON.stringify(crawler.builtin.treeHelper.getDataAsFlatArray(root), null, "\t"));
});

For more examples see: https://gitlab.com/wotanii/web-tree-crawl/tree/master/examples

Details/Documentation

Use web-tree-crawl like this:

  1. create the object & set the initial URL
  2. modify the config-object
  3. call buildTree & wait for the callback

Config

You will always want to define these config attributes:

  • maxRequests: how many documents may be crawled?
  • dataScraper: what data do you want to find?
  • urlScraper: how does the crawler look for new URLs?

There are more, but their defaults work well on most websites and are pretty much self-explanatory (if not, let me know by opening an issue).

Url Scraper

These are functions that scrape URLs from a document. The crawler applies this function to every crawled document to discover new documents.

Create your own URL scraper or use a builtin one. All URL scrapers must have this signature:

  • parameters
    1. string: content of current document
    2. string: url of current document
  • returns
    1. string[]: discovered urls
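A custom URL scraper matching this signature might look like the following sketch. It uses a simple regex to pull `href` attributes out of the document; the function name is hypothetical, and a real scraper might prefer a proper HTML parser:

```javascript
// A URL scraper: receives the document body and the document's URL,
// returns an array of discovered URLs (hypothetical example).
function hrefScraper(content, documentUrl) {
    const urls = [];
    const hrefPattern = /href="([^"]+)"/g;
    let match;
    while ((match = hrefPattern.exec(content)) !== null) {
        // Resolve relative links against the current document's URL.
        urls.push(new URL(match[1], documentUrl).href);
    }
    return urls;
}
```

Then plug it in with `ts.config.urlScraper = hrefScraper;`.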

Data Scraper

These are functions that scrape data from a document. The crawler applies this function to every crawled document to decide what data to store for that document.

The crawler will not use this data in any way, so you can return whatever you want.

Create your own data scraper or use a builtin one. All data scrapers must have this signature:

  • parameters
    1. string: content of current document
    2. string: current node
  • returns
    1. anything
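For instance, a minimal custom data scraper could store just the document title. This is a hypothetical sketch (the function name is made up; the second parameter is accepted but unused here):

```javascript
// A data scraper: receives the document body and the current node,
// returns whatever should be stored on that node.
// Here: the document's <title> text, or null if none was found.
function titleScraper(content, node) {
    const match = /<title>([^<]*)<\/title>/i.exec(content);
    return match ? match[1].trim() : null;
}
```

Then plug it in with `ts.config.dataScraper = titleScraper;`.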

Builtin

There are some static builtin functions that you don't have to use, but they will make your life easier. Some of these functions can be used directly, and some are factories that return such functions.

Url Scraper

These are functions that scrape for URLs in the usual manner. Use them by putting them in your config like this:

ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');

Data Scraper

These are functions that scrape for data in the usual manner. Use them by putting them in your config like this:

ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;

Tree Helper

These are functions that help extract information from the result tree. Use them once buildTree has finished.

They either modify your tree (e.g. crawler.builtin.treeHelper.addParentsToNodes) or extract data from it (e.g. crawler.builtin.treeHelper.getDataAsFlatArray).
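If the builtin helpers don't fit your use case, you can walk the tree yourself. The sketch below assumes each node has a `data` property and a `children` array; the actual node shape is defined by web-tree-crawl, so check your result tree (or use getDataAsFlatArray) before relying on these names:

```javascript
// Recursively collect the `data` field of every node, depth-first.
// Assumes nodes look like { data: ..., children: [...] } (an assumption,
// not the library's documented node shape).
function collectData(node, out = []) {
    out.push(node.data);
    for (const child of node.children || []) {
        collectData(child, out);
    }
    return out;
}
```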

Dev-Setup

sudo apt install npm nodejs

git clone git@gitlab.com:wotanii/web-tree-crawl.git
cd web-tree-crawl/
npm install

npm test

If tests fail with your setup, either create an issue or comment on an existing one.

Install

npm i web-tree-crawl
