
Schabbi Webscraper



A lightweight and easy-to-use web crawler.

Features

  • Fast and reliable
  • Supports custom page handling
  • Results also contain all cookies
  • Accepts all Puppeteer parameters

Requirements

  • Node.js v15.*

Installation

via npm

$ npm i schabbi-webscraper

via GitHub

$ git clone https://github.com/PatrickSchababerle/schabbi-webscraper
$ cd schabbi-webscraper
$ npm install

Usage

Standard use case

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl();
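Note that crawl() returns a promise (see Working with the result below), so you can chain .then() or await it if you need the crawl result.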

With custom option parameters

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
    includeExternalLinks: true,
    userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    authentication: {
        username: 'Testuser',
        password: 'Test'
    }
}).crawl();

You can decide which crawled links are added to the queue by using the queue option, e.g. to crawl only pages matching a specific attribute, class, or target:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
    queue: {
        pattern: 'a[href*="/2021/05/06"]'
    }
}).crawl();
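The pattern is a CSS selector matched against the links found on each page; here it matches only anchors whose href contains /2021/05/06, so only those links are queued.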

You can also decide whether URL parameters are ignored when adding URLs to the queue:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
    ignoreUrlParameter: true
}).crawl();
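With this option enabled, URLs that differ only in their query string (for example /page?id=1 and /page?id=2) are queued as a single URL.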

Work with crawled pages while they're being processed

Custom functions let you perform actions on each crawled page. Their return values are pushed into the final result.

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://digitalsterne.de').eachPage(async (page) => {
    const links = await page.$$eval('a', as => as.map(a => a.href));
    return links;
}).crawl().then((result) => {
    console.log(result);
});
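The page handle exposes the usual Puppeteer evaluation helpers. As a further sketch (assuming page is a Puppeteer Page object, which the $$eval call above suggests), you could collect each page's title instead:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://digitalsterne.de').eachPage(async (page) => {
    // page.title() is a standard Puppeteer Page method
    return await page.title();
}).crawl().then((result) => {
    console.log(result);
});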

Working with the result

Schabbi returns a promise that resolves as soon as the crawl has finished:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl().then((result) => {
    console.log(result);
});
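The exact shape of the result depends on your options and any eachPage callback. If it is a plain object or array, a generic dump is a handy way to inspect it (a sketch, not part of the documented API):

Crawler.setUrl('https://www.example.com').crawl().then((result) => {
    // Pretty-print the full result structure for inspection
    console.log(JSON.stringify(result, null, 2));
});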

Methods

Method | Description
withOptions( Object ) | Set custom options for the crawler
setUrl( String ) | Set the initial URL
eachPage( Function ) | Register a callback executed on each crawled page (see examples above)
crawl() | Start crawling; returns a promise resolving with the result
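Since every method returns the crawler instance, the calls can be chained. A sketch combining the methods above (that withOptions and eachPage may be combined in one chain is assumed from the separate examples):

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler
    .setUrl('https://www.example.com')
    .withOptions({ includeExternalLinks: true })
    .eachPage(async (page) => page.url()) // page.url() is standard Puppeteer
    .crawl()
    .then((result) => console.log(result));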

Configuration

Option | Description | Type
includeExternalLinks | Output external links in the results | BOOLEAN
userAgent | Custom user agent string used for crawling | STRING
authentication | Credentials ({ username, password }) for authenticated crawling (see example above) | OBJECT
ignoreUrlParameter | Ignore URL parameters when adding URLs to the queue | BOOLEAN
browser | Settings for Puppeteer; all Puppeteer browser launch arguments are accepted | OBJECT
queue | Custom pattern for evaluating links inside crawled pages | OBJECT
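For instance, the browser option lets you forward Puppeteer launch arguments. A minimal sketch, assuming the object is handed to Puppeteer's launch call as the table states (headless and args are standard Puppeteer launch options):

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
    browser: {
        headless: true,        // standard Puppeteer launch option
        args: ['--no-sandbox'] // forwarded to the Chromium process
    }
}).crawl();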

Visit the examples for detailed information on how to use options properly.

About this project

This is one of my first projects on GitHub to be available for all of you out there. Please feel free to provide feedback!
