
Google Crawler

This project is an effort to turn a publicly available paste into an NPM package.

It's an Express middleware that serves raw HTML to Google's crawler according to Google's AJAX crawling specification.

It allows indexing JavaScript-heavy applications (SPAs) by providing an HTML rendering of pages when they are requested with the special _escaped_fragment_ parameter.

It relies on a PhantomJS backend to run the frontend's JavaScript.
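
In Google's AJAX crawling scheme, the crawler rewrites a hash-bang URL into a query parameter that the server can detect. Here is a minimal, simplified sketch of that detection, not the package's actual implementation; the URLs and the response body are placeholders:

var express = require('express');
var app = express();

// User-facing URL:   http://example.com/#!/products/42
// Crawler requests:  http://example.com/?_escaped_fragment_=/products/42
app.use(function (req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) return next(); // regular visitor, carry on
  // Rebuild the hash-bang URL and serve a pre-rendered HTML snapshot,
  // which google-crawler obtains from the PhantomJS backend.
  var snapshotUrl = req.path + '#!' + fragment;
  res.send('<!-- rendered HTML for ' + snapshotUrl + ' would go here -->');
});

app.listen(3000);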

Installation

This module is available through NPM:

npm install --save google-crawler

Usage

var express = require('express');
var google_crawler = require('google-crawler');

var server = express();

server.use(google_crawler({
  scraper: 'http://scraper.example.com/img/'
}));

// Continue setting things up..

On your frontend, you'll want to include the following element so that Google's crawler knows to request pages with the _escaped_fragment_ parameter:

<meta name="fragment" content="!">

Configuration

The middleware accepts the following parameters:

  • shebang: a boolean determining whether or not to build URLs with a shebang (#!).
  • scraper: a URL pointing to the PhantomJS backend.
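
For illustration, both options might be passed together like this (the scraper URL is a placeholder, and shebang: true is just one possible choice):

server.use(google_crawler({
  shebang: true,                         // build hash-bang (#!) URLs
  scraper: 'http://scraper.example.com/' // your PhantomJS backend
}));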

Sample backend

PhantomJS backends are expected to be built with phantom-crawler.

Here's a sample crawler:

// Load the phantom-crawler library.
phantom.injectJs('crawler/crawler.js');

new Crawler()
  .chrome()
  .debug()
  .crawl(function () {

    // Return a full HTML snapshot of the rendered page.
    return [
      '<!DOCTYPE html>',
      '<html>',
        document.head.outerHTML,
        document.body.outerHTML,
      '</html>'
    ].join('\n');

  })
  // Listen on the port from the PORT environment variable, defaulting to 8888.
  .serve(require('system').env.PORT || 8888);
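
Assuming the sample above is saved as server.js (the filename is arbitrary) next to a crawler/ directory containing phantom-crawler, it can then be started directly with PhantomJS, picking up the port from the environment:

PORT=8888 phantomjs server.js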
