the-scraping-machine

The scraping machine

Under development - More news soon

This is just the beginning of a long journey

Let's make web scraping fun again!

From a JSON config file, you can generate a web scraping script and see its output.

Concept

  1. Define what you need in a JSON file, like demo.json
  2. Then run node index demo.json to start the process in index.js
    • First it validates the arguments and data
    • Then it decides which language to use. For now only Python 3+ (Beautiful Soup) and Node.js (X-ray) are supported
    • Then it renders all the info into a Handlebars template, like templates/python.hbs or templates/node.hbs
  3. The script file is generated, like google.py or google.js
  4. Node runs the script as a child process, which generates the final output, like google.json
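
The steps above can be sketched roughly as follows. This is only an illustration of the flow, not the code in index.js: the helper names validateConfig and planRun are hypothetical, and the validation rules are an assumption based on the fields that appear in demo.json.

```javascript
// Hypothetical lookup of the template and script extension for each language.
const TARGETS = new Map([
  ['python', { template: 'templates/python.hbs', extension: 'py' }],
  ['node', { template: 'templates/node.hbs', extension: 'js' }]
]);

// Step 2, first bullet: minimal validation (assumed rules, see lead-in).
function validateConfig(config) {
  if (!config || typeof config !== 'object') {
    throw new TypeError('config must be an object');
  }
  for (const key of ['url', 'file_name']) {
    if (typeof config[key] !== 'string' || config[key] === '') {
      throw new TypeError(`config.${key} must be a non-empty string`);
    }
  }
  if (!Array.isArray(config.data) || config.data.length === 0) {
    throw new TypeError('config.data must be a non-empty array');
  }
  return config;
}

// Steps 2 to 4: work out which template, script file and output file apply.
function planRun(config, language) {
  validateConfig(config);
  const target = TARGETS.get(language);
  if (!target) {
    throw new RangeError(`unsupported language: ${language}`);
  }
  return {
    template: target.template,
    script: `${config.file_name}.${target.extension}`,
    output: `${config.file_name}.json`
  };
}

const demo = {
  source_type: 'url',
  url: 'http://google.es',
  file_name: 'google',
  data: [{ name: 'web-title', type: 'selector', query: 'title' }]
};

// Logs the plan for a Python run: python.hbs, google.py, google.json.
console.log(planRun(demo, 'python'));
```

Rendering the template and spawning the child process are left out, so the sketch runs without any dependencies.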

Demo

Inside demo.json:

{
    "source_type": "url",
    "url": "http://google.es",
    "file_name": "google",
    "data": [
        {
            "name": "web-title",
            "type": "selector",
            "query": "title"
        }, {
            "name": "web2",
            "type": "selector",
            "query": "title"
        }
    ]
}
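
As a rough illustration of how the data entries relate to the final output, each entry can be read as a name/query pair. The helper toSelectorMap below is hypothetical and not part of the package. It collapses the entries into a plain object, which is also the kind of object selector that X-ray accepts. Rejecting any type other than "selector" is an assumption.

```javascript
// Hypothetical helper: turn the "data" entries into { name: query }.
function toSelectorMap(data) {
  const selectors = {};
  for (const entry of data) {
    // Assumption: "selector" is the only type handled in this sketch.
    if (entry.type !== 'selector') {
      throw new Error(`unsupported type: ${entry.type}`);
    }
    selectors[entry.name] = entry.query;
  }
  return selectors;
}

const data = [
  { name: 'web-title', type: 'selector', query: 'title' },
  { name: 'web2', type: 'selector', query: 'title' }
];

// Both keys map to the "title" selector.
console.log(toSelectorMap(data));
```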

Start the machine

  • For Python script output:
    node index.js demo.json 
    node index.js demo.json python
  • For Node script output:
    node index.js demo.json js
    node index.js demo.json node
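
The argument handling implied by these commands can be sketched as below. The function name resolveLanguage is hypothetical, and throwing on an unknown value is an assumption; the README only shows the four commands above.

```javascript
// Accepted values for the optional second argument, taken from the commands above.
const LANGUAGE_ALIASES = new Map([
  ['python', 'python'],
  ['js', 'node'],
  ['node', 'node']
]);

// Hypothetical helper: map the CLI argument to a target language.
function resolveLanguage(arg) {
  // No argument: Python, as in "node index.js demo.json".
  if (arg === undefined) {
    return 'python';
  }
  const language = LANGUAGE_ALIASES.get(arg);
  if (!language) {
    // Assumption: unknown values are rejected rather than ignored.
    throw new RangeError(`unsupported language argument: ${arg}`);
  }
  return language;
}

// In "node index.js demo.json js" the argument would be process.argv[3].
console.log(resolveLanguage('js'));
```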

Output

[
    {
        "web-title": "Google",
        "web2": "Google"
    }
]

Testing

You can test your changes...

npm test

Future Implementations

  • Support for Node.js (X-ray).
  • Support for CSS3 Selectors.
  • Support for recursive queries.
  • Support for "follow links", like a crawler.
  • Implementation as CLI
  • Basic Testing
  • ESLint Support
  • JSDoc Support
  • Basic Gulp Tasks
  • Example Folder

Achievements

v.0.0.3

Features:

  • Added JSDoc support
  • Added Gulp Tasks
  • Added Basic Testing with Mocha, Chai and Istanbul
  • Added .editorconfig
  • Added ESLint support
  • Added example folder
  • Added support for Node.js

Notes: Main target: Improved Proof of concept

v.0.0.2

Features:

  • Roadmap added
  • Added file structure
  • Defined a minimal JSON structure
  • Added minimal validation
  • Added a template engine
  • Added support for python
  • Added dynamic information from the setup config file

Notes: Main target: Proof of concept

v.0.0.1

Features:

Notes: Just a "Hello world"

Install

npm i the-scraping-machine

Version

0.0.3

License

GPL-3.0

Collaborators

  • ulisesgascon