clidom

0.1.5 • Public • Published

Parse a DOM from the command line

Ever wanted to do a one-off scrape of a web page? Tired of writing scraping boilerplate for applications and wish you had a nice UNIX-style command to call to do it for you? Enter clidom!

Installation

npm install -g clidom

Usage

clidom selector [URL] [options]

Selector specification:

selector[::subselector]

clidom extends selector syntax to allow you to not only select elements, but mutate the elements as needed. clidom provides the following syntax for subselectors:

  • innerHtml - Returns the element's inner HTML (default)
  • outerHtml - Returns the element's outer HTML
  • text - Returns the text inside the element, stripping tags
  • [attribute] - Returns the value of the specified attribute

Options

-o, --out-file  File to write JSON output              [default: '-' (stdout)]
-p, --pretty    Pretty JSON output                             [default: true]
-t, --trim      Trim empty results                             [default: true]
-h, --help      Show help

Examples

Output a pretty JSON object of button labels on http://www.google.com:

clidom input[type="submit"]::[value] https://www.google.com

Output:

{
  "input[type=submit]::[value]": [
    {
      "value": "Google Search"
    },
    {
      "value": "I'm Feeling Lucky"
    }
  ]
}

Output a pretty JSON object of Twitter usernames talking about node.js:

clidom 'span.username b' 'https://twitter.com/search?f=realtime&q=node.js'

Output: (will vary over time)

{
  "span.username b": [
    "hashedrock",
    "mashupaward",
    "orangesuzuki",
    "RJ_Hsiao",
    "mongodbExpert",
    "JanilsonPy",
    "npm_tweets",
    "nodenpm",
    "npm_tweets",
    "StrongLoop",
    "DevelopersDojo",
    "adstweetbot",
    "Johnny_Rehab",
    "jramonleon",
    "questionjs",
    "AsadNomanMS",
    "amit_intelli",
    "rekkuuzadx",
    "npm_tweets",
    "adstweetbot"
  ]
}

You can of course use this with any other UNIX application, for instance if we wanted to remove all the wrapping JSON, we could do this:

clidom 'span.username b' 'https://twitter.com/search?f=realtime&q=node.js' | jq '.[] | .[]' | tr -d \"

We'd simply get the strings of the users, e.g.:

hashedrock
mashupaward
orangesuzuki
RJ_Hsiao
mongodbExpert
JanilsonPy
npm_tweets
nodenpm
npm_tweets
StrongLoop
DevelopersDojo
adstweetbot
Johnny_Rehab
jramonleon
questionjs
AsadNomanMS
amit_intelli
rekkuuzadx
npm_tweets
adstweetbot

Readme

Keywords

none

Package Sidebar

Install

npm i clidom

Weekly Downloads

5

Version

0.1.5

License

MIT

Last publish

Collaborators

  • cha0s