Overview

Webche is a Node.js library that scrapes basic details of a URL using the metascraper and node-unfluff libraries.

Getting Started

Install

npm i webche --save or yarn add webche

Usage

import { scrape } from 'webche';
// or
const { scrape } = require('webche');

// define a url
const url = "http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion";
// the second argument is the lazy flag; await needs an async function or an ES module
const { metascraper, unfluff } = await scrape(url, true);
 
console.log(metascraper)
 
{
    audio: null,
    author: 'Sarah Jeong',
    logo:
    'https://images.vice.com/motherboard/content-images/article/no-id/1464294050922700.jpg',
    publisher: 'VICE',
    date: '2016-05-26T20:14:00.000Z',
    description:
    'The ruling that Google’s use of APIs is a fair use will have wide-ranging ramifications for the rest of the tech industry.',
    image:
    'https://images.vice.com/motherboard/content-images/article/no-id/1464294050922700.jpg?crop=0.847111111111111xw:1xh;center,center&resize=1200:*',
    lang: 'en',
    title: 'Google Wins Trial Against Oracle, Saves $9 Billion',
    url:
    'https://www.vice.com/en_us/article/kb77gv/google-wins-trial-against-oracle-saves-9-billion',
    video: null
}
 
// The text extraction algorithm can be slow on large documents. If you only need
// elements such as title or image, set lazy to true (the default) to get them
// quickly without running the full processing pipeline. The lazy extractor returns
// an object just like the regular extractor, except that every field is a function
// and evaluation happens only when you call that function.
 
// with lazy = true
console.log(unfluff.title())
 
// with lazy = false
console.log(unfluff.title)
 
console.log(unfluff)
// lazy=false
{
    title: 'Google Wins Trial Against Oracle, Saves $9 Billion',
    softTitle: 'Google Wins Trial Against Oracle, Saves $9 Billion',
    date: '2016-05-26T20:14:00Z',
    author: [ 'Sarah Jeong' ],
    publisher: 'Vice',
    copyright: '2017 VICE Media LLC"',
    favicon:
    'https://vice-web-statics-cdn.vice.com/favicons/vice/favicon.ico',
    description:
    'The ruling that Google\'s use of APIs is a fair use will have wide-ranging ramifications for the rest of the tech industry.',
    keywords:
    'culture, news, lgbtq, politics, journalism, video, documentary, sex, drugs, film, tv, entertainment, travel, crime,tech,Motherboard,copyright,Oracle,motherboard show,fair use,APIs,oracle v. google',
    lang: 'en',
    canonicalLink:
    'https://www.vice.com/en_us/article/kb77gv/google-wins-trial-against-oracle-saves-9-billion',
    tags:
    [ 'Horoscopes',
    'tech',
    'Motherboard',
    'copyright',
    'Oracle',
    'motherboard show',
    'fair use',
    'APIs',
    'oracle v. google' ],
    image:
    'https://images.vice.com/motherboard/content-images/article/no-id/1464294050922700.jpg?crop=0.847111111111111xw:1xh;center,center&resize=1200:*',
    videos: [],
    links: [ [Object], [Object], [Object], [Object] ],
    text:
    'Google just won in Oracle v. Google, a $9 billion case over Android code. At 1:00 PM PDT, a jury of ten people delivered a verdict in favor of Google.\n\nThe lawsuit was first filed in 2010. There was already a trial in 2012, but after an appeal to the Federal Circuit, the parties underwent a second trial over copyrighted code.\n\nAt the end of their third day of deliberation, the jury found that Google\'s use of the declaring code and the structure, sequence, and organization of the Java APIs in the Android code was a fair use.\n\nAfter the verdict was read aloud, Judge William Alsup thanked the jury for their service, noting that the jurors—who often came to court even earlier than the set start time of 7:45 AM, and lingered after hours to pore over their notes—had been "attentive" and "worked hard."\n\n"I salute you for your extreme hard work in this case," he told the jury, which on Tuesday, he had called "the best jury this courthouse has ever seen."\n\nOnce the jury was dismissed, Alsup said, "I know there will be appeals and the like." Oracle is expected to appeal the decision, meaning that this already six-year-long litigation will drag out even longer.\n\nBut still, lawyers for Google were wreathed in smiles after their big victory, laughing and hugging each other as Oracle lawyers huddled grimly on the other side of the courtroom.\n\nThe tech industry has had its eyes on this case, since Google\'s alleged infringement—a clean room reimplementation of the APIs—is a widespread industry practice. Many commentators—and if the tortured analogies that came up at trial are any indication, the lawyers themselves—feared that the jury would not understand the technical issues at the heart of the case. We don\'t know if the jury understood APIs, but the verdict is in: Google\'s use is a fair use.'
}
 
// lazy=true
{
    title: [Function: title],
    softTitle: [Function: softTitle],
    date: [Function: date],
    copyright: [Function: copyright],
    author: [Function: author],
    publisher: [Function: publisher],
    favicon: [Function: favicon],
    description: [Function: description],
    keywords: [Function: keywords],
    lang: [Function: lang],
    canonicalLink: [Function: canonicalLink],
    tags: [Function: tags],
    image: [Function: image],
    videos: [Function: videos],
    text: [Function: text],
    links: [Function: links]
}
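Because the unfluff result holds functions in lazy mode and plain values otherwise, calling code has to know which mode was used. A minimal sketch of one way to hide that difference follows. The helper name readField is hypothetical and not part of webche, and the two sample objects are stand-ins shaped like the outputs above rather than real scrape results.

```javascript
// Hypothetical helper (not part of webche): read a field from an unfluff
// result regardless of whether it was produced in lazy or non-lazy mode.
function readField(result, name) {
  const value = result[name];
  // Lazy results expose each field as a function; non-lazy results hold the value.
  return typeof value === 'function' ? value.call(result) : value;
}

// Stand-in objects shaped like the two outputs shown above.
const eagerResult = { title: 'Google Wins Trial Against Oracle, Saves $9 Billion' };
const lazyResult = { title: () => 'Google Wins Trial Against Oracle, Saves $9 Billion' };

console.log(readField(eagerResult, 'title'));
console.log(readField(lazyResult, 'title'));
```

Both calls print the same title, so the rest of the code does not need to branch on the lazy flag.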

Extracted data elements

This is what unfluff will try to grab from a web page:

title - The document's title (from the <title> tag)
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by Facebook, etc.)
videos - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)

This is what metascraper will try to grab from a web page:

audio - e.g. https://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3. An audio URL that best represents the article.
author - e.g. Noah Kulwin. A human-readable representation of the author's name.
clearbit - metascraper integration with the Clearbit Logo API.
date - e.g. 2016-05-27T00:00:00.000Z. An ISO 8601 representation of the date the article was published.
description - e.g. Venture capitalists are raising money at the fastest rate... The publisher's chosen description of the article.
image - e.g. https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg. An image URL that best represents the article.
lang - e.g. en. An ISO 639-1 representation of the url content language.
logo - e.g. https://entrepreneur.com/favicon180x180.png. An image URL that best represents the publisher brand.
publisher - e.g. Fast Company. A human-readable representation of the publisher's name.
readability - A Readability connector for metascraper.
title - e.g. Meet Wall Street's New A.I. Sheriffs. The publisher's chosen title of the article.
url - e.g. http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion. The URL of the article.
video - e.g. https://assets.entrepreneur.com/content/preview.mp4. A video URL that best represents the article.
youtube - metascraper integration with YouTube.
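The two result objects overlap but differ in shape: in the sample output, unfluff returns author as an array and names the page URL canonicalLink, while metascraper returns a single author string and a url field. The sketch below shows one possible way to fold both into a flat record, preferring metascraper and falling back to unfluff. The mergeResults helper is hypothetical and not provided by webche, and the sample inputs are trimmed stand-ins built from the values shown in the Usage section.

```javascript
// Hypothetical helper (not part of webche): build one flat record from the
// two result objects, preferring metascraper and falling back to unfluff.
function mergeResults(metascraper, unfluff) {
  // Resolve an unfluff field in either lazy (function) or non-lazy (value) mode.
  const fromUnfluff = (name) => {
    const value = unfluff[name];
    return typeof value === 'function' ? value.call(unfluff) : value;
  };
  // unfluff reports authors as an array; normalise so the first entry can be read.
  const authors = [].concat(fromUnfluff('author') || []);
  return {
    title: metascraper.title || fromUnfluff('title') || null,
    author: metascraper.author || authors[0] || null,
    publisher: metascraper.publisher || fromUnfluff('publisher') || null,
    date: metascraper.date || fromUnfluff('date') || null,
    description: metascraper.description || fromUnfluff('description') || null,
    image: metascraper.image || fromUnfluff('image') || null,
    lang: metascraper.lang || fromUnfluff('lang') || null,
    url: metascraper.url || fromUnfluff('canonicalLink') || null,
  };
}

// Trimmed stand-ins for the objects returned by scrape().
const sampleMetascraper = { title: null, author: null, publisher: 'VICE', lang: 'en' };
const sampleUnfluff = {
  title: 'Google Wins Trial Against Oracle, Saves $9 Billion',
  author: ['Sarah Jeong'],
  canonicalLink:
    'https://www.vice.com/en_us/article/kb77gv/google-wins-trial-against-oracle-saves-9-billion',
};

const merged = mergeResults(sampleMetascraper, sampleUnfluff);
console.log(merged.title, '|', merged.author, '|', merged.publisher);
```

Fields that neither source provides come back as null, which keeps the record shape stable for downstream code.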

Todo

Contributors

Rakesh Paul - Xtrios

License

This project is licensed under the MIT License.
