dirty-html-content-parser

0.0.10 • Public • Published

nodejs-dirty-html-content-parser

Module for parsing content from dirty HTML.

It uses diffs to extract content fragments from HTML documents. First, you register a reference HTML document together with string position markers that define the different types of content. The module then uses this reference to find the same type of content in other HTML documents by brute-forcing for the smallest diff.

Since the module only uses string diffs, this method also works on dirty, invalid HTML.
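To make the "smallest diff" idea concrete, here is a minimal sketch, not the module's actual implementation. It assumes the Levenshtein edit distance as a stand-in for the diff size and compares the text just before each candidate offset with the text just before the reference marker. `editDistance`, `findClosestOffset` and the `windowSize` parameter are hypothetical names introduced only for this illustration.

```javascript
// Levenshtein edit distance, used here as a stand-in for "size of the diff".
function editDistance(a, b) {
	var prev = [];
	for (var j = 0; j <= b.length; j++) prev[j] = j;
	for (var i = 1; i <= a.length; i++) {
		var curr = [i];
		for (var k = 1; k <= b.length; k++) {
			var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
			curr[k] = Math.min(prev[k] + 1, curr[k - 1] + 1, prev[k - 1] + cost);
		}
		prev = curr;
	}
	return prev[b.length];
}

// Hypothetical helper: try every candidate offset in `html` and keep the one
// whose preceding context differs least from the context before the
// reference marker.
function findClosestOffset(referenceHtml, referenceStart, html, offsets, windowSize) {
	var refContext = referenceHtml.slice(Math.max(0, referenceStart - windowSize), referenceStart);
	var best = { offset: -1, distance: Infinity };
	offsets.forEach(function (offset) {
		var context = html.slice(Math.max(0, offset - windowSize), offset);
		var distance = editDistance(refContext, context);
		if (distance < best.distance) best = { offset: offset, distance: distance };
	});
	return best;
}
```

Because only raw strings are compared, nothing in this sketch requires the HTML to be well-formed, which matches the point made above about dirty input.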

To reduce the number of diffs to brute-force, all defined contents must be between tags (see the result in the example code below). That can be any kind of tag: an opening tag, a closing tag, or both. TODO: This must be fixed for version 0.0.0.0.0.1
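The tag restriction can be pictured as follows. This is a rough sketch under stated assumptions, not code taken from the module: it assumes a fragment may only start where a tag starts (`<`) and end where a tag ends (just after `>`), which is consistent with the example result shown further down. `tagBoundaries`, `candidateRanges` and `maxLength` are hypothetical names.

```javascript
// Collect the offsets where a tag starts ('<') and where a tag ends
// (the position just after '>'). No HTML parsing is involved.
function tagBoundaries(html) {
	var starts = [];
	var ends = [];
	for (var i = 0; i < html.length; i++) {
		var ch = html.charAt(i);
		if (ch === '<') starts.push(i);
		if (ch === '>') ends.push(i + 1);
	}
	return { starts: starts, ends: ends };
}

// Hypothetical helper: list every start/end pair that lies on tag
// boundaries and is no longer than maxLength characters.
function candidateRanges(html, maxLength) {
	var boundaries = tagBoundaries(html);
	var ranges = [];
	boundaries.starts.forEach(function (start) {
		boundaries.ends.forEach(function (end) {
			if (end > start && end - start <= maxLength) {
				ranges.push({ start: start, end: end });
			}
		});
	});
	return ranges;
}
```

Restricting candidates this way shrinks the search space from every pair of character positions to every pair of tag boundaries, which is the reduction the paragraph above describes.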

You can define a validator function in the reference to increase the chances of a proper match.

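// Assumption: the package exports the Parser constructor; check the package's entry point.
var Parser = require('dirty-html-content-parser');

// referenceHtml and html are strings you have loaded yourself.
var parser = new Parser();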
parser.reference('title', {
	html: referenceHtml,
	start: 33431,
	end: 33479,
	validator: function (data) {
		if (data.indexOf('<h1>') === 0) return true;
		return false;
	}
});
parser.reference('author', {
	html: referenceHtml,
	start: 33482,
	end: 33533,
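	// Illustrative validator: accept only candidates that start like the author block
	validator: function (data) {
		return data.indexOf('<br />') === 0;
	}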
});
parser.parse(html, function (data) {
	console.dir(data);
	/*
		Example result:
		{
			title: '<h1>Example title</h1>',
			author: '<br />John Doe, Bagarmossen</div>'
		}
	*/
});
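The `start` and `end` values in the example are hard-coded character offsets into the reference document. One possible way to obtain them, sketched here with a hypothetical helper `markerFor`, is to search the reference HTML for a snippet you know it contains. This assumes `end` is the offset just past the last character of the fragment; the README does not say whether `end` is inclusive or exclusive, so verify this against the module before relying on it.

```javascript
// Hypothetical helper: build the { html, start, end } part of a reference
// from a snippet that occurs exactly once in the reference HTML.
// Assumption: `end` is exclusive (start + snippet length).
function markerFor(referenceHtml, snippet) {
	var start = referenceHtml.indexOf(snippet);
	if (start === -1) {
		throw new Error('Snippet not found in reference HTML');
	}
	if (referenceHtml.indexOf(snippet, start + 1) !== -1) {
		throw new Error('Snippet occurs more than once in reference HTML');
	}
	return { html: referenceHtml, start: start, end: start + snippet.length };
}

// Usage: locate a known title fragment in a small reference document.
var exampleReference = '<div><h1>Reference title</h1><br />Jane Doe</div>';
var titleMarker = markerFor(exampleReference, '<h1>Reference title</h1>');
console.log(titleMarker.start, titleMarker.end);
```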

Install

npm i dirty-html-content-parser

Version

0.0.10

License

GPLv3

Collaborators

  • alfredgodoy