Gumbo Parser
Using Google's gumbo parser to parse HTML in node.
var gumbo = require("gumbo-parser");
var tree = gumbo(htmlstring);
Usage
There's only one method: gumbo(htmlstring).
You can also pass in an optional options argument.
Returns:
  // if you use normal document mode:
  document: // the document element (see 'Document' below)
  root: // the html element (see 'Element' below)
  // if you use fragment parsing:
  childNodes: list

Element:
  nodeName: same as tagName
  nodeType: 1
  tagName: normalized to lowercase
  originalTag: original text from tag
  closing tag: original closing tag from the original text, if there was one
  childNodes -> replicating childNodes rather than children, i.e. all text / comment children are included
  tagNamespace: "HTML", "SVG" or "MATHML"
  node position -> if the element is inserted by the parser, this value is undefined

TextNode:
  nodeName: #text or #cdata-section
  nodeType: 3
  Note: In DOM3, CDATA is marked as nodeType 4. However, after checking that neither Firefox, Chrome nor Safari marks CDATA that way, and that CDATA is gone in DOM4, I decided to stick with the futuristic alternative.

Document:
  nodeName: #document
  nodeType: 9
  hasDoctype: true/false
  name: string -> see below
  publicIdentifier: string -> see below
  systemIdentifier: string -> see below

CommentNode:
  nodeName: #comment
  nodeType: 8
  comment content (same as textContent)

Attribute:
  name: attribute name
  value: attribute value
  nodeType: number 2
  nameStart: position
  nameEnd: position
  valueStart: position
  valueEnd: position

Position:
  line: number
  column: number
  offset: number
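As a rough illustration of how the node shapes above can be consumed, the sketch below walks a tree and collects its text. The sample tree is hand-built to mimic the documented structure rather than produced by gumbo itself. collectText is a hypothetical helper, not part of the module, and the textContent property on text nodes is an assumption.

```javascript
// Hypothetical helper: gathers the text of every text node under `node`.
// Assumes elements have nodeType 1 plus childNodes, and text nodes have
// nodeType 3. The `textContent` property on text nodes is an assumption.
function collectText(node) {
  if (node.nodeType === 3) {
    return node.textContent;
  }
  // Nodes without childNodes (for example comments, nodeType 8) add nothing.
  if (!node.childNodes) {
    return "";
  }
  return node.childNodes.map(collectText).join("");
}

// Hand-built sample mimicking a parse of "<p>Hello <!-- hi --><b>world</b></p>".
var sampleRoot = {
  nodeType: 1,
  tagName: "p",
  childNodes: [
    { nodeType: 3, textContent: "Hello " },
    { nodeType: 8, textContent: " hi " },
    {
      nodeType: 1,
      tagName: "b",
      childNodes: [{ nodeType: 3, textContent: "world" }]
    }
  ]
};

console.log(collectText(sampleRoot));
```

Because childNodes includes text and comment children, the walker has to branch on nodeType instead of assuming every child is an element.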
About html doctypes
An html document will always have the document.name "html".
If the doctype contains anything else, as an HTML 4 doctype does,
the first part within quotation marks will end up in document.publicIdentifier,
and the second part will be in document.systemIdentifier.
You can read more about this here: http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#syntax-doctype.
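To make the mapping concrete, here is a small sketch that splits a doctype string into the three fields described above. It uses a simplified regular expression, not gumbo's actual tokenizer, and splitDoctype is a hypothetical helper. The HTML 4.01 Strict doctype serves as sample input, and empty strings stand in for missing identifiers, which is an assumption of this sketch and not documented behaviour.

```javascript
// Hypothetical helper: a simplified doctype splitter for illustration only.
// It handles the plain and PUBLIC forms with double-quoted identifiers.
function splitDoctype(doctype) {
  var pattern = /^<!DOCTYPE\s+(\S+?)(?:\s+PUBLIC\s+"([^"]*)"(?:\s+"([^"]*)")?)?\s*>$/i;
  var match = pattern.exec(doctype.trim());
  if (!match) {
    return null;
  }
  return {
    name: match[1].toLowerCase(),        // "html" for any html document
    publicIdentifier: match[2] || "",    // first quoted part
    systemIdentifier: match[3] || ""     // second quoted part
  };
}

var html4 = splitDoctype(
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
);
console.log(html4.name);             // html
console.log(html4.publicIdentifier); // -//W3C//DTD HTML 4.01//EN
console.log(html4.systemIdentifier); // http://www.w3.org/TR/html4/strict.dtd
```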
Untrusted content / XSS cleaning
If you plan on using gumbo-parser to clean user input, note that the gumbo parser is one of the most well-tested and audited parsers available. Please read this comment from the gumbo-parser authors. There's also a node module for XSS cleaning with the gumbo parser. Check out Gumbo-Sanitize!
Node 0.8
Contrary to what I previously said, node-gumbo-parser does build under node 0.8. You might have to run npm update -g npm first, though.
Build and test:
node-gyp configure
node-gyp build
npm test
Changes
0.2.2 Update to use the latest NAN API, so it works with node 4.0
0.2.1 Celebrating some new stuff with a MINOR version change
  * Fragment parsing supports fragmentContext and fragmentNamespace
  Uses version 0.10.1. Big changes from the gumbo-parser team:
  * Fragment parsing (instead of my homebrew fragment parsing, the gumbo C lib now supports fragments)
  * Parses all html5lib tests, including template
  * 30-40% speed improvement
  See all changes here
0.1.13 Upgrade C lib. Uses version 0.9.3: CDATA handling (see note in docs). See all changes here
0.1.12 io.js support! Thanks a lot to MicroMike
0.1.11 Upgrade C lib. Uses version 0.9.2: performance improvements, duplicate attributes, semicolon fix. See all changes here
0.1.10 Visual Studio bugfix. Thanks takenspc
0.1.9 Experimental fragment parsing. Expose node positions from the parser, which also enables the user to see if an element is inserted by the parser or was in the text. Update gumbo parser to a more secure version. Update statement about security.
0.1.8 Fix for BSD build problem
0.1.7 Fixes for build on snow leopard
0.1.6 Adding originalTag, originalTagName and tagNamespace. If the tag is unknown, parse originalTag and set it as the tag
0.1.5 Updating the gumbo-parser to the latest version. This includes some security fixes, and if you use this for user content, please update.
0.1.4 Temporary workaround for the latest changes in node 0.11, thanks Daniel
0.1.3 Fixes utf-8 bug, thanks Yonatan
0.1.2 Taking the (optional) options argument, providing publicIdentifier and systemIdentifier for the doctype
0.1.1 Fix build on node 0.8
0.1.0 Passing { document: document, root: root } instead of only root