normalize.js

0.0.1 • Public • Published

Nothing to see yet, just some research for now.

Research

Keywords

  • table detection
  • list detection
  • ad detection
  • training data
  • feature detection
  • automatic annotation
  • web content extraction
  • wrapper induction
  • structured data extraction (wrapper generation)
  • Google Sets
  • WebTables
  • template detection
  • schema-mapping problem
  • ordered tree isomorphism
  • layout-based document clustering
  • web information extraction (WIE)
  • site-level template detection
  • page-level template detection
  • visual block extraction
  • noise elimination
  • VIPS algorithm
  • noise-block detection
  • block segmentation algorithm
  • informative content
  • information retrieval
  • Visual Clustering Extractor (VCE)
  • block tree construction
  • NIT (Node Information Threshold)
  • determining page layout
  • maximal semantic block
  • page segmentation
  • content preparation
  • visual area classification
  • web document analysis (WDA)
  • support vector machine (SVM)
  • principle segmentation (process of segmenting a web page into macro elements "header", "footer", "sidebar", "main content")
  • visual boundary detection in html

Table Detection

List Detection

Ad Detection

Sentence Detection

Page Template / Fragment Detection

Checks all pages for a domain and figures out which DOM nodes are statistically most likely part of the template. That info can then be used to remove things like sidebars, menus, and ads.
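
A minimal sketch of that idea, assuming each crawled page is represented as an array of DOM paths (the `detectTemplatePaths` name and the 0.8 threshold are made up here, not part of any plan above):

```javascript
// Given one array of DOM paths per crawled page, return the paths that
// appear on at least `threshold` of the pages -- those are statistically
// likely to be part of the site template (menus, sidebars, ads).
function detectTemplatePaths(pages, threshold = 0.8) {
  const counts = new Map();
  for (const paths of pages) {
    for (const path of new Set(paths)) {
      counts.set(path, (counts.get(path) || 0) + 1);
    }
  }
  const template = [];
  for (const [path, count] of counts) {
    if (count / pages.length >= threshold) template.push(path);
  }
  return template;
}
```

Anything the sketch flags could then be removed before content extraction; anything below the threshold is candidate content.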

Content Extraction

Image Boundary Detection

Take a screenshot of the webpage with PhantomJS and compute the pixel locations of boundaries, then use that information to guide DOM traversal.

Semantic Annotation

Code

jsdom is a headless DOM implementation for Node.js. Cheerio allows DOM traversal but doesn't execute JavaScript or apply CSS, so no visual information is available.

Other

Tools

Notes

  • Feature selection is a crucial step in any machine learning based methods.
  • A content block can be identified by the appearance of the same block across multiple web pages. The algorithm first partitions the web page into blocks based on different HTML tags, then classifies each block as either a content block or a non-content block. It compares a block, B, with the stored blocks to check whether it is similar to one of them; if so, there is no need to store that block again.
  • Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes.
  • MAMA (Metadata Analysis and Mining Application) is a structural Web page search engine from Opera Software that crawls Web pages and returns results detailing page structures. If we look into MAMA’s key findings, we see that the average website has a table structure nested three levels deep. On the list of 10 most popular tags, table, td and tr are all there. The table element is found on over 80% of the pages whose URLs were crawled by MAMA.
  • Removal or down-weighting of content in template blocks tends to increase the accuracy of data mining tasks such as classification or clustering.
  • The amount of text and the number of links in every node are analyzed, and a heuristic measure is used to determine the node (or set of nodes) most likely to contain the main content. For every node in the DOM tree, two counts are maintained: textCnt, which holds the number of words contained in the node, and linkCnt, which holds the number of links in or below the node.
  • A maximal semantic block, or simply block, is the largest of the consistent frames on the path from a leaf to the root of a frame tree. Thus, it is likely to be the largest possible cluster containing semantically related pieces of content.
  • The FindBlocks algorithm is used to find the blocks in a frame tree. The algorithm runs a depth-first search over the frame tree and recursively determines whether the frames are consistent, ignoring the alignment of leaf frames.
  • The Partition algorithm is used to find the partitions in a block in the block tree. The algorithm runs bottom-up over the frame subtree formed by a block. For each node N, it determines the maximal repeating pattern(s) in presentation style in N’s children.
  • A separator is either one of the HTML tags HR or P, or is white space (horizontal or vertical) between two adjacent nodes that is greater than the mean amount of white space between adjacent nodes in this node list.
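
The textCnt/linkCnt heuristic above can be sketched like this (the scoring formula and the 0.5 link-density penalty are assumptions, not values from the cited work):

```javascript
// Pick the node most likely to hold the main content: maximize word count
// while penalizing link density (links / words). Nodes are plain objects
// carrying the textCnt/linkCnt counts described in the notes above.
function mainContentNode(nodes) {
  let best = null;
  let bestScore = -Infinity;
  for (const node of nodes) {
    const linkDensity = node.textCnt > 0 ? node.linkCnt / node.textCnt : 1;
    const score = node.textCnt * (1 - 0.5 * linkDensity); // assumed weight
    if (score > bestScore) {
      bestScore = score;
      best = node;
    }
  }
  return best;
}
```

A link-heavy navigation block scores low even when it contains many words, while a long, sparsely linked article body scores high.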

What it does (or should do)

  • removes <br /> tags and converts them appropriately into new paragraphs
  • converts free-text lists to HTML lists
  • normalizes tables
  • converts sections into actual <section> blocks
  • removes comments/meta/styles from the body
  • annotates dates and times
  • annotates addresses/locations
  • knows what the site assets are, so it can filter out images that are just assets (such as a checkmark or an icon)
    • extracts images like wget does.
  • keeps important canvas/object/embed/video/img tags, removes unimportant ones.
  • places headers of the same level at the same DOM depth
  • wraps headers and their content in a section tag.
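
The <br />-to-paragraph conversion could start as a string-level sketch like the following (a real implementation would operate on the DOM; the `brsToParagraphs` helper and the double-<br> convention are assumptions):

```javascript
// Treat runs of two or more <br> tags as paragraph breaks and wrap the
// resulting chunks in <p> tags. Single <br> tags are left alone.
function brsToParagraphs(html) {
  return html
    .split(/(?:<br\s*\/?>\s*){2,}/i)
    .map(part => part.trim())
    .filter(part => part.length > 0)
    .map(part => `<p>${part}</p>`)
    .join('\n');
}
```
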

Phases

  • normalize the html
  • extract metadata from normalized html
  • output html or json

Content Extraction Methodology

It checks each DOM node and figures out at what level the content sits. So if there are three paragraphs of sufficient text length all five levels deep (and all siblings), then we can assume the parent DOM element is the content area. If the parent is inside a list item, then perhaps it's a comment thread.

If the content DOM elements have class/id matching common patterns ("content", "title", "main", and others TBD), then increase its score slightly.

A lot of the DOM nodes will have class names with other prefixes/suffixes, such as "yui-content" or "yan-content", so it should just match a pattern.
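
A sketch of that fuzzy class/id matching (the pattern lists and weights are placeholders, TBD as noted above; substring matching is deliberately naive, so e.g. "ad" also matches "header"):

```javascript
// Substring match so prefixed names like "yui-content" or "yan-content"
// still score. Both pattern lists are assumptions to be tuned.
const CONTENT_PATTERN = /(content|main|article|title|entry|post)/i;
const NOISE_PATTERN = /(sidebar|footer|header|menu|nav|ad|comment|widget)/i;

function classScore(className) {
  let score = 0;
  if (CONTENT_PATTERN.test(className)) score += 1;
  if (NOISE_PATTERN.test(className)) score -= 1;
  return score;
}
```

This would only nudge a node's overall score slightly, as described above, rather than decide the classification on its own.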

If it's in question/answer form, maybe we wrap them in section tags.

Examples

Comment Threads

Tools

Development

cd test/fixtures
wget -x -F http://en.wikipedia.org/wiki/Main_Page.html

Example pages to train with

Tasks

First, crawl the links people post on facebook and twitter most often and get a list of the class and id structures on the dom:

body > div.mw-body > h1.firstHeading
body > div.mw-body > div > div
body > div.mw-body > div > div.mw-jump > a
body > div.mw-body > div > div.mw-content-ltr > div.dablink > a
body > div.mw-body > div > div.mw-content-ltr > table.infobox > tr
body > div.mw-body > div > div.mw-content-ltr > p > strong
body > div.mw-body > div > div.mw-content-ltr > p > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > p > a
body > div.mw-body > div > div.mw-content-ltr > p > sup.reference
body > div.mw-body > div > div.mw-content-ltr > table.toc
body > div.mw-body > div > div.mw-content-ltr > h2
body > div.mw-body > div > div.mw-content-ltr > p
body > div.mw-body > div > div.mw-content-ltr > ul > li > a
body > div.mw-body > div > div.mw-content-ltr > ul > li > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > ul > li
body > div.mw-body > div > div.mw-content-ltr > ul > li > sup.reference
body > div.mw-body > div > div.mw-content-ltr > p > sub
body > div.mw-body > div > div.mw-content-ltr > h3
body > div.mw-body > div > div.mw-content-ltr > div.mainarticle.relarticle.rellink > a
body > div.mw-body > div > div.mw-content-ltr > p > em
body > div.mw-body > div > div.mw-content-ltr > p > a.new
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > caption
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > th
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > th > sup.reference
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td > a
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td > ul > li
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td > a
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td > a > sub
body > div.mw-body > div > div.mw-content-ltr > p > a > small
body > div.mw-body > div > div.mw-content-ltr > div.noprint.portal.tright
body > div.mw-body > div > div.mw-content-ltr > div.references-column-count.references-column-count-2.reflist
body > div.mw-body > div > div.mw-content-ltr > table.mbox-small.metadata.plainlinks
body > div.mw-body > div > div.mw-content-ltr > ul > li > a.external.text
body > div.mw-body > div > div.mw-content-ltr > table.navbox
body > div.mw-body > div > div.printfooter > a
body > div.mw-body > div > div.catlinks
body > div.noprint > div > h5
body > div.noprint > div > ul > li
body > div.noprint > div > div.vectorTabs > h5
body > div.noprint > div > div.vectorTabs > ul > li.selected
body > div.noprint > div > div.vectorTabs > ul > li
body > div.noprint > div > div.emptyPortlet.vectorMenu
body > div.noprint > div > div.vectorTabs > ul > li.collapsible
body > div.noprint > div > div
body > div.noprint > div.portal > h5
body > div.noprint > div.portal > div.body
body > div > ul > li
body > div > ul > li > a
body > div > ul > li.noprint

Then rank the content for the different areas manually. This should allow us to program something that gets the content out.

You can tell from above that the content is all at the 5th level, below mw-content-ltr. We know that because the elements at that level are all "content" elements: table, p, ul... and the text in those tags has sentences.

If all else fails, you can have a database pointing to certain content sections (archetypes) for a url or directory, and this will be a specific way to parse content. Otherwise it can just fall back to the generic versions.
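
The depth analysis described above can be sketched by voting over selector paths (the `contentDepth` helper and the content-tag list are assumptions based on the Wikipedia sample):

```javascript
// Parse "body > div.mw-body > ... > p"-style paths and find the depth at
// which content tags (p, ul, table, h2...) cluster most densely.
const CONTENT_TAGS = new Set(['p', 'ul', 'ol', 'table', 'h2', 'h3', 'blockquote']);

function contentDepth(paths) {
  const votes = new Map();
  for (const path of paths) {
    const parts = path.split('>').map(p => p.trim());
    parts.forEach((part, i) => {
      const tag = part.split('.')[0]; // strip class names
      if (CONTENT_TAGS.has(tag)) {
        const depth = i + 1; // 1-based: body is depth 1
        votes.set(depth, (votes.get(depth) || 0) + 1);
      }
    });
  }
  let best = 0, bestVotes = 0;
  for (const [depth, v] of votes) {
    if (v > bestVotes) { bestVotes = v; best = depth; }
  }
  return best;
}
```

Run against the Wikipedia paths above, the vote lands at depth 5, matching the manual observation that the content sits below mw-content-ltr.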

Decision Tree

  • is this page a page template?
    • you can only tell because you've searched other pages in the same domain/folder, so you have to know whether it's a good idea to do so (home pages aren't good page-template candidates, for example)
    • if yes, then you might have a higher confidence of what the content area is.
    • is it worth the extra computation of searching other pages on the domain?
  • check DOM structure, what information can we get out of the classification of the DOM structure?
    • class name
    • class name and level in the DOM (for example, "content" on the 5th level is a high-score content area perhaps).
  • maybe, first, break pages into visual sections. then you know the navigation area is more likely to have a list of links than paragraphs, stuff like that. This would also make it easier to figure out which tables/divs are used for layout vs. data.
  1. Geometric Representation
  2. Semantic Representation
  3. DOM-based Representation
  4. Partitioning Clustering
  5. Agglomerative Hierarchical Clustering
  6. Density-based Clustering

Features

Table Features

  • borders
  • fonts in header vs. body cells
  • colors in header vs. body
  • font size in header vs. body
  • similarity of text in columns
  • difference in text between header and body
  • difference in text between rows (layout vs. data table)
  • number of cells per row (maybe large in data vs. layout tables)
  • width/height of cells
  • presence of pagination controls
  • date of webpage (might give indication of tables being used for layout - older sites use tables for layout)
  • DOM depth of table
  • tables nested within parent table cells
  • captions associated with table
  • text referring to table (e.g. "see data in figure 3.")
  • presence of comments mentioning what the table is used for
  • dom tree structure

Tables for layout usually...

  • have few rows and few cells per row.
  • have content in cells that is wildly inconsistent in length
  • have much HTML within cells
  • may use colspan / rowspan
  • exist near the top of the DOM
  • not make use of <th> or <thead>
  • contain other tables

Tables for data usually...

  • have more rows and more cells per row
  • have content in cells that is reasonably consistent in length
  • lack structuring HTML within cells (like <div>, <p>; seeing <b>, <strong>, etc does not preclude data)
  • probably not use colspan and very probably not use rowspan
  • not contain other tables
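
The layout-vs-data signals in the two lists above could be combined into a naive score like this (feature names and weights are assumptions; a trained classifier using the full feature list would replace them):

```javascript
// Score a table as 'data' or 'layout' from precomputed features.
function classifyTable(t) {
  let data = 0;
  if (t.rowCount >= 3) data += 1;          // data tables have more rows
  if (t.avgCellsPerRow >= 3) data += 1;    // ...and more cells per row
  if (t.hasThead || t.hasTh) data += 2;    // th/thead strongly suggests data
  if (t.containsTables) data -= 2;         // nested tables suggest layout
  if (t.usesRowspan) data -= 1;            // rowspan rarely used in data
  if (t.cellLengthVariance > 0.5) data -= 1; // wildly inconsistent cell text
  if (t.hasBlockHtmlInCells) data -= 1;    // <div>, <p> inside cells
  return data > 0 ? 'data' : 'layout';
}
```
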

Header Features

  • font size
  • how to remove things like wikipedia's [edit] link?
  • remove links

Text List Item Features

  • find lists that are not marked up as lists
  • presence of colon : before list
  • list margins in relation to parent text

Gallery List Item Features

These are more robust list items.

Definition List Features

Quote Features

  • <blockquote> and <q> tags

Sections of a Web Page

  • main header (logo, search, etc.)
  • main menu
  • main footer
  • main sidebar (1 or 2 sidebars)
  • article header
  • article menu (submenu/local menu, which can be nested, such as on documentation sites with lots of content, but it's then commonly a tree on the sidebar)
  • article body
  • article footer

Visual Block Extraction Flow

  1. visual block extraction
  2. visual separator detection
  3. content structure construction
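
Step 2 can be sketched using the separator definition from the notes above: a gap between adjacent blocks is a separator when it exceeds the mean gap (the rect-free, gaps-only representation is a simplification):

```javascript
// Given the white-space gaps between adjacent sibling blocks (in px),
// return the gaps that exceed the mean -- those are candidate separators.
function findSeparators(gaps) {
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  return gaps
    .map((gap, i) => ({ index: i, gap }))
    .filter(({ gap }) => gap > mean);
}
```
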

Test Client Requirements

  • download a url
  • download all stylesheets, images, and javascripts for that path
  • normalize all paths in downloaded HTML document to relative paths
wget -H -N -k -p http://example.com

Test Phases

  1. be able to draw/highlight sidebars, menus, footers, headers, banners, and main content areas on all demo pages.
  2. be able to highlight all large list items (gallery items, comments, and other repeated items) in the main content area.
  3. be able to determine if main content area is single list of items, page of text, a single table, or landing page with lots of different content block types
  4. classify different types of content that are found in the main content area, broadly. by giving them a score here you'll be able to narrow down the operations to do on the next phase.
  5. in free text, be able to extract all paragraphs/blocks of text
  6. in free text, be able to extract all tables
  7. in free text, be able to extract all lists
  8. in list items, be able to remove repeated elements (like share links)

In the meantime write scripts to be able to:

  • list all the font families/sizes/colors on the page, and the number of dom elements with them.
  • count characters for span level elements, and compare them to background colors and sibling elements
  • headers are almost like models, so the header of a table is the "model".
  • check for things that are indented, which are often definition lists.
  • the NCBI taxonomy item page has a horrible table example
    • maybe table cells with specific attributes are an indicator that they're probably used for layout vs. data table.
    • if there are colons it may be an indicator of a non-html table.
    • if it's all span elements (em, strong, a, etc.) interspersed with br tags, then maybe they're manually making a table.
    • if the table cells (td/th) are filled with just text or span elements, maybe it's a data table. But if there's br tags in there, then maybe not.
  • headers are defined by larger fonts and whitespace
    • find statistics for header: font-size, weight, color, margin, padding, compared to normal margin and padding of page.
  • footnotes are generally smaller and perhaps in italics.
  • if concentration of leaf nodes is almost 100%, then it may be a menu.
  • presence of hr or divs with only borders and no content might be sign of thematic break.
  • if sibling margins alternate (0px, 40px, 0px, 40px or something) then it might be a definition list.
    • for each sibling, gather fonts and margins
  • you can't count on tables using td/th correctly or having a thead
    • but if all cells in a row have the same color, or a different color than most of the other rows (compare color values), then it might be a header (bgcolor).
    • text color may be the same even though row color may differ between header and body
    • border-collapse: if present on all cells, there's a much higher chance it's a data table. Check whether 100% of the cells use border-collapse: separate, or just 90%, and so on.
    • if all the text in one column is centered (or otherwise consistently aligned), that is also a factor.
    • if the table has a border
  • lists are leaf nodes with links not surrounded by other text.
    • $('.main-nav *').filter(function() { return $(this).children().length == 0 });
  • the larger visual blocks probably contain content
  • both menus and galleries have lists of links, so there must be more distinguishing factors.
  • after dividing visual chunks up horizontally, then vertically in the content area.
  • if there are a bunch of siblings with the same class, then that counts. But they can be chunked into rows with divs, so that needs to be subtracted.
  • if pagination controls follow or precede several similar items, then the similar items are more likely a list.
    • once you get to a list, you want to extract out the properties.
  • header: if largest font matches browser title tag.
  • to remove [edit] from wikipedia, remove similar elements from headers.
  • the definition of noise is based on the following assumptions: (1) The more presentation styles that an element node has, the more important it is, and vice versa. (2) The more diverse that the actual contents of an element node are, the more important the element node is, and vice versa. Both these importance values are used in evaluating the importance of an element node.
  • number of styles on an element may be important.
  • diversity of elements in a parent.
  • A character alignment graph (CAG) is used to find text tables in documents.
  • if any items start below a minimum height (say 400px) then it's probably a footer or something not important.
  • you can check if a certain character is aligned, such as in https://bugzilla.mozilla.org/show_bug.cgi?id=654352.
  • determine similarity of sibling item styles and sizes.
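
The alternating-margin idea from the notes (0px, 40px, 0px, 40px suggesting a definition list) might be sketched as follows (the period-2 check and minimum run length are assumptions):

```javascript
// Detect a 0/40/0/40-style margin alternation among siblings, which may
// indicate a manually indented definition list.
function looksLikeDefinitionList(margins) {
  if (margins.length < 4) return false;      // need a minimum run
  for (let i = 2; i < margins.length; i++) {
    if (margins[i] !== margins[i - 2]) return false; // must repeat, period 2
  }
  return margins[0] !== margins[1];          // the two phases must differ
}
```
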

Focus on each one for a few days.

Goals

Should output something like this:

{
  "url": "http://example.com",
  "sections": [
    {
      "type": "header"
    },
    {
      "type": "navigation"
    },
    {
      "type": "sidebar",
      "sections": [
        {
          "type": "navigation"
        },
        {
          "type": "content"
        }
      ]
    },
    {
      "type": "content", // this is the main content
      "sections": [
        {
          "type": "header"
        },
        {
          "type": "content",
          "sections": [
            {
              "type": "blockquote"
            },
            {
              "type": "paragraph"
            },
            {
              "type": "table"
            },
            {
              "type": "paragraph"
            },
            {
              "type": "ul"
            },
            {
              "type": "dl"
            },
            {
              "type": "figure"
            },
            {
              "type": "paragraph"
            }
          ]
        }
      ]
    },
    {
      "type": "footer",
      "sections": [
        {
          "type": "navigation"
        },
        {
          "type": "content"
        }
      ]
    }
  ]
}

Tools

  • be able to copy/paste HTML section as a "demo" of some layout.

PDFtoTEXT

brew install xpdf
brew install swftools # http://blog.9mmedia.com/?p=27
brew install imagemagick
brew install leptonica
brew install tesseract
# install poppler and xpdf
pdftotext <pdf> <output>
pdfimages <pdf> <folder>
# then convert .ppm to .jpg
# one at a time:
# convert pdf-images-001.ppm pdf-images-001.jpg
# batch:
mogrify -format jpg *.ppm
# extract text from images
# adjust image for potentially better ocr:
# convert -sharpen 1 -brightness-contrast 3X30 input.jpg input.tiff
# large images might work better as well:
# convert input.png -resize 400% -type Grayscale input.tiff
# convert -density 200 -units PixelsPerInch -type Grayscale +compress image-of-text.png image-of-text.tiff
convert neitzche-quote.png -resize 400% -type Grayscale input.tiff ; tesseract input.tiff ocr-text-output -l eng ; open ocr-text-output.txt
pdffonts /Users/viatropos/Desktop/A\ Meta-Notation\ for\ Data\ Visualization.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
AYMUFJ+NimbusRomNo9L-Medi            Type 1            yes yes no       7  0
ADQTIV+NimbusRomNo9L-Regu            Type 1            yes yes no       8  0
SSLXCX+NimbusRomNo9L-Regu-Slant_167  Type 1            yes yes no      10  0
GRTKIU+NimbusRomNo9L-ReguItal        Type 1            yes yes no       9  0
HNPRVI+CMEX10                        Type 1            yes yes yes     25  0
LYGYSO+CMR10                         Type 1            yes yes no      20  0
EROJKM+CMSY10                        Type 1            yes yes no      19  0
EROJKM+CMSY10                        Type 1            yes yes yes     26  0
SHDPOJ+CMMI10                        Type 1            yes yes no      23  0
TFINCX+StandardSymL-Slant_167        Type 1            yes yes yes     22  0

Position of Character

var s = window.getSelection();
var r = s.getRangeAt(0);
r.getBoundingClientRect();
// at once:
window.getSelection().getRangeAt(0).getBoundingClientRect();
 
// set selection
// http://stackoverflow.com/questions/6190143/javascript-set-window-selection
var s = window.getSelection();
s.removeAllRanges();
var r = document.createRange();
//r.selectNodeContents(node)
r.setStart(node, 15);
r.setEnd(node, 20); // setEnd also requires a (node, offset) pair
s.addRange(r);
 
function setEndOfContenteditable(contentEditableElement)
{
    var range,selection;
    if(document.createRange)//Firefox, Chrome, Opera, Safari, IE 9+
    {
        range = document.createRange();//Create a range (a range is a like the selection but invisible)
        range.selectNodeContents(contentEditableElement);//Select the entire contents of the element with the range
        range.collapse(false);//collapse the range to the end point. false means collapse to end rather than the start
        selection = window.getSelection();//get the selection object (allows you to change selection)
        selection.removeAllRanges();//remove any selections already made
        selection.addRange(range);//make the range you have just created the visible selection
    }
    else if(document.selection)//IE 8 and lower
    { 
        range = document.body.createTextRange();//Create a range (a range is a like the selection but invisible)
        range.moveToElementText(contentEditableElement);//Select the entire contents of the element with the range
        range.collapse(false);//collapse the range to the end point. false means collapse to end rather than the start
        range.select();//Select the range (make it the visible selection)
    }
}
 
// https://bug-23189-attachments.webkit.org/attachment.cgi?id=26527
// http://code.google.com/p/rangy/
document.designMode
 
var range, selection;
var node = $('#question p').get(0)
range = document.createRange();//Create a range (a range is a like the selection but invisible)
range.selectNodeContents(node);//Select the entire contents of the element with the range
range.collapse(false);//collapse the range to the end point. false means collapse to end rather than the start
selection = window.getSelection();//get the selection object (allows you to change selection)
selection.removeAllRanges();//remove any selections already made
selection.addRange(range);//make the range the visible selection
selection.modify('move', 'left', 'character')
range = selection.getRangeAt(0);
var position = range.getClientRects()[0];
console.log(JSON.stringify(position));

Primary Visual Blocks

  • first find all nodes that are parent's width that add up in height to the parent.

  • then find all nodes in there that are the parent's width. if some are not the parent's width, then you may be in a "leaf component".

  • header:

    • width is parent width
      • inside header:
        • e.g.: there are two stacked nodes the same width as the parent, might be:
          • header + navigation
        • e.g.: nodes aren't as wide, then might be:
          • navigation or header
        • if the height is small enough, you still can't assume it's not the content area because what if the content area is just blank?
  • once you have the first level of vertical blocks, you can test for headers/footers/navigation:

    • headers potentially have:
      • large font
      • image/logo
    • navigation potentially has:
      • many links (ideally list of links)
      • maybe nested links
      • background different than main
      • may be along the top or along the side (and may be on the side even if there is a top main header)
  • maybe at a certain depth the number of nodes increases

  • where the largest font is (and possibly h1), might be the main content area.

  • if there are multiple layers of main navigation, one layer may have a larger font and can be considered 'primary'.

  • write code that can tell you "here are the differences between the main areas"

  • maybe all headers are within a certain height.
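
The first bullet in this section (full-width children whose heights stack up to the parent's height) could be sketched as follows (rects as plain objects; the pixel tolerance is an assumption):

```javascript
// Find the first-level vertical blocks: children that span the parent's
// width and whose heights stack up to (roughly) the parent's height.
function verticalBlocks(parent, children, tolerance = 2) {
  const fullWidth = children.filter(
    c => Math.abs(c.width - parent.width) <= tolerance
  );
  const stacked = [...fullWidth].sort((a, b) => a.top - b.top);
  const totalHeight = stacked.reduce((sum, c) => sum + c.height, 0);
  const coversParent =
    Math.abs(totalHeight - parent.height) <= tolerance * stacked.length;
  return coversParent ? stacked : []; // empty result: we're in a leaf component
}
```
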

Features

These are the machine learning features.

Booleans

hasLeftSidebar
hasRightSidebar
hasHeader
hasNavigation
hasFooter
hasContent
# if it's a single "resource" as a user sees it. 
contentHasArticle
# whether it is a list of items like in a gallery or not (opposite of hasArticle). 
contentHasList # contentHasGallery 
contentHasHeader
contentHasFooter
contentHasDifferentBackgroundColorThanBody
contentHasDifferentBackgroundColorThanSidebar
# leftSidebarIsWiderThanArticle 
# this is unusual, but it does exist. 
leftSidebarWidthIsGreaterThanContent
# leftSidebarIsTallerThanArticle 
leftSidebarHeightIsGreaterThanContent
# all non-leaf descendents of a block are x/y aligned 
frameIsXConsistent

Numbers

leftSidebarLinkCount
leftSidebarListCount
fontSize/color/family/style/case in each area
# and all the same for rightSidebar 

Values

contentBackgroundColor
base colors for any main visual element

Unknowns

  • do links in the main nav point to more external/subdomain paths than sidebar links do? (sidebar links might be more internal)

Random Links

IEEE Articles (from berkeley)

  • GREAT: 6195393
    • build properties tree: position, color, border, font family, font size, node name, and content of node
  • 4285135
