normalize.js

0.0.1 • Public • Published

Nothing to see yet, just some research for now.

Research

Keywords

  • table detection
  • list detection
  • ad detection
  • training data
  • feature detection
  • automatic annotation
  • web content extraction
  • wrapper induction
  • structured data extraction (wrapper generation)
  • Google Sets
  • WebTables
  • template detection
  • schema-mapping problem
  • ordered tree isomorphism
  • layout-based document clustering
  • web information extraction (WIE)
  • site-level template detection
  • page-level template detection
  • visual block extraction
  • noise elimination
  • VIPS algorithm
  • noise-block detection
  • block segmentation algorithm
  • informative content
  • information retrieval
  • Visual Clustering Extractor (VCE)
  • block tree construction
  • NIT (Node Information Threshold)
  • determining page layout
  • maximal semantic block
  • page segmentation
  • content preparation
  • visual area classification
  • web document analysis (WDA)
  • support vector machine (SVM)
  • principle segmentation (process of segmenting a web page into macro elements "header", "footer", "sidebar", "main content")
  • visual boundary detection in html

Table Detection

List Detection

Ad Detection

Sentence Detection

Page Template / Fragment Detection

Checks all pages for a domain and figures out which DOM nodes are statistically most likely part of the template. That info can then be used to remove things like sidebars, menus, and ads.
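
A minimal sketch of that idea, assuming each crawled page is represented as an array of DOM paths (the `detectTemplatePaths` name and the 0.8 threshold are made up here, not part of any plan above):

```javascript
// Given one array of DOM paths per crawled page, return the paths that
// appear on at least `threshold` of the pages -- those are statistically
// likely to be part of the site template (menus, sidebars, ads).
function detectTemplatePaths(pages, threshold = 0.8) {
  const counts = new Map();
  for (const paths of pages) {
    for (const path of new Set(paths)) {
      counts.set(path, (counts.get(path) || 0) + 1);
    }
  }
  const template = [];
  for (const [path, count] of counts) {
    if (count / pages.length >= threshold) template.push(path);
  }
  return template;
}
```

Anything the sketch flags could then be removed before content extraction; anything below the threshold is candidate content.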

Content Extraction

Image Boundary Detection

Take a screenshot of the webpage with PhantomJS and compute the pixel locations of boundaries, then use that information to guide DOM traversal.

Semantic Annotation

Code

jsdom is a headless DOM implementation for Node.js. Cheerio allows DOM traversal but doesn't execute JavaScript or apply CSS, so no visual information is available.

Other

Tools

Notes

  • Feature selection is a crucial step in any machine learning based methods.
  • A content block can be identified by the appearance of the same block across multiple web pages. The algorithm first partitions the web page into blocks based on different HTML tags, then classifies each block as either a content block or a non-content block. It compares a block, B, with the stored blocks to check whether it is similar to one of them; if so, there is no need to store that block again.
  • Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes.
  • MAMA (Metadata Analysis and Mining Application) is a structural Web page search engine from Opera Software that crawls Web pages and returns results detailing page structures. If we look into MAMA’s key findings, we see that the average website has a table structure nested three levels deep. On the list of 10 most popular tags, table, td and tr are all there. The table element is found on over 80% of the pages whose URLs were crawled by MAMA.
  • Removal or down-weighting of content in template blocks tends to increase the accuracy of data mining tasks such as classification or clustering.
  • The amount of text and the number of links in every node are analyzed, and a heuristic measure is used to determine the node (or set of nodes) most likely to contain the main content. For every node in the DOM tree, two counts are maintained: textCnt, which holds the number of words contained in the node, and linkCnt, which holds the number of links in or below the node.
  • A maximal semantic block, or simply block, is the largest of the consistent frames on the path from a leaf to the root of a frame tree. Thus, it is likely to be the largest possible cluster containing semantically related pieces of content.
  • The FindBlocks algorithm is used to find the blocks in a frame tree. The algorithm runs a depth-first search over the frame tree and recursively determines whether the frames are consistent, ignoring the alignment of leaf frames.
  • The Partition algorithm is used to find the partitions in a block in the block tree. The algorithm runs bottom-up over the frame subtree formed by a block. For each node N, it determines the maximal repeating pattern(s) in presentation style in N’s children.
  • A separator is either one of the HTML tags HR or P, or is white space (horizontal or vertical) between two adjacent nodes that is greater than the mean amount of white space between adjacent nodes in this node list.
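
The textCnt/linkCnt heuristic above can be sketched like this (the scoring formula and the 0.5 link-density penalty are assumptions, not values from the cited work):

```javascript
// Pick the node most likely to hold the main content: maximize word count
// while penalizing link density (links / words). Nodes are plain objects
// carrying the textCnt/linkCnt counts described in the notes above.
function mainContentNode(nodes) {
  let best = null;
  let bestScore = -Infinity;
  for (const node of nodes) {
    const linkDensity = node.textCnt > 0 ? node.linkCnt / node.textCnt : 1;
    const score = node.textCnt * (1 - 0.5 * linkDensity); // assumed weight
    if (score > bestScore) {
      bestScore = score;
      best = node;
    }
  }
  return best;
}
```

A link-heavy navigation block scores low even when it contains many words, while a long, sparsely linked article body scores high.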

What it does (or should do)

  • removes <br /> tags and converts them appropriately into new paragraphs
  • converts free-text lists to HTML lists
  • normalizes tables
  • converts sections into actual <section> blocks
  • removes comments/meta/styles from the body
  • annotates dates and times
  • annotates addresses/locations
  • knows what the site assets are, so it can filter out images that are just assets (such as a checkmark or an icon)
    • extracts images like wget does.
  • keeps important canvas/object/embed/video/img tags, removes unimportant ones.
  • places headers of the same level at the same DOM depth
  • wraps headers and their content in a section tag.
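
The <br />-to-paragraph conversion could start as a string-level sketch like the following (a real implementation would operate on the DOM; the `brsToParagraphs` helper and the double-<br> convention are assumptions):

```javascript
// Treat runs of two or more <br> tags as paragraph breaks and wrap the
// resulting chunks in <p> tags. Single <br> tags are left alone.
function brsToParagraphs(html) {
  return html
    .split(/(?:<br\s*\/?>\s*){2,}/i)
    .map(part => part.trim())
    .filter(part => part.length > 0)
    .map(part => `<p>${part}</p>`)
    .join('\n');
}
```
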

Phases

  • normalize the html
  • extract metadata from normalized html
  • output html or json

Content Extraction Methodology

It checks each DOM node and figures out at what level the content sits. So if there are three paragraphs of sufficient text length all five levels deep (and all siblings), then we can assume the parent DOM element is the content area. If the parent is inside a list item, then perhaps it's a comment thread.

If the content DOM elements have class/id matching common patterns ("content", "title", "main", and others TBD), then increase its score slightly.

A lot of the DOM nodes will have class names with other prefixes/suffixes, such as "yui-content" or "yan-content", so it should just match a pattern.
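
A sketch of that fuzzy class/id matching (the pattern lists and weights are placeholders, TBD as noted above; substring matching is deliberately naive, so e.g. "ad" also matches "header"):

```javascript
// Substring match so prefixed names like "yui-content" or "yan-content"
// still score. Both pattern lists are assumptions to be tuned.
const CONTENT_PATTERN = /(content|main|article|title|entry|post)/i;
const NOISE_PATTERN = /(sidebar|footer|header|menu|nav|ad|comment|widget)/i;

function classScore(className) {
  let score = 0;
  if (CONTENT_PATTERN.test(className)) score += 1;
  if (NOISE_PATTERN.test(className)) score -= 1;
  return score;
}
```

This would only nudge a node's overall score slightly, as described above, rather than decide the classification on its own.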

If it's in question/answer form, maybe we wrap them in section tags.

Examples

Comment Threads

Tools

Development

cd test/fixtures
wget -x -F http://en.wikipedia.org/wiki/Main_Page.html

Example pages to train with

Tasks

First, crawl the links people post on facebook and twitter most often and get a list of the class and id structures on the dom:

body > div.mw-body > h1.firstHeading
body > div.mw-body > div > div
body > div.mw-body > div > div.mw-jump > a
body > div.mw-body > div > div.mw-content-ltr > div.dablink > a
body > div.mw-body > div > div.mw-content-ltr > table.infobox > tr
body > div.mw-body > div > div.mw-content-ltr > p > strong
body > div.mw-body > div > div.mw-content-ltr > p > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > p > a
body > div.mw-body > div > div.mw-content-ltr > p > sup.reference
body > div.mw-body > div > div.mw-content-ltr > table.toc
body > div.mw-body > div > div.mw-content-ltr > h2
body > div.mw-body > div > div.mw-content-ltr > p
body > div.mw-body > div > div.mw-content-ltr > ul > li > a
body > div.mw-body > div > div.mw-content-ltr > ul > li > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > ul > li
body > div.mw-body > div > div.mw-content-ltr > ul > li > sup.reference
body > div.mw-body > div > div.mw-content-ltr > p > sub
body > div.mw-body > div > div.mw-content-ltr > h3
body > div.mw-body > div > div.mw-content-ltr > div.mainarticle.relarticle.rellink > a
body > div.mw-body > div > div.mw-content-ltr > p > em
body > div.mw-body > div > div.mw-content-ltr > p > a.new
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > caption
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > th
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > th > sup.reference
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td > a
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td > ul > li
body > div.mw-body > div > div.mw-content-ltr > table.wikitable > tr > td > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td > a
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td > a.mw-redirect
body > div.mw-body > div > div.mw-content-ltr > table.sortable.wikitable > tr > td > a > sub
body > div.mw-body > div > div.mw-content-ltr > p > a > small
body > div.mw-body > div > div.mw-content-ltr > div.noprint.portal.tright
body > div.mw-body > div > div.mw-content-ltr > div.references-column-count.references-column-count-2.reflist
body > div.mw-body > div > div.mw-content-ltr > table.mbox-small.metadata.plainlinks
body > div.mw-body > div > div.mw-content-ltr > ul > li > a.external.text
body > div.mw-body > div > div.mw-content-ltr > table.navbox
body > div.mw-body > div > div.printfooter > a
body > div.mw-body > div > div.catlinks
body > div.noprint > div > h5
body > div.noprint > div > ul > li
body > div.noprint > div > div.vectorTabs > h5
body > div.noprint > div > div.vectorTabs > ul > li.selected
body > div.noprint > div > div.vectorTabs > ul > li
body > div.noprint > div > div.emptyPortlet.vectorMenu
body > div.noprint > div > div.vectorTabs > ul > li.collapsible
body > div.noprint > div > div
body > div.noprint > div.portal > h5
body > div.noprint > div.portal > div.body
body > div > ul > li
body > div > ul > li > a
body > div > ul > li.noprint

Then rank the content for the different areas manually. This should allow us to program something that gets the content out.

You can tell from above that the content is all at the 5th level, below mw-content-ltr. We know that because the elements at that level are all "content" elements: table, p, ul... and the text in those tags has sentences.

If all else fails, you can have a database pointing to certain content sections (archetypes) for a url or directory, and this will be a specific way to parse content. Otherwise it can just fall back to the generic versions.
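
The depth analysis described above can be sketched by voting over selector paths (the `contentDepth` helper and the content-tag list are assumptions based on the Wikipedia sample):

```javascript
// Parse "body > div.mw-body > ... > p"-style paths and find the depth at
// which content tags (p, ul, table, h2...) cluster most densely.
const CONTENT_TAGS = new Set(['p', 'ul', 'ol', 'table', 'h2', 'h3', 'blockquote']);

function contentDepth(paths) {
  const votes = new Map();
  for (const path of paths) {
    const parts = path.split('>').map(p => p.trim());
    parts.forEach((part, i) => {
      const tag = part.split('.')[0]; // strip class names
      if (CONTENT_TAGS.has(tag)) {
        const depth = i + 1; // 1-based: body is depth 1
        votes.set(depth, (votes.get(depth) || 0) + 1);
      }
    });
  }
  let best = 0, bestVotes = 0;
  for (const [depth, v] of votes) {
    if (v > bestVotes) { bestVotes = v; best = depth; }
  }
  return best;
}
```

Run against the Wikipedia paths above, the vote lands at depth 5, matching the manual observation that the content sits below mw-content-ltr.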

Decision Tree

  • is this page a page template?
    • you can only tell because you've searched other pages in the same domain/folder, so you have to know whether it's a good idea to do so (home pages aren't good page-template candidates, for example)
    • if yes, then you might have a higher confidence of what the content area is.
    • is it worth the extra computation of searching other pages on the domain?
  • check DOM structure, what information can we get out of the classification of the DOM structure?
    • class name
    • class name and level in the DOM (for example, "content" on the 5th level is a high-score content area perhaps).
  • maybe, first, break pages into visual sections. then you know the navigation area is more likely to have a list of links than paragraphs, stuff like that. This would also make it easier to figure out which tables/divs are used for layout vs. data.
  1. Geometric Representation
  2. Semantic Representation
  3. DOM-based Representation
  4. Partitioning Clustering
  5. Agglomerative Hierarchical Clustering
  6. Density-based Clustering

Features

Table Features

  • borders
  • fonts in header vs. body cells
  • colors in header vs. body
  • font size in header vs. body
  • similarity of text in columns
  • difference in text between header and body
  • difference in text between rows (layout vs. data table)
  • number of cells per row (maybe large in data vs. layout tables)
  • width/height of cells
  • presence of pagination controls
  • date of webpage (might give indication of tables being used for layout - older sites use tables for layout)
  • DOM depth of table
  • tables nested within parent table cells
  • captions associated with table
  • text referring to table (e.g. "see data in figure 3.")
  • presence of comments mentioning what the table is used for
  • dom tree structure

Tables for layout usually...

  • have few rows and few cells per row.
  • have content in cells that is wildly inconsistent in length
  • have much HTML within cells
  • may use colspan / rowspan
  • exist near the top of the DOM
  • not make use of <th> or <thead>
  • contain other tables

Tables for data usually...

  • have more rows and more cells per row
  • have content in cells that is reasonably consistent in length
  • lack structuring HTML within cells (like <div>, <p>; seeing <b>, <strong>, etc does not preclude data)
  • probably not use colspan and very probably not use rowspan
  • not contain other tables
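
The layout-vs-data signals in the two lists above could be combined into a naive score like this (feature names and weights are assumptions; a trained classifier using the full feature list would replace them):

```javascript
// Score a table as 'data' or 'layout' from precomputed features.
function classifyTable(t) {
  let data = 0;
  if (t.rowCount >= 3) data += 1;          // data tables have more rows
  if (t.avgCellsPerRow >= 3) data += 1;    // ...and more cells per row
  if (t.hasThead || t.hasTh) data += 2;    // th/thead strongly suggests data
  if (t.containsTables) data -= 2;         // nested tables suggest layout
  if (t.usesRowspan) data -= 1;            // rowspan rarely used in data
  if (t.cellLengthVariance > 0.5) data -= 1; // wildly inconsistent cell text
  if (t.hasBlockHtmlInCells) data -= 1;    // <div>, <p> inside cells
  return data > 0 ? 'data' : 'layout';
}
```
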

Header Features

  • font size
  • how to remove things like wikipedia's [edit] link?
  • remove links

Text List Item Features

  • find lists that are not marked up as lists
  • presence of colon : before list
  • list margins in relation to parent text

Gallery List Item Features

These are more robust list items.

Definition List Features

Quote Features

  • <blockquote> and <q> tags

Sections of a Web Page

  • main header (logo, search, etc.)
  • main menu
  • main footer
  • main sidebar (1 or 2 sidebars)
  • article header
  • article menu (submenu/local menu, which can be nested, such as on documentation sites with lots of content, but it's then commonly a tree on the sidebar)
  • article body
  • article footer

Visual Block Extraction Flow

  1. visual block extraction
  2. visual separator detection
  3. content structure construction
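
Step 2 can be sketched using the separator definition from the notes above: a gap between adjacent blocks is a separator when it exceeds the mean gap (the rect-free, gaps-only representation is a simplification):

```javascript
// Given the white-space gaps between adjacent sibling blocks (in px),
// return the gaps that exceed the mean -- those are candidate separators.
function findSeparators(gaps) {
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  return gaps
    .map((gap, i) => ({ index: i, gap }))
    .filter(({ gap }) => gap > mean);
}
```
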

Test Client Requirements

  • download a url
  • download all stylesheets, images, and javascripts for that path
  • normalize all paths in downloaded HTML document to relative paths
wget -H -N -k -p http://example.com

Test Phases

  1. be able to draw/highlight sidebars, menus, footers, headers, banners, and main content areas on all demo pages.
  2. be able to highlight all large list items (gallery items, comments, and other repeated items) in the main content area.
  3. be able to determine if main content area is single list of items, page of text, a single table, or landing page with lots of different content block types
  4. classify different types of content that are found in the main content area, broadly. by giving them a score here you'll be able to narrow down the operations to do on the next phase.
  5. in free text, be able to extract all paragraphs/blocks of text
  6. in free text, be able to extract all tables
  7. in free text, be able to extract all lists
  8. in list items, be able to remove repeated elements (like share links)

In the meantime write scripts to be able to:

  • list all the font families/sizes/colors on the page, and the number of dom elements with them.
  • count characters for span level elements, and compare them to background colors and sibling elements
  • headers are almost like models, so the header of a table is the "model".
  • check for things that are indented, which are often definition lists.
  • the NCBI taxonomy item page has a horrible table example
    • maybe table cells with specific attributes are an indicator that they're probably used for layout vs. data table.
    • if there are colons it may be an indicator of a non-html table.
    • if it's all span elements (em, strong, a, etc.) interspersed with br tags, then maybe they're manually making a table.
    • if the table cells (td/th) are filled with just text or span elements, maybe it's a data table. But if there's br tags in there, then maybe not.
  • headers are defined by larger fonts and whitespace
    • find statistics for header: font-size, weight, color, margin, padding, compared to normal margin and padding of page.
  • footnotes are generally smaller and perhaps in italics.
  • if concentration of leaf nodes is almost 100%, then it may be a menu.
  • presence of hr or divs with only borders and no content might be sign of thematic break.
  • if sibling margins alternate (0px, 40px, 0px, 40px or something) then it might be a definition list.
    • for each sibling, gather fonts and margins
  • you can't count on tables using td/th correctly or having a thead
    • but if all cells in a row have the same color, or a different color than most of the other rows (compare color values), then it might be a header (bgcolor).
    • text color may be the same even though row color may differ between header and body
    • border-collapse: if present on all cells, there's a much higher chance it's a data table. Check whether 100% of the cells use border-collapse: separate, or just 90%, and so on.
    • if all the text in one column is centered (or otherwise consistently aligned), that is also a factor.
    • if the table has a border
  • lists are leaf nodes with links not surrounded by other text.
    • $('.main-nav *').filter(function() { return $(this).children().length == 0 });
  • the larger visual blocks probably contain content
  • both menus and galleries have lists of links, so there must be more distinguishing factors.
  • after dividing visual chunks up horizontally, then vertically in the content area.
  • if there are a bunch of siblings with the same class, then that counts. But they can be chunked into rows with divs, so that needs to be subtracted.
  • if pagination controls follow or precede several similar items, then the similar items are more likely a list.
    • once you get to a list, you want to extract out the properties.
  • header: if largest font matches browser title tag.
  • to remove [edit] from wikipedia, remove similar elements from headers.
  • the definition of noise is based on the following assumptions: (1) The more presentation styles that an element node has, the more important it is, and vice versa. (2) The more diverse that the actual contents of an element node are, the more important the element node is, and vice versa. Both these importance values are used in evaluating the importance of an element node.
  • number of styles on an element may be important.
  • diversity of elements in a parent.
  • A character alignment graph (CAG) is used to find text tables in documents.
  • if any items start below a minimum height (say 400px) then it's probably a footer or something not important.
  • you can check if a certain character is aligned, such as in https://bugzilla.mozilla.org/show_bug.cgi?id=654352.
  • determine similarity of sibling item styles and sizes.
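
The alternating-margin idea from the notes (0px, 40px, 0px, 40px suggesting a definition list) might be sketched as follows (the period-2 check and minimum run length are assumptions):

```javascript
// Detect a 0/40/0/40-style margin alternation among siblings, which may
// indicate a manually indented definition list.
function looksLikeDefinitionList(margins) {
  if (margins.length < 4) return false;      // need a minimum run
  for (let i = 2; i < margins.length; i++) {
    if (margins[i] !== margins[i - 2]) return false; // must repeat, period 2
  }
  return margins[0] !== margins[1];          // the two phases must differ
}
```
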

Focus on each one for a few days.

Goals

Should output something like this:

{
  "url": "http://example.com",
  "sections": [
    {
      "type": "header"
    },
    {
      "type": "navigation"
    },
    {
      "type": "sidebar",
      "sections": [
        {
          "type": "navigation"
        },
        {
          "type": "content"
        }
      ]
    },
    {
      "type": "content", // this is the main content
      "sections": [
        {
          "type": "header"
        },
        {
          "type": "content",
          "sections": [
            {
              "type": "blockquote"
            },
            {
              "type": "paragraph"
            },
            {
              "type": "table"
            },
            {
              "type": "paragraph"
            },
            {
              "type": "ul"
            },
            {
              "type": "dl"
            },
            {
              "type": "figure"
            },
            {
              "type": "paragraph"
            }
          ]
        }
      ]
    },
    {
      "type": "footer",
      "sections": [
        {
          "type": "navigation"
        },
        {
          "type": "content"
        }
      ]
    }
  ]
}

Tools

  • be able to copy/paste HTML section as a "demo" of some layout.

PDFtoTEXT

brew install xpdf
brew install swftools # http://blog.9mmedia.com/?p=27
brew install imagemagick
brew install leptonica
brew install tesseract
# install poppler and xpdf
pdftotext <pdf> <output>
pdfimages <pdf> <folder>
# then convert .ppm to .jpg
# one at a time:
# convert pdf-images-001.ppm pdf-images-001.jpg
# batch:
mogrify -format jpg *.ppm
# extract text from images
# adjust image for potentially better ocr:
# convert -sharpen 1 -brightness-contrast 3X30 input.jpg input.tiff
# large images might work better as well:
# convert input.png -resize 400% -type Grayscale input.tiff
# convert -density 200 -units PixelsPerInch -type Grayscale +compress image-of-text.png image-of-text.tiff
convert neitzche-quote.png -resize 400% -type Grayscale input.tiff ; tesseract input.tiff ocr-text-output -l eng ; open ocr-text-output.txt
pdffonts /Users/viatropos/Desktop/A\ Meta-Notation\ for\ Data\ Visualization.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
AYMUFJ+NimbusRomNo9L-Medi            Type 1            yes yes no       7  0
ADQTIV+NimbusRomNo9L-Regu            Type 1            yes yes no       8  0
SSLXCX+NimbusRomNo9L-Regu-Slant_167  Type 1            yes yes no      10  0
GRTKIU+NimbusRomNo9L-ReguItal        Type 1            yes yes no       9  0
HNPRVI+CMEX10                        Type 1            yes yes yes     25  0
LYGYSO+CMR10                         Type 1            yes yes no      20  0
EROJKM+CMSY10                        Type 1            yes yes no      19  0
EROJKM+CMSY10                        Type 1            yes yes yes     26  0
SHDPOJ+CMMI10                        Type 1            yes yes no      23  0
TFINCX+StandardSymL-Slant_167        Type 1            yes yes yes     22  0

Position of Character

var s = window.getSelection();
var r = s.getRangeAt(0);
r.getBoundingClientRect();
// at once:
window.getSelection().getRangeAt(0).getBoundingClientRect();
 
// set selection
// http://stackoverflow.com/questions/6190143/javascript-set-window-selection
var s = window.getSelection();
s.removeAllRanges();
var r = document.createRange();
//r.selectNodeContents(node)
r.setStart(node, 15);
r.setEnd(node, 20); // setEnd also requires a (node, offset) pair
s.addRange(r);
 
function setEndOfContenteditable(contentEditableElement)
{
    var range,selection;
    if(document.createRange)//Firefox, Chrome, Opera, Safari, IE 9+
    {
        range = document.createRange();//Create a range (a range is a like the selection but invisible)
        range.selectNodeContents(contentEditableElement);//Select the entire contents of the element with the range
        range.collapse(false);//collapse the range to the end point. false means collapse to end rather than the start
        selection = window.getSelection();//get the selection object (allows you to change selection)
        selection.removeAllRanges();//remove any selections already made
        selection.addRange(range);//make the range you have just created the visible selection
    }
    else if(document.selection)//IE 8 and lower
    { 
        range = document.body.createTextRange();//Create a range (a range is a like the selection but invisible)
        range.moveToElementText(contentEditableElement);//Select the entire contents of the element with the range
        range.collapse(false);//collapse the range to the end point. false means collapse to end rather than the start
        range.select();//Select the range (make it the visible selection)
    }
}
 
// https://bug-23189-attachments.webkit.org/attachment.cgi?id=26527
// http://code.google.com/p/rangy/
document.designMode
 
var range, selection;
var node = $('#question p').get(0)
range = document.createRange();//Create a range (a range is a like the selection but invisible)
range.selectNodeContents(node);//Select the entire contents of the element with the range
range.collapse(false);//collapse the range to the end point. false means collapse to end rather than the start
selection = window.getSelection();//get the selection object (allows you to change selection)
selection.removeAllRanges();//remove any selections already made
selection.addRange(range);//make the range the visible selection
selection.modify('move', 'left', 'character')
range = selection.getRangeAt(0);
var position = range.getClientRects()[0];
console.log(JSON.stringify(position));

Primary Visual Blocks

  • first find all nodes that are parent's width that add up in height to the parent.

  • then find all nodes in there that are the parent's width. if some are not the parent's width, then you may be in a "leaf component".

  • header:

    • width is parent width
      • inside header:
        • e.g.: there are two stacked nodes the same width as the parent, might be:
          • header + navigation
        • e.g.: nodes aren't as wide, then might be:
          • navigation or header
        • if the height is small enough, you still can't assume it's not the content area because what if the content area is just blank?
  • once you have the first level of vertical blocks, you can test for headers/footers/navigation:

    • headers potentially have:
      • large font
      • image/logo
    • navigation potentially has:
      • many links (ideally list of links)
      • maybe nested links
      • background different than main
      • may be along the top or along the side (and may be on the side even if there is a top main header)
  • maybe at a certain depth the number of nodes increases

  • where the largest font is (and possibly h1), might be the main content area.

  • if there are multiple layers of main navigation, one layer may have a larger font and can be considered 'primary'.

  • write code that can tell you "here are the differences between the main areas"

  • maybe all headers are within a certain height.
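
The first bullet in this section (full-width children whose heights stack up to the parent's height) could be sketched as follows (rects as plain objects; the pixel tolerance is an assumption):

```javascript
// Find the first-level vertical blocks: children that span the parent's
// width and whose heights stack up to (roughly) the parent's height.
function verticalBlocks(parent, children, tolerance = 2) {
  const fullWidth = children.filter(
    c => Math.abs(c.width - parent.width) <= tolerance
  );
  const stacked = [...fullWidth].sort((a, b) => a.top - b.top);
  const totalHeight = stacked.reduce((sum, c) => sum + c.height, 0);
  const coversParent =
    Math.abs(totalHeight - parent.height) <= tolerance * stacked.length;
  return coversParent ? stacked : []; // empty result: we're in a leaf component
}
```
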

Features

These are the machine learning features.

Booleans

hasLeftSidebar
hasRightSidebar
hasHeader
hasNavigation
hasFooter
hasContent
# if it's a single "resource" as a user sees it. 
contentHasArticle
# whether it is a list of items like in a gallery or not (opposite of hasArticle). 
contentHasList # contentHasGallery 
contentHasHeader
contentHasFooter
contentHasDifferentBackgroundColorThanBody
contentHasDifferentBackgroundColorThanSidebar
# leftSidebarIsWiderThanArticle 
# this is unusual, but it does exist. 
leftSidebarWidthIsGreaterThanContent
# leftSidebarIsTallerThanArticle 
leftSidebarHeightIsGreaterThanContent
# all non-leaf descendents of a block are x/y aligned 
frameIsXConsistent

Numbers

leftSidebarLinkCount
leftSidebarListCount
fontSize/color/family/style/case in each area
# and all the same for rightSidebar 

Values

contentBackgroundColor
base colors for any main visual element

Unknowns

  • do links in the main nav point to more external/subdomain paths than sidebar links do? (sidebar links might be more internal)

Random Links

IEEE Articles (from berkeley)

  • GREAT: 6195393
    • build properties tree: position, color, border, font family, font size, node name, and content of node
  • 4285135
