WikiFetch
Author: @benjamincoe
Problem
For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.
I did not want to use a full-featured MediaWiki parser:
- This would be heavy-handed; all I really wanted was the text content of articles, images, and links to other articles.
- I wanted to be able to extend the approach to other websites, e.g., news sites.
- I wanted to use a crawler-based approach, rather than downloading a massive dataset.
The Solution
WikiFetch crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON representation of the page:
    {
      "title": "Foobar Article",
      "links": {
        "Link_to_another_article": {
          "text": "Another article.", // the text that was linked.
          "title": "Another_article.", // title attribute of the <a/> tag.
          "occurrences": 1 // number of times this article was linked.
        }
      },
      "sections": {
        "Section Heading": {
          "text": "text contents of section.",
          "images": ["http://foobar.jpg"] // images occurring within this section.
        }
      }
    }
- Links within sections are replaced with [[article name]], which will have a corresponding entry in links.
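As a rough illustration of how the `[[article name]]` placeholders relate to the `links` map, here is a minimal sketch that swaps each placeholder back to the text that was originally linked. The `expandLinks` helper and the sample object are hypothetical and not part of WikiFetch. The sketch also assumes that the name inside the brackets matches a key in `links` exactly, which may not hold for every article.

```javascript
// Hypothetical helper (not part of WikiFetch): expand [[key]] placeholders
// in each section's text using the matching entry in `links`.
function expandLinks(article) {
  const expanded = {};
  for (const [heading, section] of Object.entries(article.sections)) {
    expanded[heading] = section.text.replace(/\[\[([^\]]+)\]\]/g, (match, key) => {
      // Only use keys the links map actually owns; otherwise keep the placeholder.
      const known = Object.prototype.hasOwnProperty.call(article.links, key);
      return known ? article.links[key].text : match;
    });
  }
  return expanded;
}

// Sample input shaped like the structure described above (made-up values).
const sample = {
  title: "Foobar Article",
  links: {
    Another_article: { text: "another article", title: "Another article", occurrences: 1 }
  },
  sections: {
    "Section Heading": { text: "See [[Another_article]] and [[Missing]].", images: [] }
  }
};

const result = expandLinks(sample);
console.log(result["Section Heading"]);
```

Placeholders without a matching entry are left untouched, so nothing is silently dropped if the assumption about key names turns out to be wrong.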
Usage
    npm install wikifetch -g
    wikifetch --article=Dog