A whole Wikipedia dump, in mongodb.
put your hefty wikipedia dump into mongo, with fully-parsed wikiscript - without thinking, without loading it into memory, grepping, unzipping, or other crazy command-line nonsense.
It's a javascript one-liner that puts a highly-queryable wikipedia on your laptop in a nice afternoon.
It uses wtf_wikipedia to parse wikiscript into almost-nice json.
npm install -g wikipedia-to-mongodb
⚡ From the Command-Line:
wp2mongo /path/to/my-wikipedia-article-dump.xml.bz2
😎 From a nodejs script
var wp2mongo = require('wikipedia-to-mongodb')
then check out the articles in mongo:
$ mongo        #enter the mongo shell
use enwiki     #grab the database
db.wikipedia.find({title: "Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
#  "Populated places established in 1793" ...]
db.wikipedia.count()
# 124,999...
Steps:
1) 💪 you can do this.
you can do this. a few Gb. you can do this.
2) get ready
Install nodejs, mongodb, and optionally redis
# start mongo
mongod --config /mypath/to/mongod.conf
# install wp2mongo
npm install -g wikipedia-to-mongodb
that gives you the global command wp2mongo.
3) download a wikipedia
The Afrikaans wikipedia (around 47,000 artikels) only takes a few minutes to download, and 10 mins to load into mongo on a macbook:
# download an xml dump (38mb, couple minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2
the english/german ones are bigger. Use whichever xml dump you'd like. The download page is weird, but you'll want the most-common dump format, without historical diffs, or images, which is ${LANG}wiki-latest-pages-articles.xml.bz2
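If you switch languages often, the filename pattern can be wrapped in a tiny helper. This is only a sketch: `dumpUrl` is a hypothetical name, and it assumes every language follows the same URL layout as the afwiki example.

```javascript
// Hypothetical helper: build the download URL for a language's latest article dump.
// Assumption: the URL layout matches the afwiki example in this README.
function dumpUrl(lang) {
  const name = `${lang}wiki-latest-pages-articles.xml.bz2`
  return `https://dumps.wikimedia.org/${lang}wiki/latest/${name}`
}

// dumpUrl('af') gives the same address as the wget example
```

Pass it the language code from the wiki's subdomain, such as `af`, `de` or `en`.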
4) get it going
# load it into mongo (10-15 minutes)
wp2mongo ./afwiki-latest-pages-articles.xml.bz2
5) take a bath
just put some epsom salts in there, it feels great. You deserve a break once in a while. The en-wiki dump should take a few hours. Should be done before dinner.
6) check-out your data
to view your data in the mongo console,
$ mongo
use af_wikipedia
//shows a random page
db.wikipedia.find().skip(200).limit(1)
//count the redirects (~5,000 in afrikaans)
db.wikipedia.count({type: "redirect"})
//find a specific page
db.wikipedia.findOne({title: "Toronto"}).categories
Same for the English wikipedia:
the english wikipedia will work under the same process, but the download will take an afternoon, and the loading/parsing a couple hours. The en wikipedia dump is 13 GB (for enwiki-20170901-pages-articles.xml.bz2), and becomes a pretty legit mongo collection uncompressed. It's something like 51GB, but mongo can do it... You can do it!
Options
human-readable plaintext --plaintext
/*
[{
  _id: 'Toronto',
  title: 'Toronto',
  plaintext: 'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
go faster with Redis --worker
there is a much faster way (up to 10x) to import all pages into mongodb, but it is a little more complex. It requires redis installed on your computer, and a worker running in a separate process.
It also gives you a cool dashboard, to watch the progress.
# install redis
sudo apt-get install redis-server # (or `brew install redis` on a mac)
# clone the repo
git clone git@github.com:spencermountain/wikipedia-to-mongodb.git && cd wikipedia-to-mongodb
# load pages into job queue
bin/wp2mongo.js ./afwiki-latest-pages-articles.xml.bz2 --worker
# start processing jobs (parsing articles and saving to mongodb) on all CPUs
node src/worker.js
# you can preview processing jobs in the kue dashboard (localhost:3000)
node node_modules/kue/bin/kue-dashboard -p 3000
skip unnecessary pages --skip_disambig, --skip_redirects
this can make it go faster too, by skipping entries in the dump that aren't full-on articles.
let obj = {
  file: './path/enwiki-latest-pages-articles.xml.bz2',
  db: 'enwiki',
  skip_redirects: true,
  skip_disambig: true,
  skip_first: 1000, // ignore the first 1k pages
  verbose: true // print each article title
}
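One way to picture these options is as a filter that runs before a page is parsed or saved. This is a sketch under assumptions, not the library's own code: `shouldSkip` is a hypothetical helper, redirects are detected by the standard `#REDIRECT` prefix in the raw wikitext (localized redirect keywords are not handled), and disambiguation pages are detected with an English-only template check.

```javascript
// Hypothetical filter: decide whether a page from the dump should be ignored.
// `index` is the page's position in the dump, starting at 0.
function shouldSkip(wikitext, index, options) {
  if (options.skip_first && index < options.skip_first) {
    return true // still inside the first N pages
  }
  if (options.skip_redirects && /^\s*#redirect\b/i.test(wikitext)) {
    return true // e.g. "#REDIRECT [[Toronto]]"
  }
  if (options.skip_disambig && /\{\{\s*(disambiguation|disambig)\s*(\||\}\})/i.test(wikitext)) {
    return true // English-only heuristic, an assumption of this sketch
  }
  return false
}
```

Skipping early is where the speed-up would come from: a skipped page never reaches the parser or the database.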
how it works:
this library uses:
- unbzip2-stream to stream-uncompress the gnarly bz2 file
- xml-stream to stream-parse its xml format
- wtf_wikipedia to brute-parse the article wikiscript contents into JSON.
- redis to (optionally) put wikiscript parsing on separate threads 🤘
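To illustrate the streaming idea (never holding the whole dump in memory), here is a dependency-free sketch. It is a stand-in, not how xml-stream works internally: `createPageSplitter` is a hypothetical helper that buffers incoming text chunks and hands back each complete `<page>` element as soon as it has arrived. It assumes `<page>` tags carry no attributes and never appear unescaped inside article text.

```javascript
// Hypothetical stand-in for a streaming xml parser: feed it text chunks of any
// size, get back whichever <page> elements are complete so far.
function createPageSplitter() {
  const OPEN = '<page>'
  const CLOSE = '</page>'
  let buffer = ''

  return function push(chunk) {
    buffer += chunk
    const pages = []
    while (true) {
      const start = buffer.indexOf(OPEN)
      if (start === -1) {
        // no page has started: keep only a tail that might be a split '<page>' tag
        buffer = buffer.slice(-(OPEN.length - 1))
        break
      }
      const end = buffer.indexOf(CLOSE, start)
      if (end === -1) {
        buffer = buffer.slice(start) // page still incomplete, wait for more data
        break
      }
      pages.push(buffer.slice(start, end + CLOSE.length))
      buffer = buffer.slice(end + CLOSE.length)
    }
    return pages
  }
}
```

In a real pipeline the chunks would come from the decompressed file stream, and each returned page would go on to the wikiscript parser. Memory use then depends on the largest single page and not on the size of the dump.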
Addendum:
_ids
since wikimedia makes all pages have globally unique titles, we also use them for the mongo _id fields.
The benefit is that if it crashes half-way through, or if you want to run it again, running this script repeatedly will not multiply your data. We do an 'upsert' on the record.
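A rough sketch of why re-running is safe, assuming the title doubles as the `_id`. The helper `buildUpsert` is hypothetical and the library's own write path may look different; `updateOne` with `{ upsert: true }` is the standard call in the official mongodb Node.js driver.

```javascript
// Hypothetical helper: builds the arguments for an idempotent write.
function buildUpsert(page) {
  return {
    filter: { _id: page.title }, // the title is the unique key
    update: { $set: page },      // overwrite the fields if the page already exists
    options: { upsert: true }    // insert the page if it doesn't
  }
}

// With the official mongodb driver, this would be used roughly like:
//   const op = buildUpsert(page)
//   await collection.updateOne(op.filter, op.update, op.options)
```

Writing the same page twice hits the same `_id` both times, so the second write replaces the first and no duplicate is created.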
encoding special characters
mongo has some opinions on special-characters in some of its data. It is weird, but we're using this standard(ish) form of encoding them:
\ --> \\
$ --> \u0024
. --> \u002e
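The substitutions above can be sketched as a small pair of helpers. This is a minimal sketch, not the library's actual code: `encodeKey` and `decodeKey` are hypothetical names, and it assumes the escapes are stored as literal text (a backslash followed by `u0024`), since a real `\u0024` escape inside a JavaScript string literal would just be `$` again.

```javascript
// Hypothetical helpers (not part of wikipedia-to-mongodb's API).
// Assumption: the escapes are stored as literal six-character text.

function encodeKey(str) {
  return String(str)
    .replace(/\\/g, '\\\\')    // \  -->  \\   (must run first)
    .replace(/\$/g, '\\u0024') // $  -->  \u0024
    .replace(/\./g, '\\u002e') // .  -->  \u002e
}

function decodeKey(str) {
  // one pass, so an escaped backslash is never mistaken for the start of \u0024
  return String(str).replace(/\\(\\|u0024|u002e)/g, function (match, token) {
    if (token === 'u0024') return '$'
    if (token === 'u002e') return '.'
    return '\\'
  })
}
```

Escaping the backslash first is what keeps the scheme reversible: any backslash already in the text is doubled before the new `\u0024` and `\u002e` sequences are introduced.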
Non-wikipedias
This library should also work on other wikis with standard xml dumps from MediaWiki. I haven't tested them, but wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz-compressed xml dump from your wiki, this should work fine. Open an issue if you find something weird.
PRs welcome!
MIT