# Importing Wikipedia Dumps to MySQL

14 Jul 2013 - Michael Tsikerdekis

It can be quite frustrating importing Wikipedia dumps into a local database. For some Wikipedias, such as the English Wikipedia, it takes a very long time. This is a collection of scripts I've used to import Wikipedia dumps into MySQL.

Note: This guide is based on an Ubuntu server setup
Warning: The restoring process is likely to take weeks for large Wikipedias such as the English Wikipedia

## Step 1: Install MediaWiki

This is the quickest way to create an almost blank MediaWiki database with the same schema Wikipedia uses. You will need a typical LAMP server.

MediaWiki uses InnoDB tables. By default, all InnoDB tables are stored in a single file on disk. That file can never shrink, which can cause problems. It's best to use the MySQL option that creates a separate file on disk per InnoDB table. To do this you need to do the following:

Edit the MySQL configuration file (on Ubuntu this is usually /etc/mysql/my.cnf; on other systems it may be /etc/my.cnf):

```shell
sudo nano /etc/mysql/my.cnf
```

Find the [mysqld] part in the config file and add:

```ini
innodb_file_per_table=1
```

Save the file and then on the terminal restart mysql:

```shell
service mysql restart
```

Note that any existing InnoDB tables will remain in the large ibdata file, but any newly created tables will each get their own file on disk.
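You can confirm that the setting took effect after the restart from the MySQL console:

```sql
SHOW VARIABLES LIKE 'innodb_file_per_table';
-- Value should read ON (or 1 on older MySQL versions)
```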

You will also need to do some fine tuning of MySQL variables such as innodb_buffer_pool_size, innodb_log_buffer_size, and innodb_additional_mem_pool_size. You will have to experiment a bit to see what works best for your hardware.
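As a starting point, a my.cnf fragment might look like the sketch below. The values are illustrative only, not recommendations from this guide; innodb_buffer_pool_size in particular should be sized to your RAM (a common rule of thumb is 50-70% of memory on a dedicated database server):

```ini
[mysqld]
innodb_file_per_table=1
# Illustrative values for a machine with ~8 GB of RAM -- tune for your setup
innodb_buffer_pool_size=4G
innodb_log_buffer_size=16M
innodb_additional_mem_pool_size=32M
```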

Now you need to install MediaWiki by following the Quick Installation Guide for MediaWiki. For the most part, if you know your MySQL root password, MediaWiki can automatically set up the database, tables, and user (if you don't want root to be the account accessing the database).

At this point, if you know exactly which columns of a table you are going to need, you may want to convert some fields to smaller types, so that they still exist (avoiding errors during the import) but won't occupy as much space. A good example is the text table, which contains two blob fields that track changes. If you are not interested in these changes, you can convert these blob fields to varchar(2) or something similar and save space.
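A sketch of what that conversion might look like, using the old_text column of the text table from the standard MediaWiki schema as an example (only do this if you truly do not need the revision text; depending on your sql_mode, oversized values will be truncated or rejected):

```sql
-- Shrink the revision-content column so INSERTs still succeed
-- but the content itself is effectively discarded.
ALTER TABLE text MODIFY old_text varchar(2);
```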

## Step 2: Find dumps and retrieve the list

After the installation, you will need to figure out which dump contains the data that you want. There are many, and they contain dumps for different tables. You can browse some of the dumps here. If you want a Wikipedia in another language, change "enwiki" to the prefix of the language you are interested in (e.g., elwiki for the Greek Wikipedia, cswiki for the Czech Wikipedia, eswiki for the Spanish Wikipedia).

Use this script to create a filelist.txt file containing all files to be downloaded. You will need a proper regular expression to capture the file names automatically. As an alternative, you can type all the file names manually into a file named filelist.txt. You will also need to set the url variable to the directory containing the dumps that interest you.

```python
import urllib2
import re

# main url to retrieve files
url = "http://dumps.wikimedia.org/enwiki/20130102/"
f = urllib2.urlopen(url)
data = f.read()
f.close()

a = []
m = re.findall(r"(enwiki-\d*?-pages-meta-history\d+?\.xml-p.+?\.7z)", data, re.DOTALL)
for item in m:
    if item not in a:
        a.append(item)

print a

f = open("filelist.txt", "w")
for item in a:
    f.write(item + "\n")
f.close()
```
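As a quick sanity check, the regular expression can be tried against a fragment of the directory listing before running the full script (Python 3 syntax below; the filename is illustrative, following the enwiki dump naming scheme, not taken from a real listing):

```python
import re

# The pattern used in the script above to capture the 7z history chunks
pattern = r"(enwiki-\d*?-pages-meta-history\d+?\.xml-p.+?\.7z)"

# Illustrative fragment of the HTML directory listing
sample = 'href="enwiki-20130102-pages-meta-history1.xml-p000000010p000002289.7z"'

matches = re.findall(pattern, sample, re.DOTALL)
print(matches[0])
```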

## Step 3: Start retrieving and importing

For this step you will need to save a few scripts, along with the filelist.txt file from the previous step. There are instructions that you need to follow for some of the files. You may also need to install the p7zip-full Ubuntu package, which provides the 7za command.

Save all of the following on your disk (same directory).

preimport.sql (source Brian Stempin)

```sql
SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
BEGIN;
```

postimport.sql (source Brian Stempin)

```sql
COMMIT;
SET autocommit=1;
SET unique_checks=1;
SET foreign_key_checks=1;
```

mwimport.pl (original source here)

- You can drastically speed up the import process by commenting out the INSERT line that populates the text table. Look for "Comment this to save time" in the code below.
```perl
#!/usr/bin/perl -w

=head1 NAME

mwimport -- quick and dirty mediawiki importer

=head1 SYNOPSIS

  cat pages.xml | mwimport [-s N|--skip=N]

=cut

use strict;
use Getopt::Long;
use Pod::Usage;

my ($cnt_page, $cnt_rev, %namespace, $ns_pattern);
my $committed = 0;
my $skip = 0;

## set this to 1 to match "mwdumper --format=sql:1.5" as close as possible
sub Compat() { 0 }

# 512kB is what mwdumper uses, but 4MB gives much better performance here
my $Buffer_Size = Compat ? 512*1024 : 4*1024*1024;

sub textify($)
{
  my $l;
  for ($_[0]) {
    if (defined $_) {
      s/&quot;/"/ig;
      s/&lt;/</ig;
      s/&gt;/>/ig;
      /&(?!amp;)(.*?;)/ and die "textify: does not know &$1";
      s/&amp;/&/ig;
      $l = length $_;
      s/\\/\\\\/g;
      s/\n/\\n/g;
      s/'/\\'/ig;
      Compat and s/"/\\"/ig;
      $_ = "'$_'";
    } else {
      $l = 0;
      $_ = "''";
    }
  }
  return $l;
}

sub getline()
{
  $_ = <>;
  defined $_ or die "eof at line $.\n";
}

sub ignore_elt($)
{
  m|^\s*<$_[0]>.*?</$_[0]>\n$|
    or die "expected $_[0] element in line $.\n";
  getline;
}

sub simple_elt($$)
{
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    die "expected $_[0] element in line $.\n";
  }
  getline;
}

sub simple_opt_elt($$)
{
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    return;
  }
  getline;
}

sub redirect_elt($)
{
  if (m|^\s*<redirect\s*title="([^"]*)"\s*/>\n$|) {
    $_[0]{redirect} = $1;
  } else {
    simple_opt_elt redirect => $_[0];
    return;
  }
  getline;
}

sub opening_tag($)
{
  m|^\s*<$_[0]>\n$|
    or die "expected $_[0] element in line $.\n";
  getline;
}

sub closing_tag($)
{
  m|^\s*</$_[0]>\n$|
    or die "$_[0]: expected closing tag in line $.\n";
  getline;
}

sub si_nss_namespace()
{
  m|^\s*<namespace key="(-?\d+)"[^/]*?/>()\n$|
    or m|^\s*<namespace key="(-?\d+)"[^>]*?>(.*?)</namespace>\n$|
    or die "expected namespace element in line $.\n";
  $namespace{$2} = $1;
  getline;
}

sub si_namespaces()
{
  opening_tag("namespaces");
  eval {
    while (1) {
      si_nss_namespace;
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected namespace element /
    or die "namespaces: $@";
  $ns_pattern = '^('.join('|', map { quotemeta } keys %namespace).'):';
  closing_tag("namespaces");
}

sub siteinfo()
{
  opening_tag("siteinfo");
  eval {
    my %site;
    simple_elt sitename => \%site;
    simple_elt base => \%site;
    simple_elt generator => \%site;
    $site{generator} =~ /^MediaWiki 1\.20wmf1$/
      or warn("siteinfo: untested generator '$site{generator}', expect trouble ahead\n");
    simple_elt case => \%site;
    si_namespaces;
    print "-- MediaWiki XML dump converted to SQL by mwimport
BEGIN;

-- Site: $site{sitename}
-- URL: $site{base}
-- Generator: $site{generator}
-- Case: $site{case}
--
-- Namespaces:
", map { "--   $namespace{$_}: $_\n" }
     sort { $namespace{$a} <=> $namespace{$b} } keys %namespace;
  };
  $@ and die "siteinfo: $@";
  closing_tag("siteinfo");
}

sub pg_rv_contributor($)
{
  if (m|^\s*<contributor deleted="deleted"\s*/>\s*\n$|) {
    getline;
  } else {
    opening_tag "contributor";
    my %c;
    eval {
      simple_elt username => \%c;
      simple_elt id => \%c;
      $_[0]{contrib_user} = $c{username};
      $_[0]{contrib_id} = $c{id};
    };
    if ($@) {
      $@ =~ /^expected username element /
        or die "contributor: $@";
      eval {
        simple_elt ip => \%c;
        $_[0]{contrib_user} = $c{ip};
      };
      $@ and die "contributor: $@";
    }
    closing_tag "contributor";
  }
}

sub pg_rv_comment($)
{
  if (m|^\s*<comment\s*/>\s*\n$|) {
    getline;
  } elsif (m|^\s*<comment deleted="deleted"\s*/>\s*\n$|) {
    getline;
  } elsif (s|^\s*<comment>([^<]*)||) {
    while (1) {
      $_[0]{comment} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "comment";
  } else {
    return;
  }
}

sub pg_rv_text($)
{
  if (m|^\s*<text xml:space="preserve"\s*/>\s*\n$|) {
    $_[0]{text} = '';
    getline;
  } elsif (m|^\s*<text deleted="deleted"\s*/>\s*\n$|) {
    $_[0]{text} = '';
    getline;
  } elsif (s|^\s*<text xml:space="preserve">([^<]*)||) {
    while (1) {
      $_[0]{text} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "text";
  } else {
    die "expected text element in line $.\n";
  }
}

my $start = time;

sub stats()
{
  my $s = time - $start;
  $s ||= 1;
  printf STDERR "%9d pages (%7.3f/s), %9d revisions (%7.3f/s) in %d seconds\n",
    $cnt_page, $cnt_page/$s, $cnt_rev, $cnt_rev/$s, $s;
}

### flush_rev($text, $rev, $page)
sub flush_rev($$$)
{
  $_[0] or return;
  for my $i (0,1,2) {
    $_[$i] =~ s/,\n?$//;
  }
  print "INSERT INTO text(old_id,old_text,old_flags) VALUES $_[0];\n"; # Comment this to save time
  $_[2] and print "INSERT INTO page(page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES $_[2];\n";
  print "INSERT INTO revision(rev_id,rev_page,rev_text_id,rev_comment,rev_user,rev_user_text,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id) VALUES $_[1];\n";
  for my $i (0,1,2) {
    $_[$i] = '';
  }
}

### flush($text, $rev, $page)
sub flush($$$)
{
  flush_rev $_[0], $_[1], $_[2];
  print "COMMIT;\n";
  $committed = $cnt_page;
}

### pg_revision(\%page, $skip, $text, $rev, $page)
sub pg_revision($$$$$)
{
  my $rev = {};
  opening_tag "revision";
  eval {
    simple_elt id => $rev;
    simple_opt_elt parentid => $rev;
    simple_elt timestamp => $rev;
    pg_rv_contributor $rev;
    simple_opt_elt minor => $rev;
    pg_rv_comment $rev;
    pg_rv_text $rev;
    simple_opt_elt sha1 => $rev;
    simple_opt_elt model => $rev;
    simple_opt_elt format => $rev;
  };
  $@ and die "revision: $@";
  closing_tag "revision";
  $_[1] and return;
  $$rev{id} =~ /^\d+$/
    or return warn("page '$_[0]{title}': ignoring bogus revision id '$$rev{id}'\n");
  $_[0]{latest_len} = textify $$rev{text};
  for my $f (qw(comment contrib_user)) {
    textify $$rev{$f};
  }
  $$rev{timestamp} =~ s/^(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/'$1$2$3$4$5$6'/
    or return warn("page '$_[0]{title}' rev $$rev{id}: ",
                   "bogus timestamp '$$rev{timestamp}'\n");
  $_[2] .= "($$rev{id},$$rev{text},'utf-8'),\n";
  $$rev{minor} = defined $$rev{minor} ? 1 : 0;
  $_[3] .= "($$rev{id},$_[0]{id},$$rev{id},$$rev{comment},"
    .($$rev{contrib_id}||0)
    .",$$rev{contrib_user},$$rev{timestamp},$$rev{minor},0,$_[0]{latest_len},$_[0]{latest}),\n";
  $_[0]{latest} = $$rev{id};
  $_[0]{latest_start} = substr $$rev{text}, 0, 60;
  if (length $_[2] > $Buffer_Size) {
    flush_rev $_[2], $_[3], $_[4];
    $_[0]{do_commit} = 1;
  }
  ++$cnt_rev % 1000 == 0 and stats;
}

### page($text, $rev, $page)
sub page($$$)
{
  opening_tag "page";
  my %page;
  ++$cnt_page;
  eval {
    simple_elt title => \%page;
    simple_opt_elt ns => \%page;
    simple_elt id => \%page;
    redirect_elt \%page;
    simple_opt_elt restrictions => \%page;
    $page{latest} = 0;
    while (1) {
      pg_revision \%page, $skip, $_[0], $_[1], $_[2];
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected revision element /
    or die "page: $@";
  closing_tag "page";
  if ($skip) {
    --$skip;
  } else {
    $page{id} =~ /^\d+$/
      or warn("page '$page{title}': bogus id '$page{id}'\n");
    my $ns;
    if ($page{title} =~ s/$ns_pattern//o) {
      $ns = $namespace{$1};
    } else {
      $ns = 0;
    }
    for my $f (qw(title restrictions)) {
      textify $page{$f};
    }
    if (Compat) {
      $page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect) / ? 1 : 0;
    } else {
      $page{redirect} = $page{latest_start} =~ /^'#REDIRECT /i ? 1 : 0;
    }
    $page{title} =~ y/ /_/;
    if (Compat) {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),"
        ."$page{latest},$page{latest_len}),\n";
    } else {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),NOW()+0,$page{latest},$page{latest_len}),\n";
    }
    if ($page{do_commit}) {
      flush $_[0], $_[1], $_[2];
      print "BEGIN;\n";
    }
  }
}

sub terminate
{
  die "terminated by SIG$_[0]\n";
}

my $SchemaVer = '0.8';
my $SchemaLoc = "http://www.mediawiki.org/xml/export-$SchemaVer/";
my $Schema = "http://www.mediawiki.org/xml/export-$SchemaVer.xsd";

my $help;
GetOptions("skip=i" => \$skip,
           "help"   => \$help) or pod2usage(2);
$help and pod2usage(1);

getline;
m|^<mediawiki \Qxmlns="$SchemaLoc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="$SchemaLoc $Schema" version="$SchemaVer"\E xml:lang="..">$|
  or die "unknown schema or invalid first line\n";
getline;
$SIG{TERM} = $SIG{INT} = \&terminate;
siteinfo;
my ($text, $rev, $page) = ('', '', '');
eval {
  while (1) {
    page $text, $rev, $page;
  }
};
$@ =~ /^expected page element /
  or die "$@ (committed $committed pages)\n";
flush $text, $rev, $page;
stats;
m|^</mediawiki>$|
  or die "mediawiki: expected closing tag in line $.\n";
```

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

You may also redistribute and/or modify this software under the terms
of the GNU Free Documentation License without invariant sections, and
without front-cover or back-cover texts.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

adddb.sh - You will need to change the urll variable to the URL you will be using. Wikipedia releases some files with bz2 compression and others with 7z. I left a commented line in this script that you need to enable if your files are bz2 rather than 7z (don't forget to comment out the 7z line underneath it). Finally, you will need to fill in your MySQL username, password, and database.

```shell
#!/bin/bash
urll='http://dumps.wikimedia.org/enwiki/20130102/'

for i in $( cat filelist.txt )
do
  echo retrieving: $i
  wget $urll$i
  echo mwimport: $i
  #bzcat $i | perl mwimport.pl > temp.sql
  7za e -so $i | perl mwimport.pl > temp.sql
  result=$( tail -n1 temp.sql )
  result1='COMMIT;'
  echo checking...
  echo $result
  if [ "$result" = "$result1" ]
  then
    echo OKAY: $i
    rm $i
    cat preimport.sql temp.sql postimport.sql | mysql -f -u <USER IN MYSQL> -p<PASSWORD> --database=<MEDIAWIKI DB>
    rm temp.sql
  else
    echo FAIL: $i
    exit 1
  fi
done
```
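The success check in the script can be exercised in isolation: mwimport.pl ends a complete conversion with a final COMMIT; statement, so the last line of the generated SQL is compared against that string. A sketch, using a throwaway temp.sql:

```shell
# Simulate a generated dump whose conversion completed (last line is COMMIT;)
printf 'INSERT INTO text VALUES (1);\nCOMMIT;\n' > temp.sql
result=$( tail -n1 temp.sql )
if [ "$result" = "COMMIT;" ]; then
  echo OKAY
else
  echo FAIL
fi
rm temp.sql
```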

## Step 4: Run "sh adddb.sh"

The script will retrieve the first file in filelist.txt, extract it, process it with mwimport.pl, check whether the conversion was successful, and finally add the information to the database. After that, it will delete the file and carry on with the second file in filelist.txt, all the way to the end.

You should preferably run this inside a screen session, so the import keeps going if your SSH connection drops (detach with Ctrl-A then D, reattach later with screen -r). If you don't have screen installed:

```shell
sudo aptitude install screen
screen -t Restoringwiki
```

Then run `sh adddb.sh` inside the screen session.