Importing Wikipedia Dumps into MySQL

14 Jul 2013 - Michael Tsikerdekis

Importing Wikipedia dumps into a local database can be quite frustrating. For some Wikipedias, such as the English Wikipedia, it takes a long time. This is a collection of scripts I've used to import Wikipedia dumps into MySQL.

Note: This guide is based on an Ubuntu server setup
Warning: The restore process is likely to take weeks for large Wikipedias such as the English Wikipedia

Step 1: Install MediaWiki

This is the quickest way to get an almost blank MediaWiki database with the same schema that Wikipedia uses. You will need a typical LAMP server.
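
If you don't already have a LAMP stack, on an Ubuntu server of that era something along these lines should install the basics (package names may differ on newer releases):

sudo apt-get install apache2 mysql-server php5 php5-mysql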

MediaWiki uses InnoDB tables. By default, all InnoDB tables are stored in a single file on disk. That file can never shrink and can cause problems. It's best to use the MySQL option that creates a separate file on disk per InnoDB table. To do this, open the MySQL configuration file:

sudo nano /etc/my.cnf   # or your preferred editor; on Ubuntu the file may be /etc/mysql/my.cnf

Find the [mysqld] part in the config file and add:

innodb_file_per_table=1

Save the file and then restart MySQL from the terminal:

sudo service mysql restart

Note that any existing InnoDB tables will remain in the large ibdata file, but any newly created tables will each get their own file on disk.
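
After the restart you can verify that the option took effect from the MySQL client (assuming you have root access to MySQL):

mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_file_per_table';"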

You will also need to do some fine tuning of MySQL variables such as innodb_buffer_pool_size, innodb_log_buffer_size and innodb_additional_mem_pool_size. You will have to experiment a bit to see what works best for your hardware.
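
For illustration only (the right values depend on how much RAM the server has), the tuning lines in the [mysqld] section might look something like:

innodb_buffer_pool_size=4G
innodb_log_buffer_size=16M
innodb_additional_mem_pool_size=32M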


Now you need to install MediaWiki by following the Quick Installation Guide for MediaWiki. For the most part, if you know your MySQL root password, MediaWiki can automatically set up the database, the tables and the user (in case you don't want root to be the user accessing the database).

At this point, if you know exactly which columns of a table you are going to need, you may want to convert some fields to smaller types, so that they still exist (and don't cause errors) but won't occupy as much space. A good example is the text table, which contains two blob fields that store the revision text and its flags. If you are not interested in the text itself, you could convert these blob fields to varchar(2) or something similar and save space.
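
As a rough sketch (assuming the default MediaWiki table names with no table prefix, and the same placeholders used in adddb.sh further down), that could look like this; it is only worth doing if you also comment out the text INSERT in mwimport.pl as described below:

mysql -u <USER IN MYSQL> -p<PASSWORD> --database=<MEDIAWIKI DB> -e "ALTER TABLE text MODIFY old_text varchar(2), MODIFY old_flags varchar(2);"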

Step 2: Find dumps and retrieve the list

After the installation, you will need to figure out which dump contains the data that you want. There are many, and they cover different tables. You can look at some of the dumps here. If you want a Wikipedia in another language, change "enwiki" to the prefix of the language you are interested in (e.g., elwiki for the Greek Wikipedia, cswiki for the Czech Wikipedia, eswiki for the Spanish Wikipedia).

Use this script to create a filelist.txt file containing all the files to be downloaded. You will need a proper regular expression to capture the file names automatically. As an alternative, you could type all the file names manually into a file named filelist.txt. You will also need to set the url variable to the directory containing the dumps that interest you.

import urllib2
import re

#main url to retrieve files
url = "http://dumps.wikimedia.org/enwiki/20130102/"
f = urllib2.urlopen(url)
data = f.read()

a = []
m = re.findall("(enwiki-\d*?-pages-meta-history\d+?.xml-p.+?\.7z)",data,re.DOTALL)
for item in m:
        if item not in a:
                a.append(item)

print a

f = open("filelist.txt","w")
for file in a:
        f.write(file+"\n")
f.close()
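
Assuming you saved the script as, say, getfilelist.py (the name is arbitrary), run it with Python 2 and sanity-check the result:

python getfilelist.py
wc -l filelist.txt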

Step 3: Start retrieving and importing

For this step you will need to save a couple of scripts. You will also need the filelist.txt file from the previous step. There are instructions that you need to follow for some of the files. Also, you may need to install the p7zip package on Ubuntu (see below).
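
On Ubuntu, the 7za command used by adddb.sh is provided by the p7zip-full package, so installing it up front should look something like:

sudo apt-get install p7zip-full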

Save all of the following on your disk (same directory).

preimport.sql (source Brian Stempin)

SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
BEGIN;

postimport.sql (source Brian Stempin)

COMMIT;
SET autocommit=1;
SET unique_checks=1;
SET foreign_key_checks=1;

mwimport.pl (original source here)

Note: You can drastically speed up the import process by commenting out the INSERT line that adds data to the text table. Look for "Comment this to save time" in the code below.

#!/usr/bin/perl -w
=head1 NAME
 
mwimport -- quick and dirty mediawiki importer
 
=head1 SYNOPSIS
 
cat pages.xml | mwimport [-s N|--skip=N]
 
=cut
 
use strict;
use Getopt::Long;
use Pod::Usage;
 
my ($cnt_page, $cnt_rev, %namespace, $ns_pattern);
my $committed = 0;
my $skip = 0;
 
## set this to 1 to match "mwdumper --format=sql:1.5" as close as possible
sub Compat() { 0 }
 
# 512kB is what mwdumper uses, but 4MB gives much better performance here
my $Buffer_Size = Compat ? 512*1024 : 4*1024*1024;
 
sub textify($)
{
  my $l;
  for ($_[0]) {
    if (defined $_) {
      s/&quot;/"/ig;
      s/&lt;/</ig;
      s/&gt;/>/ig;
      /&(?!amp;)(.*?;)/ and die "textify: does not know &$1";
      s/&amp;/&/ig;
      $l = length $_;
      s/\\/\\\\/g;
      s/\n/\\n/g;
      s/'/\\'/ig;
      Compat and s/"/\\"/ig;
      $_ = "'$_'";
    } else {
      $l = 0;
      $_ = "''";
    }
  }
  return $l;
}
 
sub getline()
{
  $_ = <>;
  defined $_ or die "eof at line $.\n";
}
 
sub ignore_elt($)
{
  m|^\s*<$_[0]>.*?</$_[0]>\n$| or die "expected $_[0] element in line $.\n";
  getline;
}
 
sub simple_elt($$)
{
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    die "expected $_[0] element in line $.\n";
  }
  getline;
}
 
sub simple_opt_elt($$)
{
  if (m|^\s*<$_[0]\s*/>\n$|) {
    $_[1]{$_[0]} = '';
  } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
    $_[1]{$_[0]} = $1;
  } else {
    return;
  }
  getline;
}
 
sub redirect_elt($)
{
  if (m|^\s*<redirect\s*title="([^"]*)"\s*/>\n$|) { # " -- GeSHI syntax highlighting breaks on this line
    $_[0]{redirect} = $1;
  } else {
    simple_opt_elt redirect => $_[0];
    return;
  }
  getline;
}
 
sub opening_tag($)
{
  m|^\s*<$_[0]>\n$| or die "expected $_[0] element in line $.\n";
  getline;
}
 
sub closing_tag($)
{
  m|^\s*</$_[0]>\n$| or die "$_[0]: expected closing tag in line $.\n";
  getline;
}
 
sub si_nss_namespace()
{
  m|^\s*<namespace key="(-?\d+)"[^/]*?/>()\n|
    or m|^\s*<namespace key="(-?\d+)"[^>]*?>(.*?)</namespace>\n|
    or die "expected namespace element in line $.\n";
  $namespace{$2} = $1;
  getline;
}
 
sub si_namespaces()
{
  opening_tag("namespaces");
  eval {
    while (1) {
      si_nss_namespace;
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected namespace element / or die "namespaces: $@";
  $ns_pattern = '^('.join('|',map { quotemeta } keys %namespace).'):';
  closing_tag("namespaces");
}
 
sub siteinfo()
{
  opening_tag("siteinfo");
  eval {
    my %site;
    simple_elt sitename => \%site;
    simple_elt base => \%site;
    simple_elt generator => \%site;
    $site{generator} =~ /^MediaWiki 1.20wmf1$/
      or warn("siteinfo: untested generator '$site{generator}',",
              " expect trouble ahead\n");
    simple_elt case => \%site;
    si_namespaces;
    print "-- MediaWiki XML dump converted to SQL by mwimport
BEGIN;
 
-- Site: $site{sitename}
-- URL: $site{base}
-- Generator: $site{generator}
-- Case: $site{case}
--
-- Namespaces:
"
,map { "-- $namespace{$_}: $_\n" }
  sort { $namespace{$a} <=> $namespace{$b} } keys %namespace;
  };
  $@ and die "siteinfo: $@";
  closing_tag("siteinfo");
}
 
sub pg_rv_contributor($)
{
  if (m|^\s*<contributor deleted="deleted"\s*/>\s*\n|) {
    getline;
  } else {
    opening_tag "contributor";
    my %c;
    eval {
      simple_elt username => \%c;
      simple_elt id => \%c;
      $_[0]{contrib_user} = $c{username};
      $_[0]{contrib_id}   = $c{id};
    };
    if ($@) {
      $@ =~ /^expected username element / or die "contributor: $@";
      eval {
        simple_elt ip => \%c;
        $_[0]{contrib_user} = $c{ip};
      };
      $@ and die "contributor: $@";
    }
    closing_tag "contributor";
  }
}
 
sub pg_rv_comment($)
{
  if (m|^\s*<comment\s*/>\s*\n|) {
    getline;
  } elsif (m|^\s*<comment deleted="deleted"\s*/>\s*\n|) {
    getline;
  } elsif (s|^\s*<comment>([^<]*)||g) {
    while (1) {
      $_[0]{comment} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "comment";
  } else {
    return;
  }
}
 
sub pg_rv_text($)
{
  if (m|^\s*<text xml:space="preserve"\s*/>\s*\n|) {
    $_[0]{text} = '';
    getline;
  } elsif (m|^\s*<text deleted="deleted"\s*/>\s*\n|) {
    $_[0]{text} = '';
    getline;
  } elsif (s|^\s*<text xml:space="preserve">([^<]*)||g) {
    while (1) {
      $_[0]{text} .= $1;
      last if $_;
      getline;
      s|^([^<]*)||;
    }
    closing_tag "text";
  } else {
    die "expected text element in line $.\n";
  }
}
 
my $start = time;
 
sub stats()
{
  my $s = time - $start;
  $s ||= 1;
  printf STDERR "%9d pages (%7.3f/s), %9d revisions (%7.3f/s) in %d seconds\n",
    $cnt_page, $cnt_page/$s, $cnt_rev, $cnt_rev/$s, $s;
}
 
### flush_rev($text, $rev, $page)
sub flush_rev($$$)
{
  $_[0] or return;
  for my $i (0,1,2) {
    $_[$i] =~ s/,\n?$//;
  }
  print "INSERT INTO text(old_id,old_text,old_flags) VALUES $_[0];\n"; #Comment this to save time
  $_[2] and print "INSERT INTO page(page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES $_[2];\n";
  print "INSERT INTO revision(rev_id,rev_page,rev_text_id,rev_comment,rev_user,rev_user_text,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id) VALUES $_[1];\n";
  for my $i (0,1,2) {
    $_[$i] = '';
  }
}
 
### flush($text, $rev, $page)
sub flush($$$)
{
  flush_rev $_[0], $_[1], $_[2];
  print "COMMIT;\n";
  $committed = $cnt_page;
}
 
### pg_revision(\%page, $skip, $text, $rev, $page)
sub pg_revision($$$$$)
{
  my $rev = {};
  opening_tag "revision";
  eval {
    my %revision;
    simple_elt id => $rev;
    simple_opt_elt parentid => $rev;
    simple_elt timestamp => $rev;
    pg_rv_contributor $rev;
    simple_opt_elt minor => $rev;
    pg_rv_comment $rev;
    pg_rv_text $rev;
    simple_opt_elt sha1 => $rev;
    simple_opt_elt model => $rev;
    simple_opt_elt format => $rev;
  };
  $@ and die "revision: $@";
  closing_tag "revision";
  $_[1] and return;
  $$rev{id} =~ /^\d+$/ or return
    warn("page '$_[0]{title}': ignoring bogus revision id '$$rev{id}'\n");
  $_[0]{latest_len} = textify $$rev{text};
  for my $f (qw(comment contrib_user)) {
    textify $$rev{$f};
  }
  $$rev{timestamp} =~
    s/^(\d\d\d\d)-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/'$1$2$3$4$5$6'/
      or return warn("page '$_[0]{title}' rev $$rev{id}: ",
                     "bogus timestamp '$$rev{timestamp}'\n");
  $_[2] .= "($$rev{id},$$rev{text},'utf-8'),\n";
  $$rev{minor} = defined $$rev{minor} ? 1 : 0;
  $_[3] .= "($$rev{id},$_[0]{id},$$rev{id},$$rev{comment},"
    .($$rev{contrib_id}||0)
    .",$$rev{contrib_user},$$rev{timestamp},$$rev{minor},0,$_[0]{latest_len},$_[0]{latest}),\n";
  $_[0]{latest} = $$rev{id};
  $_[0]{latest_start} = substr $$rev{text}, 0, 60;
  if (length $_[2] > $Buffer_Size) {
    flush_rev $_[2], $_[3], $_[4];
    $_[0]{do_commit} = 1;
  }
  ++$cnt_rev % 1000 == 0 and stats;
}
 
### page($text, $rev, $page)
sub page($$$)
{
  opening_tag "page";
  my %page;
  ++$cnt_page;
  eval {
    simple_elt title => \%page;
    simple_opt_elt ns => \%page;
    simple_elt id => \%page;
    redirect_elt \%page;
    simple_opt_elt restrictions => \%page;
    $page{latest} = 0;
    while (1) {
      pg_revision \%page, $skip, $_[0], $_[1], $_[2];
    }
  };
  # note: $@ is always defined
  $@ =~ /^expected revision element / or die "page: $@";
  closing_tag "page";
  if ($skip) {
    --$skip;
  } else {
    $page{id} =~ /^\d+$/
      or warn("page '$page{title}': bogus id '$page{id}'\n");
    my $ns;
    if ($page{title} =~ s/$ns_pattern//o) {
      $ns = $namespace{$1};
    } else {
      $ns = 0;
    }
    for my $f (qw(title restrictions)) {
      textify $page{$f};
    }
    if (Compat) {
      $page{redirect} = $page{latest_start} =~ /^'#(?:REDIRECT|redirect) / ?
        1 : 0;
    } else {
      $page{redirect} = $page{latest_start} =~ /^'#REDIRECT /i ? 1 : 0;
    }
    $page{title} =~ y/ /_/;
    if (Compat) {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),"
          ."DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,"
            ."$page{latest},$page{latest_len}),\n";
    } else {
      $_[2] .= "($page{id},$ns,$page{title},$page{restrictions},0,"
        ."$page{redirect},0,RAND(),NOW()+0,$page{latest},$page{latest_len}),\n";
    }
    if ($page{do_commit}) {
      flush $_[0], $_[1], $_[2];
      print "BEGIN;\n";
    }
  }
}
 
sub terminate
{
  die "terminated by SIG$_[0]\n";
}
 
my $SchemaVer = '0.8';
my $SchemaLoc = "http://www.mediawiki.org/xml/export-$SchemaVer/";
my $Schema    = "http://www.mediawiki.org/xml/export-$SchemaVer.xsd";
 
my $help;
GetOptions("skip=i"             => \$skip,
           "help"               => \$help) or pod2usage(2);
$help and pod2usage(1);
 
getline;
m|^<mediawiki \Qxmlns="$SchemaLoc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="$SchemaLoc $Schema" version="$SchemaVer"\E xml:lang="..">$|
  or die "unknown schema or invalid first line\n";
getline;
$SIG{TERM} = $SIG{INT} = \&terminate;
siteinfo;
my ($text, $rev, $page) = ('', '', '');
eval {
  while (1) {
    page $text, $rev, $page;
  }
};
$@ =~ /^expected page element / or die "$@ (committed $committed pages)\n";
flush $text, $rev, $page;
stats;
m|</mediawiki>| or die "mediawiki: expected closing tag in line $.\n";
 
=head1 COPYRIGHT
 
Copyright 2007 by Robert Bihlmeyer
 
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
 
You may also redistribute and/or modify this software under the terms
of the GNU Free Documentation License without invariant sections, and
without front-cover or back-cover texts.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

adddb.sh - You will need to change the urll variable to the URL that you will be using. Wikipedia releases some files under bz2 compression and others under 7z. I left a commented line in this script that you need to enable in case your files are bz2 and not 7z (don't forget to comment out the line underneath it that deals with 7z). Finally, you will need to add your MySQL username, password, and database.

#!/bin/bash
urll='http://dumps.wikimedia.org/enwiki/20130102/'

for i in $( cat filelist.txt ); do
        echo retrieving: $i
        wget $urll$i
        echo mwimport: $i
        #bzcat $i | perl mwimport.pl > temp.sql
        7za e -so $i |perl mwimport.pl > temp.sql
        result=$( tail temp.sql -n1 )
        result1='COMMIT;'
        echo checking...
        echo $result
        if [ "$result" = "$result1" ]
        then
                echo OKAY: $i
                rm $i
                cat preimport.sql temp.sql postimport.sql | mysql -f -u <USER IN MYSQL> -p<PASSWORD> --database=<MEDIAWIKI DB>
                rm temp.sql
        else
                echo FAIL: $i
                exit
        fi
done
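
Before letting the whole loop run unattended, it can be worth testing the pipeline by hand on a single downloaded archive, along these lines (placeholder file name, same convention as above):

7za e -so <ONE OF THE DOWNLOADED .7z FILES> | perl mwimport.pl | head -n 20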

Step 4: Run "sh adddb.sh"

The script will retrieve the first file in filelist.txt, extract it and process it with mwimport.pl, check whether the conversion was successful and finally add the data to the database. After that, it will delete the file and carry on with the second file in filelist.txt, all the way to the end.

You should preferably do this inside a screen session. If you don't have screen installed:

sudo aptitude install screen
screen -t Restoringwiki
sh adddb.sh

Ctrl+A followed by D will detach the screen; it stays active in the background. To view the progress, you can re-enter that screen by typing "screen -r " and hitting Tab to get the number of the screen automatically.
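
For example (the session ID will be different on your machine):

screen -ls            # list detached screen sessions and their IDs
screen -r <SCREEN ID> # reattach using the ID reported by screen -ls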