MySQL_import  
How-to import Wiktionary database into local MySQL database.
MySQL, xml2sql, import
Updated May 15, 2013 by [email protected]

Introduction

At the moment, there is more information about the import process on the synarcher page.

MediaWiki

My favorite WAMP tutorial pages: Ultimate How To (in Russian).

Download MediaWiki. Unzip it, e.g. to:

D:\all\inetpub\mediawiki-1.16.0.enwikt.20100824 

Open file httpd.conf

gvim "C:\w\Apache Software Foundation\Apache2.2\conf\httpd.conf" 

Add the following text to httpd.conf:

<Directory "D:/all/inetpub">
    Order Allow,Deny
    Allow from All
</Directory>

NameVirtualHost *:80
<VirtualHost *:80>
    DocumentRoot D:/all/inetpub/mediawiki-1.16.0.enwikt.20100824
    ServerName  enwikt
    ErrorLog D:/all/inetpub/logs/wikt_ru_error_log.txt
    CustomLog D:/all/inetpub/logs/wikt_ru_log.txt common
</VirtualHost>
  • Restart Apache.
  • Open in browser http://localhost/mw-config/index.php.
  • Set up the wiki:
    • select database name, e.g. enwikt20100824.
    • select storage engine: InnoDB or MyISAM.
    • select database character set: Binary.

Add the line: $wgCapitalLinks=false; at the end of file LocalSettings.php.
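If you prefer to do this from a shell, here is a minimal sketch; the MW_ROOT variable is an assumption standing in for your MediaWiki install directory (e.g. D:/all/inetpub/mediawiki-1.16.0.enwikt.20100824 under MSYS/Cygwin):

```shell
# Sketch: append the setting to LocalSettings.php.
# MW_ROOT is assumed; point it at your MediaWiki root.
MW_ROOT="${MW_ROOT:-.}"

# Single quotes keep the shell from expanding $wgCapitalLinks.
echo '$wgCapitalLinks = false;' >> "$MW_ROOT/LocalSettings.php"

# Verify the line landed at the end of the file:
tail -n 1 "$MW_ROOT/LocalSettings.php"
```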

Xml2sql

Xml2sql quickly loads a dump file (e.g. enwiktionary-20100824-pages-articles.xml) into MySQL.

Errors (Redirect, DiscussionThreading)

Intro

There are problems with "<redirect />", "<DiscussionThreading>", "<ns>" and "<sha1 />" in the XML file.

The error message is:

./xml2sql: unexpected element <redirect>
./xml2sql: parsing aborted at line 33 pos 16.

Solution

Strip off the tags (redirect and the others) using the Perl script `xml2sql_helper.pl`, invoked as follows:

perl xml2sql_helper.pl in_file out_file

Where in_file is enwiktionary-20100824-pages-articles.xml.
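If the helper script is not at hand, the stripping step can be sketched with sed. This is an assumption-laden substitute, not a copy of `xml2sql_helper.pl`: it takes the tag set (redirect, ns, sha1, DiscussionThreading) from the error list above and assumes each element starts on its own line, which holds for MediaWiki dump layout. A tiny demo input stands in for the real dump file:

```shell
# Demo input standing in for the real dump
# (enwiktionary-20100824-pages-articles.xml).
cat > in_file <<'EOF'
<page>
  <title>word</title>
  <ns>0</ns>
  <redirect title="other" />
  <sha1 />
  <DiscussionThreading>
    <ThreadSubject>x</ThreadSubject>
  </DiscussionThreading>
</page>
EOF

# Delete the elements xml2sql chokes on; multi-line
# <DiscussionThreading> blocks are removed as a line range.
sed -e '/<redirect[^>]*\/>/d' \
    -e '/<ns>.*<\/ns>/d' \
    -e '/<sha1[^>]*\/>/d' -e '/<sha1>.*<\/sha1>/d' \
    -e '/<DiscussionThreading>/,/<\/DiscussionThreading>/d' \
    in_file > out_file

cat out_file   # only <page> and <title> survive
```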

Xml2sql

Download Xml2sql.

$ cd /var/lib/mediawiki-1.16.0.enwikt.20100824/
$ bzip2 -d enwiktionary-20100824-pages-articles.xml.bz2
$ xml2sql -m enwiktionary-20100824-pages-articles.xml

Xml2sql generates three files: page.sql, revision.sql and text.sql
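Before feeding these files to MySQL, a quick sanity check (a sketch, assuming you are in the directory where xml2sql wrote its output) saves time on a broken run:

```shell
# Confirm all three SQL files were produced and are non-empty
# before starting the long SOURCE step in MySQL.
for f in page.sql revision.sql text.sql; do
    if [ -s "$f" ]; then
        echo "ok: $f ($(wc -c < "$f") bytes)"
    else
        echo "MISSING or empty: $f"
    fi
done
```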

MySQL

$ mysql -u root -p
mysql$ USE enwikt20100824;
mysql$ SOURCE /temp/page.sql
mysql$ SOURCE /temp/text.sql
mysql$ SOURCE /temp/revision.sql 

revision.sql

If there are the errors (due to SOURCE revision.sql):

  • "ERROR 1136 (21S01): Column count doesn't match value count at row 1",
  • "ERROR 1406 (22001): Data too long for column 'rev_comment'",
then try the following commands (see MediaWiki/Talk:Revision_table):

remove three fields (rev_len, rev_parent_id, rev_sha1), change rev_comment from TINYBLOB to BLOB, load with SOURCE, and then add the three fields back:

mysql$ ALTER TABLE revision DROP COLUMN rev_len, DROP COLUMN rev_parent_id, DROP COLUMN rev_sha1; 
mysql$ ALTER TABLE revision CHANGE COLUMN `rev_comment` `rev_comment` BLOB NOT NULL;
mysql$ SOURCE /temp/revision.sql

-- restore the three dropped columns
mysql$ ALTER TABLE revision ADD COLUMN rev_len INT(10) UNSIGNED DEFAULT NULL AFTER rev_deleted, ADD COLUMN rev_parent_id INT(10) UNSIGNED DEFAULT NULL AFTER rev_len;
mysql$ ALTER TABLE revision ADD COLUMN rev_sha1 VARBINARY(32) NULL AFTER rev_deleted;

pagelinks and categorylinks

I do not recommend running "SOURCE pagelinks.sql; SOURCE categorylinks.sql;" for very big Wikipedias, e.g. the English Wikipedia; see the description at Synarcher (the text after "Tables pagelinks and categorylinks require special attention...").

The resulting database is ready and can be used as a source for the Wiktionary parser. Let's dump it to a file (just in case):

  • mysqldump --opt -u root -p --compatible=ansi --default-character-set=binary enwikt20100824 > enwikt20100824.sql

Grant privileges

Add a user (e.g. 'javawiki') to the MySQL database and grant privileges at the database level (e.g. on the 'ruwiki' database). Open the MySQL command line and run:

mysql>CREATE USER javawiki;
mysql>GRANT SELECT ON ruwiki.* TO javawiki@'%';
        (with a password, allowing connections from any host: GRANT SELECT ON ruwiki.* TO javawiki@'%' IDENTIFIED BY '12345';)
mysql>FLUSH PRIVILEGES;

