Extension:DumpHTML
dumpHTML | |
---|---|
Release status | stable |
Implementation | Data extraction |
Description | Creates a simple HTML dump of a MediaWiki installation. |
Author(s) | Tim Starling |
Latest version | 1.18.0+ |
License | GPL or Any OSI approved license |
dumpHTML is an extension for generating a simple HTML dump, including images and media files, of a MediaWiki installation. MediaWiki versions before 1.12.0 used the maintenance script dumpHTML.php instead.
Parameters
dumpHTML does not function like a normal extension; you must run it from the command line.
Option/Parameter | Description |
---|---|
-d <dest> | destination directory |
-s <start> | start ID |
-e <end> | end ID |
-k <skin> | skin to use (defaults to offline) |
--no-overwrite | skip existing HTML files |
--checkpoint <file> | use a checkpoint file to allow restarting of interrupted dumps |
--slice <n/m> | split the job into m segments and do the n'th one |
--images | only do image description pages |
--shared-desc | only do shared (commons) image description pages |
--no-shared-desc | don't do shared image description pages |
--categories | only do category pages |
--redirects | only do redirects |
--special | only do miscellaneous stuff |
--interlang | allow interlanguage links |
--image-snapshot | copy all images used to the destination directory |
--compress | generate compressed version of the html pages |
--udp-profile <N> | profile 1/N rendering operations using ProfilerSimpleUDP |
--munge-title <HOW> | available munging algorithms: none, md5, windows |
Example (Linux): create a complete snapshot, including image, media, and thumbnail files, in the directory wikidump:
/usr/bin/php /srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php -d /srv/www/mediawiki/wikidump -k monobook --image-snapshot
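The --slice and --checkpoint options from the table can be combined to split a large dump into restartable segments. The following is only a sketch: it reuses the paths from the example above, assumes that n in --slice n/m counts from 1, and gives each slice its own checkpoint file under /tmp (a hypothetical location and naming scheme, not something the extension prescribes). It prints the commands rather than running them.

```shell
#!/bin/sh
# Paths reuse the example above; adjust them for your installation.
PHP=/usr/bin/php
SCRIPT=/srv/www/mediawiki/extensions/DumpHTML/dumpHTML.php
DEST=/srv/www/mediawiki/wikidump
SEGMENTS=4   # hypothetical number of slices (the m in n/m)

# Print the command for slice $1 of $2.
# Assumption: one checkpoint file per slice, stored under /tmp.
slice_cmd() {
  printf '%s %s -d %s -k monobook --slice %s/%s --checkpoint /tmp/dumphtml-checkpoint-%s.txt\n' \
    "$PHP" "$SCRIPT" "$DEST" "$1" "$2" "$1"
}

n=1
while [ "$n" -le "$SEGMENTS" ]; do
  slice_cmd "$n" "$SEGMENTS"   # dry run: prints the command only
  n=$((n + 1))
done
```

Each printed command can then be run by hand or in a separate terminal; rerunning an interrupted one with the same checkpoint file should let it resume, as described for --checkpoint.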
Known issues
Warning! This extension is not properly maintained at the moment! You may encounter a number of issues. Any help fixing these (especially by sending patches to Gerrit) is greatly appreciated!
Filename problems solved by a modified version of DumpHTML
- Fixed in r115629 via the --munge-title <HOW> option (available munging algorithms: none, md5, windows).
If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and the wiki pages or files have non-ASCII characters in their names (which is likely), you probably need to convert the link references, directories, and filenames from UTF-8 to your Windows character encoding (for example, code page 1252 for Western European systems). Even then, browsers may still have difficulty accessing the files.
Bugzilla 8147 "Filenames in the HTML static dump" has a patch for DumpHTML.inc that converts article, image, thumbnail image, and media filenames to their MD5-hashed versions, which avoids character encoding problems on different operating systems.
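To illustrate why MD5-hashed names sidestep encoding problems, the sketch below hashes a title containing a non-ASCII character and produces a filename made only of hexadecimal digits. The helper munge_md5 is hypothetical and is not part of DumpHTML; the exact scheme used by the patch or by --munge-title md5 may differ (for example in how extensions or directories are handled). It assumes the md5sum utility is available, as on most Linux systems.

```shell
#!/bin/sh
# Hypothetical helper, not part of DumpHTML:
# turn a page title into an ASCII-only filename by hashing it with MD5.
munge_md5() {
  hash=$(printf '%s' "$1" | md5sum | cut -d' ' -f1)
  printf '%s.html\n' "$hash"
}

# A title with a non-ASCII character becomes 32 hex digits plus ".html".
munge_md5 'Émile_Zola'
```

Because the result contains only the characters 0-9 and a-f, it is stored identically on any filesystem, at the cost of filenames that are no longer human-readable.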
Skin hacking
If you have modified your skin (e.g. monobook), this script will likely fail. Upgrade or update your MediaWiki installation, replace any "hacked" skins, then retry.
Extensions compatibility
For the same reason, some extensions that modify output, such as Extension:SyntaxHighlight_GeSHi, are not compatible with DumpHTML.
If you use InstantCommons
If you run the dump on a custom MediaWiki installation that uses InstantCommons, the script assumes that your image files are in the images/wikimediacommons folder of the destination directory.
Thus, if you encounter a message such as:
Warning: file_put_contents(/tmp/wiki/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg): failed to open stream: No such file or directory in [...]/w/extensions/DumpHTML/dumpHTML.inc on line 1377
you have to download http://upload.wikimedia.org/wikipedia/commons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg to /tmp/yourdump/images/wikimediacommons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg (where /tmp/yourdump stands for your destination directory) and restart the dump operation.
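The manual fix above can be sketched as a small script. It derives the target path from the Commons URL, then prints the mkdir and curl commands needed to create the missing directory and fetch the file. The function name commons_target and the variable names are hypothetical, /tmp/yourdump is the placeholder destination directory from the text, and the script assumes curl is installed. It is a dry run: nothing is created or downloaded until the printed commands are executed.

```shell
#!/bin/sh
DUMP_DIR=/tmp/yourdump   # placeholder: your dump destination directory
URL=http://upload.wikimedia.org/wikipedia/commons/7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg

# Hypothetical helper: map a Commons upload URL ($1) to its expected
# location under the dump directory ($2).
commons_target() {
  rel=${1#*/wikipedia/commons/}   # e.g. 7/75/Live_studio_op_de_Mi_Amigo_kleine.jpg
  printf '%s/images/wikimediacommons/%s\n' "$2" "$rel"
}

TARGET=$(commons_target "$URL" "$DUMP_DIR")

# Dry run: print the commands instead of executing them.
echo "mkdir -p \"$(dirname "$TARGET")\""
echo "curl -fL -o \"$TARGET\" \"$URL\""
```

After running the printed commands, restart the dump operation as described above.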
Static Wikipedia
See http://dumps.wikimedia.org/ and for example http://dumps.wikimedia.org/other/static_html_dumps/ for static snapshot examples. The last HTML dumps there were generated in 2008 (bug 15017).