Extension talk:DumpHTML
I'm trying to get this working on a site with MediaWiki version 1.20. I'm experiencing an issue where the stylesheet references (in the <head> tag) always point to a localhost path under /main/load.php instead of a relative path to /skins/monobook/. Any ideas? All of the URL links are working just fine. I'm currently using the dd18651 Git snapshot (which is tagged for 1.20).
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="http://localhost/mysite/main/index.php?title=Wiki_Home&action=edit" /> <link rel="edit" title="Edit this page" href="http://localhost/mysite/main/index.php?title=Wiki_Home&action=edit" /> <link rel="shortcut icon" href="./misc/favicon.ico" /> <link rel="search" type="application/opensearchdescription+xml" href="../opensearch_desc.php" title="Wiki (en)" /> <link rel="EditURI" type="application/rsd+xml" href="api.php?action=rsd" /> <link rel="alternate" type="application/atom+xml" title="Wiki Atom feed" href="http://localhost/mysite/main/index.php?title=Special:RecentChanges&feed=atom" /> <link rel="stylesheet" href="http://localhost/mysite/main/load.php?debug=false&lang=en&modules=mediawiki.legacy.commonPrint%2Cshared%7Cskins.monobook%2Cmonobook&only=styles&skin=monobook&*" /> <!--[if IE 6]><link rel="stylesheet" href="./skins/monobook/IE60Fixes.css?47" media="screen" /><![endif]--> <!--[if IE 7]><link rel="stylesheet" href="./skins/monobook/IE70Fixes.css?47" media="screen" /><![endif]--><meta name="ResourceLoaderDynamicStyles" content="" /> <link rel="stylesheet" href="http://localhost/mysite/main/load.php?debug=false&lang=en&modules=site&only=styles&skin=monobook&*" /> <style>a:lang(ar),a:lang(ckb),a:lang(fa),a:lang(kk-arab),a:lang(mzn),a:lang(ps),a:lang(ur){text-decoration:none}.editsection{display:none} </style> <script src="http://localhost/mysite/main/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=monobook&*"></script>
When I run the script I get a wall of errors when it tries to copy files.
CMD:
C:\Users\WikiUser>D:\xampp\php\php.exe D:\xampp\htdocs\wiki\extensions\DumpHTML\dumpHTML.php -d C:\Users\01SIN382\Desktop\wikidump --image-snapshot --force-copy --group=user --show-titles --munge-title windows
Error Message:
Warning: unable to copy D:\xampp\htdocs\wiki/images/3/3f/VMware-setup6.png to C:\Users\WikiUser\Desktop\wikidump/images/3/_/3/3_3f_VMware-setup6.png
Warning: call_user_func_array() expects parameter 1 to be a valid callback, class 'LocalRepo' does not have a method 'getLocalCopy' in D:\xampp\htdocs\wiki\extensions\DumpHTML\dumpHTML.inc on line 1169
I hope someone can help me =)
Greets Greg
Sorry, can't help you here. It works on Linux (see http://misc.j-crew.de/wiki-dump/current/ for a working example). For reference, I use only the -d and the --image-snapshot option. I fear you'll have to debug this yourself (though I should warn that the whole file access code is quite difficult to follow). Patches to fix this (and other) issues greatly appreciated.
Actually, I'm trying this on Linux and it is NOT working. I'm using the command below:
php /usr/local/mediawiki-1.18.2/extensions/wikimedia-mediawiki-extensions-DumpHTML-dd18651/dumpHTML.php -d /home/mtz/HTMLDump --image-snapshot --force-copy --group=mygroup
The images are NOT copied; everything else works. Interestingly, when I tried the script against a fresh MediaWiki installation, it works (images and all). Maybe the script works only for small wiki installations?
Sounds like you're using an old version of the DumpHTML extension. --force-copy was removed in February 2012, which fixed a bug where no images were copied (see [1]). Please make sure you use the most current version from Git.
Still not working, dude. I downloaded the master branch from Git, but the errors are still the same, such as:
Warning: unable to copy /usr/local/mediawiki-1.18.2/images/3/34/Ug-download-ms.png to /home/mtz/HTMLDump/images/3/2F/3/3/34/Ug-download-ms.png
and
PHP Warning: call_user_func_array() expects parameter 1 to be a valid callback, class 'LocalRepo' does not have a method 'getLocalCopy' in /usr/local/mediawiki-1.18.2/extensions/DumpHTML/dumpHTML.inc on line 1169
Note: as you can see from the message above, my installation is MediaWiki 1.18.2. What branch of DumpHTML should I use with this version? As I said, I tested the master branch.
I removed the --force-copy option from my command as you suggested. So what I run now is this:
php /usr/local/mediawiki-1.18.2/extensions/DumpHTML/dumpHTML.php -d /home/mtz/HTMLDump --image-snapshot --group=myGroup
I'm running Debian GNU/Linux 6.0.
Hi. I'm storing my images on commons. I'm trying to run this extension, but it stops after a few pages and I cannot figure out why. Maybe not finding the images on the local filesystem makes it crash?
When I export with the DumpHTML extension, it always uses the offline skin, and it's really ugly. How can I make my export use the Monobook skin? Thank you.
The tool indeed seemingly ignores the -k switch. Does anyone know when it broke?
It should (mostly) work when using MediaWiki 1.21 (the latest stable version) and the latest DumpHTML from Git. See http://misc.j-crew.de/wiki-dump/current/ for an example. Some things are still broken though, like PDF thumbnails and images from Commons.
When I try to run the script, I always get the error message "default users are not allowed to read, please specify (--group=sysop)". I also tried it with this option, but then I get the error message "the specified user group is not allowed to read". Any ideas? :)
I had to look into the database of my MediaWiki installation to find out the user group (check LocalSettings.php and search for "Database settings" for the connection details). Connect to the database using a client and look for a table named user_groups; it maps user IDs to group names. Try different user group names until it works. :p
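For what it's worth, it may be quicker to list the group names directly than to guess. A minimal sketch, assuming a MySQL backend, no table prefix, and placeholder database/user names taken from LocalSettings.php:
# ug_group holds the group names, ug_user the user IDs they are assigned to:
mysql -u wikiuser -p wikidb -e "SELECT DISTINCT ug_group FROM user_groups;"
Any name it returns is a candidate for --group=.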
Is there an alternative to DumpHTML? I'm finding this waaaay too complicated to get working.
I'm using Windows 7 and I'm getting lots of "Failed opening required .../maintenance/commandLine.inc" type errors. It also says "No such file or directory in ... \dumpHTML on line 63". Line 63 of the script has "require_once( $IP."/maintenance/commandLine.inc" );". Do I have to change something here? What is $IP? How do I set it?
I'm using this command:
php C:\wamp\www\mediawiki-1.21.1\extensions\wikimedia-mediawiki-extensions-DumpHTML-dd18651\wikimedia-mediawiki-extensions-DumpHTML-dd18651/dumpHTML.php -d C:\HTMLDump -k monobook --image-snapshot --force-copy
And my setup is this: Apache 2.2.22, MySQL 5.5.24, PHP 5.3.13, XDebug 2.1.2, XDC 1.5, phpMyAdmin 3.4.10.1, SQLBuddy 1.3.3, webGrind 1.0.
Got it working after all. My dumpHTML.php was two folders deep (\wikimedia-mediawiki-extensions-DumpHTML-dd18651\wikimedia-mediawiki-extensions-DumpHTML-dd18651/). I had to cut one level so I could execute the command below:
php C:\wamp\www\mediawiki-1.21.1\extensions\wikimedia-mediawiki-extensions-DumpHTML-dd18651/dumpHTML.php -d C:\HTMLDump -k monobook --image-snapshot --force-copy
Hello :)
We searched and found many people with the same error, but no solutions... Perhaps someone can help us...
We get the following error when we try to run dumpHTML.php from the shell (different working folders, same error):
Fatal error: require_once() [<a href='function.require'>function.require</a>]: Failed opening required '__DIR__/Maintenance.php' (include_path='.:/usr/local/lib/php') in /.../mediawiki/maintenance/commandLine.inc on line 24
kind regards, markus h.
Try the latest versions of DumpHTML and MediaWiki, and run DumpHTML as an extension (not from the maintenance directory).
Thank you for your answer, but that was not the problem.
Our PHP version was too low (5.2.17). We updated PHP to 5.3.10 and now the script runs...
BUT: it only creates the index.html, and all the links lead back to the online version, which is of course nonsense for an offline copy ;) Does anyone know how we can force DumpHTML to dump ALL pages as HTML files and use relative links?
kind regards, markus h.
I have noticed two problems with the DumpHTML extension in combination with the latest version of MediaWiki. MediaWiki version: 1.20.3. Version of DumpHTML: rev 115794 from the MediaWiki SVN.
The first problem is that if you use subpages ( /s in URLs ), the links are broken in the static output: the generated pages have the incorrect number of ".." segments in their relative links. This might be resolved by setting --munge-title to md5, to keep the extra slashes out of the filenames, but I didn't try that (and I don't want to; I want the titles left alone, not converted to a hash).
How come there is no way to output pages in the same structure as they exist originally? Why are you -forced- to use the new three-folder-deep "hashing" mechanism? I assume this is to prevent too many files from being dumped into the same folder, but there should be a way to switch this off, and the code should be fixed to allow subpages.
The second problem is that --image-snapshot seems unable to handle PNG images. The main icon in the upper left of my wiki is a PNG. A file for it is created in the dump, but it is a 0-byte file; the original file is not copied. It seems the images are not copied directly, and that DumpHTML cannot handle PNGs.
If you have a problem with images, try running with "sudo php dumpHTML.php ..." (yes, you can kill me now). DON'T TRY TO USE IT ON A PRODUCTION SERVER, IT'S DANGEROUS! The problem seems to come from http://www.mediawiki.org/wiki/Manual:Image_Authorization, but at least with sudo it works as is.
Is there any way to generate an export/dump that uses the exact same URLs for the dumped content as were used for the normal wiki? Or at least have it write a .htaccess file which redirects all the old URLs to their new locations?
These are a few points that are not so obvious when using DumpHTML for the first time:
- First of all, you need to download the latest version: https://github.com/wikimedia/mediawiki-extensions-DumpHTML/archive/master.zip
- DumpHTML gets its data from two sources: (1) the wiki database and (2) HTTP calls to the web server. This means the web server must be running while dumping to HTML. If not, at least the logo image and favicon will be created as zero-byte files in the static HTML dump.
- Windows users must use '--munge-title windows' (without the single quotes), otherwise the names of some HTML files will be truncated and their contents will be empty. This applies to articles whose titles include illegal Windows filename characters like /\*?"<>|. The --munge-title option makes sure that these characters do not end up in the destination filenames (see the example invocation after this list).
- If you really want an offline static HTML dump, let DumpHTML use the default '-k offline' skin (or omit the -k switch, which is the same). When using a named skin other than 'offline', references to the live wiki will be included in the dump. In that case, the live wiki must be available while viewing the HTML dump, otherwise the skin will not load. The '-k offline' switch removes these references and uses a monobook-like offline skin instead.
- Make sure that the destination folder for the dump does not exist yet. If it already exists, some subfolders won't be created, skins won't be copied, and the offline skin won't be available in the static HTML dump.
- --force-copy apparently does not do anything
- --show-titles is not documented
- DumpHTML creates a dumpHTML.version file in the destination folder. This file holds the version number of DumpHTML. It reads 2.0 but should really be 1.20
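Pulling these points together, a sketch of a typical invocation on Windows (paths and the group name are placeholders; run from the MediaWiki root, with the web server up and a destination folder that does not exist yet):
rem Dump the whole wiki with the default offline skin, munging titles into Windows-safe filenames:
php extensions\DumpHTML\dumpHTML.php -d C:\HTMLDump --image-snapshot --munge-title windows --group=user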
Using --munge-title windows or any other options I get this error:
Unexpected non-MediaWiki exception encountered, of type "Exception"
exception 'Exception' with message 'no such titlemunger exists: 1' in /dir/w/extensions/DumpHTML/MungeTitle.inc:18
Stack trace:
#0 /dir/w/extensions/DumpHTML/dumpHTML.inc(92): MungeTitle->__construct(1)
#1 /dir/w/extensions/DumpHTML/dumpHTML.php(132): DumpHTML->__construct(Array)
#2 {main}
Hi, it appears that this is a newer parameter, and you apparently need to check out and run the latest version from svn.
From https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DumpHTML/dumpHTML.php?view=log I learned that the new parameter must be called as
--munge-title <HOW> available munging algorithms: none, md5, windows
Please make sure to use the recent version (I did not check it, and it is not code reviewed), or indicate here the exact version of the extension you run.
HTTP ERROR: 404
Problem accessing /r/gitweb. Reason: Not Found
Powered by Jetty://
Any idea where I can find this??
It appears that starting with MediaWiki 1.17.0, DumpHTML no longer downloads the relevant CSS/JavaScript bits. This appears to be a result of the ResourceLoader module, which serves them via load.php. This module, and the whole ordeal, is documented in the MediaWiki 1.17.0 release notes.
Even tools like "wget -r" won't solve this problem. So given this change, what exactly are people doing for creating HTML archives of Wiki pages for offline or Intranet use?
My characters, mostly diacritics, appear mangled in the dumped HTML files.
Examples:
- Saṃyutta Nikāya -> Sa峁儁utta Nik膩ya
- … -> 鈥�
- Soṇadaṇḍa Sutta -> So峁嘺da峁囜笉a
Does anyone have similar issue?
Thanks.
When there is an umlaut in the filename of an image, the image is saved in the dump but under a wrong name, so the link in the HTML does not work. Does somebody know how to fix this?
I have a MW 1.16.0 with the Lockdown extension installed, so a username and password are needed to view it. It is a Windows system, so I used the modified version of DumpHTML that produces the hashed filenames. How can I provide a username and password to DumpHTML? On first use, DumpHTML asked me to provide a --group parameter, and I did. DumpHTML then produced a lot of stuff, but I can't log in to the index page of the static wiki. Furthermore, some (many, most) pages and their paths are missing in the static wiki. The page "login required" ("Anmeldung erforderlich" in German) exists multiple, multiple times; every time a page is shown it is this one, each under another path and filename. I tried to give 'read' permission to * in LocalSettings.php, but then the extension produces a static wiki without any styles or pictures... and even many linked pages and paths do not exist.
Get the version for MW 1.16 - there is a --group parameter. Use the group "user" (sysop doesn't work for me).
"If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and if your wiki pages or files had non-ASCII characters, which is likely, then you probably need to change the link references,"
Should this be "if your wiki page titles or filenames had non-ASCII characters"? If correct, it would be much clearer.
If the error is
DB connection error: No database connection
It may be a problem logging into the database. Looking in the PostgreSQL logs revealed that I had to adapt pg_hba.conf.
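For reference, the kind of entry that may need adding is sketched below; the database and user names are placeholders, and the address/method depend on how MediaWiki connects:
# TYPE  DATABASE  USER      ADDRESS        METHOD
host    wikidb    wikiuser  127.0.0.1/32   md5
Reload PostgreSQL after editing pg_hba.conf so the change takes effect.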
This is the error message I get when I try to execute the dumpHTML.php file on my local machine. Does anybody know a fix for that?
I have a similar problem which I can't solve:
DB connection error: A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond. (localhost)
- MediaWiki 13.3, Win7, MoWeS web server; I just installed PHP 5.3 without web server support to execute dumpHTML.php
- I can trace the problem to includes/db/LoadBalancer.php, function reallyOpenConnection(): the call $db = new $class( $host, $user, $password, $dbname, 1, $flags ); just times out
- The database class is DatabaseMysql (I can't find the source); host, database name, user and password are correct
- Any help?