Extension talk:DumpHTML

From MediaWiki.org
Archives: /Archive 1

Contents

Thread title | Replies | Last modified
Localhost stylesheets | 0 | 15:08, 7 November 2013
Only executing dumpHTML on specific categories AND their subcategories | 0 | 18:55, 12 July 2013
Error: Unable to copy file | 7 | 08:59, 29 June 2013
Does it support ForeignFileRepos? | 1 | 05:59, 16 June 2013
How to export the Monobook skin to the HTML dump? | 3 | 05:54, 16 June 2013
Error | 2 | 18:41, 13 June 2013
Alternative? | 1 | 12:24, 13 June 2013
Fatal error running DumpHTML from the shell | 2 | 13:29, 21 May 2013
Subpages are not exported correctly - PNGs are copied as 0-byte files | 2 | 08:44, 16 May 2013
Keep URLs / create .htaccess with redirects? | 0 | 21:48, 14 March 2013
Gotchas when using DumpHTML | 0 | 14:34, 11 February 2013
[SOLVED] munge-title error | 3 | 09:38, 29 July 2012
[SOLVED] no git? | 2 | 08:51, 28 July 2012
Broken since MediaWiki 1.17.0? | 2 | 17:23, 5 June 2012
Unicode diacritic characters in dumped HTML | 0 | 10:37, 10 January 2012
Bug with German umlauts in filenames of images | 0 | 10:37, 10 January 2012
How to provide login data | 1 | 10:36, 10 January 2012
Minor: pages vs. page titles | 0 | 10:34, 10 January 2012
PostgreSQL and MediaWiki 1.12.0 | 2 | 10:34, 10 January 2012

Localhost stylesheets

I'm trying to get this working on a site running MediaWiki 1.20. I'm seeing an issue where the stylesheet references (in the <head> tag) always point to a localhost path under /main/load.php instead of a relative path under /skins/monobook/. Any ideas? All of the URL links work just fine. I'm currently using the dd18651 Git snapshot (which is tagged for 1.20).

<link rel="alternate" type="application/x-wiki" title="Edit this page" href="http://localhost/mysite/main/index.php?title=Wiki_Home&amp;action=edit" />
<link rel="edit" title="Edit this page" href="http://localhost/mysite/main/index.php?title=Wiki_Home&amp;action=edit" />
<link rel="shortcut icon" href="./misc/favicon.ico" />
<link rel="search" type="application/opensearchdescription+xml" href="../opensearch_desc.php" title="Wiki (en)" />
<link rel="EditURI" type="application/rsd+xml" href="api.php?action=rsd" />
<link rel="alternate" type="application/atom+xml" title="Wiki Atom feed" href="http://localhost/mysite/main/index.php?title=Special:RecentChanges&amp;feed=atom" />
<link rel="stylesheet" href="http://localhost/mysite/main/load.php?debug=false&amp;lang=en&amp;modules=mediawiki.legacy.commonPrint%2Cshared%7Cskins.monobook%2Cmonobook&amp;only=styles&amp;skin=monobook&amp;*" />
<!--[if IE 6]><link rel="stylesheet" href="./skins/monobook/IE60Fixes.css?47" media="screen" /><![endif]-->
<!--[if IE 7]><link rel="stylesheet" href="./skins/monobook/IE70Fixes.css?47" media="screen" /><![endif]--><meta name="ResourceLoaderDynamicStyles" content="" />
<link rel="stylesheet" href="http://localhost/mysite/main/load.php?debug=false&amp;lang=en&amp;modules=site&amp;only=styles&amp;skin=monobook&amp;*" />
<style>a:lang(ar),a:lang(ckb),a:lang(fa),a:lang(kk-arab),a:lang(mzn),a:lang(ps),a:lang(ur){text-decoration:none}.editsection{display:none}
</style>
<script src="http://localhost/mysite/main/load.php?debug=false&amp;lang=en&amp;modules=startup&amp;only=scripts&amp;skin=monobook&amp;*"></script>
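In case it helps anyone later: since the dump is otherwise fine, the absolute load.php references could also be rewritten after the fact. A rough post-processing sketch (untested; the dump path, URL prefix and replacement stylesheet path are illustrative assumptions, and a real script would need to adjust the relative path per directory depth):

<?php
// Walk the dump tree and point load.php stylesheet URLs at a local CSS file.
$it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator( 'C:/wikidump' )
);
foreach ( $it as $file ) {
    $path = (string)$file;
    if ( substr( $path, -5 ) !== '.html' ) {
        continue;
    }
    $html = file_get_contents( $path );
    // Replace any localhost load.php URL with a relative stylesheet path.
    $html = preg_replace(
        '!http://localhost/mysite/main/load\.php\?[^"]*!',
        './skins/monobook/main.css',
        $html
    );
    file_put_contents( $path, $html );
}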
lziobro (talk) 15:08, 7 November 2013

Only executing dumpHTML on specific categories AND their subcategories

Is there a way to tell dumpHTML to convert a specific category along with its subcategories? I don't want to use --categories, because that converts ALL of the wiki's categories. I just want to convert one specific category along with ALL of its subcategories.
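As far as I can tell there is no built-in switch for this, but the title list can be collected by hand. An untested sketch (assumes direct MySQL access and the standard page/categorylinks schema; namespace 14 is NS_CATEGORY; how to feed the collected titles back into dumpHTML is left open):

<?php
// Recursively collect the page titles in a category and all its subcategories.
function collectTitles( mysqli $db, $cat, array &$titles, array &$seen ) {
    if ( isset( $seen[$cat] ) ) {
        return; // guard against category cycles
    }
    $seen[$cat] = true;
    $res = $db->query(
        "SELECT page_namespace, page_title FROM page " .
        "JOIN categorylinks ON cl_from = page_id " .
        "WHERE cl_to = '" . $db->real_escape_string( $cat ) . "'"
    );
    while ( $row = $res->fetch_assoc() ) {
        if ( (int)$row['page_namespace'] === 14 ) { // subcategory: recurse
            collectTitles( $db, $row['page_title'], $titles, $seen );
        } else {
            $titles[] = $row['page_title'];
        }
    }
}

$db = new mysqli( 'localhost', 'wikiuser', 'password', 'wikidb' );
$titles = array();
$seen = array();
collectTitles( $db, 'My_Category', $titles, $seen );
echo implode( "\n", $titles ), "\n";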

Zc5 (talk) 18:55, 12 July 2013

Error: Unable to copy file

When I run the script I get a wall of errors when it tries to copy files.
CMD:
C:\Users\WikiUser>D:\xampp\php\php.exe D:\xampp\htdocs\wiki\extensions\DumpHTML\dumpHTML.php -d C:\Users\01SIN382\Desktop\wikidump --image-snapshot --force-copy --group=user --show-titles --munge-title windows
Error Message:
Warning: unable to copy D:\xampp\htdocs\wiki/images/3/3f/VMware-setup6.png to C:\Users\WikiUser\Desktop\wikidump/images/3/_/3/3_3f_VMware-setup6.png
Warning: call_user_func_array() expects parameter 1 to be a valid callback, class 'LocalRepo' does not have a method 'getLocalCopy' in D:\xampp\htdocs\wiki\extensions\DumpHTML\dumpHTML.inc on line 1169

I hope someone can help me =)
Greets Greg

80.121.159.197 13:22, 1 March 2013

Did you find a solution to this issue?

193.9.13.135 13:38, 13 March 2013
 

Having the same problem here! :(

187.78.14.203 18:56, 13 June 2013
 

Sorry, can't help you here. It works on Linux (see http://misc.j-crew.de/wiki-dump/current/ for a working example). For reference, I use only the -d and --image-snapshot options. I fear you'll have to debug this yourself (though I should warn that the whole file-access code is quite difficult to follow). Patches to fix this (and other) issues are greatly appreciated.

Tbleher (talk) 05:58, 16 June 2013

Actually, I'm trying this on Linux and it is NOT working. I'm using the command below:

php /usr/local/mediawiki-1.18.2/extensions/wikimedia-mediawiki-extensions-DumpHTML-dd18651/dumpHTML.php -d /home/mtz/HTMLDump --image-snapshot --force-copy --group=mygroup

The images are NOT copied. Everything else works, except for that. Interestingly, when I tried testing the script with a new MediaWiki installation, it worked (images and all). Maybe the script only works for small wiki installations?

187.78.14.203 13:42, 18 June 2013

Sounds like you're using an old version of the DumpHTML extension. --force-copy was removed in February 2012, in a change that fixed a bug where no images were copied (see [1]). Please make sure you use the most current version from Git.
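For reference, cloning from Gerrit should look roughly like this (the exact URL is from memory and may have changed, so adjust as needed):

 git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/DumpHTML.git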

Tbleher (talk) 13:52, 23 June 2013

Still not working, dude. I downloaded the master branch from Git but the errors are still the same, such as:

Warning: unable to copy /usr/local/mediawiki-1.18.2/images/3/34/Ug-download-ms.png to /home/mtz/HTMLDump/images/3/2F/3/3/34/Ug-download-ms.png

and

PHP Warning: call_user_func_array() expects parameter 1 to be a valid callback, class 'LocalRepo' does not have a method 'getLocalCopy' in /usr/local/mediawiki-1.18.2/extensions/DumpHTML/dumpHTML.inc on line 1169

Note: as you can see from the message above, my installation is MediaWiki 1.18.2. Which branch of DumpHTML should I use with this version? As I said, I tested the master branch.

I removed the --force-copy option from my command as you suggested. So what I run now is this:

php /usr/local/mediawiki-1.18.2/extensions/DumpHTML/dumpHTML.php -d /home/mtz/HTMLDump --image-snapshot --group=myGroup

I'm running Debian GNU/Linux 6.0.

187.78.14.203 12:36, 28 June 2013

I don't know which version of DumpHTML (if any) will work with MediaWiki 1.18.2. Your best bet is to try it with the current MediaWiki version (1.21). If it still doesn't work with 1.21, you'll probably have to debug it yourself. Patches to fix any issue are of course highly appreciated!
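If someone wants to take a stab at it, one possible starting point (untested, and only an assumption about how the file-repo API moved between MediaWiki versions) would be to guard the failing callback in dumpHTML.inc around line 1169:

// Sketch: check the callback before invoking it, so a missing method
// fails visibly instead of producing a call_user_func_array() warning.
// $repo/$args are illustrative names for whatever the surrounding code uses.
$callback = array( $repo, 'getLocalCopy' );
if ( !is_callable( $callback ) ) {
    // Newer MediaWiki versions may expose a local path via
    // File::getLocalRefPath() instead (assumption; check your filerepo code).
    wfDebug( __METHOD__ . ": repo has no getLocalCopy, skipping file\n" );
    return false;
}
$local = call_user_func_array( $callback, $args );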

Tbleher (talk) 08:59, 29 June 2013
 
 
 
 
 

Does it support ForeignFileRepos?

Hi. I'm storing my images on Commons. I'm trying to run this extension, but it stops after a few pages and I cannot figure out why. Maybe not finding the images on the local filesystem makes it crash?

88.187.236.232 08:47, 20 January 2012

It shouldn't abort (at least not in the current version) but it definitely doesn't import images from ForeignRepos.

Tbleher (talk) 05:59, 16 June 2013
 

How to export the Monobook skin to the HTML dump?

When I export with the DumpHTML extension it always uses the offline skin, and it's really ugly. How can I make my export use the Monobook skin? Thank you.

—The preceding unsigned comment was added by an unknown user on an unknown date. 10:35, 10 January 2012

The tool does indeed seem to ignore the -k switch. Does anyone know when it broke?

134.130.21.85 13:13, 10 April 2012

It seems the stylesheet URL is broken.

217.92.109.132 17:13, 29 November 2012
 

It should (mostly) work when using MediaWiki 1.21 (the latest stable version) and the latest DumpHTML from Git. See http://misc.j-crew.de/wiki-dump/current/ for an example. Some things are still broken though, like PDF thumbnails and images from Commons.

Tbleher (talk) 05:54, 16 June 2013
 

Error

When I try to run the script, I always get the error message "default users are not allowed to read, please specify (--group=sysop)". I also tried it with that option, but then I get the error message "the specified user group is not allowed to read". Any ideas? :)

—The preceding unsigned comment was added by an unknown user on an unknown date. 10:36, 10 January 2012

The group "user" works for me.

--212.114.205.190 11:49, 29 September 2011 (UTC) 10:36, 10 January 2012
 

I had to look into the database of my MediaWiki installation to find the user group (check LocalSettings.php and search for "Database settings" for the connection details). Connect to the database using a client and look for a table named user_groups; it maps user IDs to group names. Try the different group names you find there until one works. :p
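For example, a query like this should list the candidate group names directly (assuming the standard MediaWiki schema):

 SELECT DISTINCT ug_group FROM user_groups;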

187.78.14.203 18:41, 13 June 2013
 

Alternative?

Is there an alternative to DumpHTML? I'm finding it waaaay too complicated to get working.

I'm using Windows 7 and I'm getting lots of "Failed opening required .../maintenance/commandLine.inc" type errors. It also says "No such file or directory in ... \dumpHTML on line 63". Line 63 of the script has "require_once( $IP."/maintenance/commandLine.inc" );". Do I have to change something here? What is $IP? How do I set it?

I'm using this command:

php C:\wamp\www\mediawiki-1.21.1\extensions\wikimedia-mediawiki-extensions-DumpHTML-dd18651\wikimedia-mediawiki-extensions-DumpHTML-dd18651/dumpHTML.php -d C:\HTMLDump -k monobook --image-snapshot --force-copy

And my setup is this: Apache 2.2.22 – Mysql 5.5.24 – PHP 5.3.13 XDebug 2.1.2 XDC 1.5 PhpMyadmin 3.4.10.1 SQLBuddy 1.3.3 webGrind 1.0

187.78.14.203 12:05, 13 June 2013

Got it working after all. My dumpHTML.php was two folders deep (\wikimedia-mediawiki-extensions-DumpHTML-dd18651\wikimedia-mediawiki-extensions-DumpHTML-dd18651/). I had to cut one level so that I could execute the command below:

php C:\wamp\www\mediawiki-1.21.1\extensions\wikimedia-mediawiki-extensions-DumpHTML-dd18651/dumpHTML.php -d C:\HTMLDump -k monobook --image-snapshot --force-copy

187.78.14.203 12:24, 13 June 2013
 

Fatal error running DumpHTML from the shell

Hello :)

We searched and found many people with the same error, but no solutions... perhaps someone can help us.

We get the following error when we try to run dumpHTML.php from the shell (different working folders, same error):

Fatal error: require_once() [function.require]: Failed opening required '__DIR__/Maintenance.php' (include_path='.:/usr/local/lib/php') in /.../mediawiki/maintenance/commandLine.inc on line 24

kind regards, markus h.

91.57.77.46 12:11, 17 May 2013

Try the latest version of DumpHTML and MediaWiki, and run DumpHTML as an extension (not from the maintenance directory).

Kelson (talk) 11:07, 18 May 2013

Thank you for your answer, but that was not the problem.

Our PHP version was too low (5.2.17). We updated PHP to 5.3.10 and now the script runs... (The __DIR__ magic constant that commandLine.inc relies on only exists since PHP 5.3, which is why older PHP treated it as the literal string '__DIR__'.)

BUT: it only creates the index.html, and all the links lead back to the online version, which is of course nonsense for an offline copy ;) Does anyone know how we can force DumpHTML to download ALL pages as HTML files and make the links relative?

kind regards, markus h.

91.57.89.200 13:29, 21 May 2013
 
 

Subpages are not exported correctly - PNGs are copied as 0-byte files

I have noticed two problems with the DumpHTML extension in combination with the latest version of MediaWiki. MediaWiki version: 1.20.3. DumpHTML version: rev 115794 from the MediaWiki SVN.

The first problem is that if you use subpages (slashes in URLs), the links are broken in the static output: the generated pages have the wrong number of ".." components in their relative links. This might be avoided by setting --munge-title to md5, which keeps the extra slashes out of the dumped filenames, but I didn't try that (and I don't want to; I want the titles left alone, not converted to a hash).

Why is there no way to output pages in the same structure as they originally exist? Why are you -forced- to use the new three-folder-deep "hashing" mechanism? I assume this is to prevent too many files from being dumped into the same folder, but there should be a way to turn it off, and the code should be fixed to allow subpages.

The second problem is that --image-snapshot seems unable to handle PNG images. The main icon in the upper left of my wiki is a PNG. A file for it is created in the dump, but it is a 0-byte file; the original file is never copied. It seems the images are not copied directly, and that DumpHTML cannot handle PNGs.

Livxtrm (talk) 19:50, 25 March 2013

If you have a problem with images, try running with "sudo php dumpHTML.php ..." (yes, you can kill me now). DON'T TRY TO USE IT ON A PRODUCTION SERVER, IT'S DANGEROUS! The problem seems to come from http://www.mediawiki.org/wiki/Manual:Image_Authorization, but at least with sudo it works as is.

83.167.103.98 22:11, 25 April 2013

Running dumpHTML.php with root permissions doesn't seem to solve the problem (at least for me).

141.5.11.5 08:44, 16 May 2013
 
 

Keep URLs / create .htaccess with redirects?

Is there any way to generate an export/dump that uses exactly the same URLs for the dumped content as the normal wiki did? Or at least have it write a .htaccess file which redirects all the old URLs to their new locations?
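In case it helps, a hand-written redirect rule along these lines might be a starting point (untested sketch; assumes Apache mod_rewrite, the dump being served under /dump/, and article filenames that match the original titles, i.e. no munging):

 RewriteEngine On
 # Send index.php?title=Some_Page to the corresponding static file.
 RewriteCond %{QUERY_STRING} ^title=([^&]+)$
 RewriteRule ^index\.php$ /dump/articles/%1.html? [R=301,L]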

84.168.47.242 21:48, 14 March 2013

Gotchas when using DumpHTML

These are a few points that are not so obvious when using DumpHTML for the first time (an example invocation follows the list):

  • DumpHTML gets its data from two sources: (1) the wiki database and (2) HTTP calls to the web server. This means that the web server must be running while dumping to HTML. If it is not, at least the logo image and favicon will be created as zero-byte files in the static HTML dump.
  • Windows users must use '--munge-title windows' (without the single quotes), otherwise names of some html files will be truncated and the contents will be empty. This is true for articles whose title include illegal Windows filename characters like /\*?"<>|. The --munge-title option makes sure that these characters are not present in the destination filenames.
  • If you really want an offline static html dump, let DumpHTML use the default '-k offline' skin (or omit the -k switch, which is the same). When using a named skin other than 'offline', references to the live wiki will be included in the dump. In that case, the live wiki must be available while viewing the html dump, otherwise the skin will not load. The '-k offline' switch removes these references and uses a monobook-like offline skin instead.
  • Make sure that the destination folder for the dump does not exist. If it already exists, some subfolders won't be created, skins won't be copied and the offline skin won't be available in the static html dump.
  • --force-copy apparently does not do anything
  • --show-titles is not documented
  • DumpHTML creates a dumpHTML.version file in the destination folder. This file holds the version number of DumpHTML. It reads 2.0 but should really be 1.20
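Putting these points together, a typical first run on Windows might look like this (paths are illustrative; remember that the destination folder must not exist yet):

 php C:\wamp\www\mediawiki\extensions\DumpHTML\dumpHTML.php -d C:\HTMLDump --image-snapshot --munge-title windows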
195.62.68.244 14:34, 11 February 2013

[SOLVED] munge-title error

Using --munge-title windows (or any of the other options) I get this error:

Unexpected non-MediaWiki exception encountered, of type "Exception":
exception 'Exception' with message 'no such titlemunger exists: 1' in /dir/w/extensions/DumpHTML/MungeTitle.inc:18
Stack trace:
#0 /dir/w/extensions/DumpHTML/dumpHTML.inc(92): MungeTitle->__construct(1)
#1 /dir/w/extensions/DumpHTML/dumpHTML.php(132): DumpHTML->__construct(Array)
#2 {main}
Aditaa (talk) 08:47, 28 July 2012

Hi, it appears that this is a newer parameter; you apparently need to check out and run the latest version from SVN.

From https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DumpHTML/dumpHTML.php?view=log I learned that the new parameter must be called as

 --munge-title <HOW> available munging algorithms: none, md5, windows

Please make sure to use the most recent version (I did not check it, and it is not code-reviewed), or indicate here the exact version of the extension you run.

Wikinaut (talk) 08:58, 28 July 2012

Yes, I am using the new code from SVN:

 DumpHTML]# svn up
 At revision 115628.

and I have tried it with all of the options, and it returns the same result every time.

Aditaa (talk) 09:03, 28 July 2012

The argument processing has been fixed in r115629.

Aditaa (talk) 09:38, 29 July 2012
 
 
 

[SOLVED] no git?

URL: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/DumpHTML.git;a=snapshot;h=refs/heads/master;sf=tgz

HTTP ERROR: 404
 
Problem accessing /r/gitweb. Reason:
 
    Not Found
 
Powered by Jetty://

Any idea where I can find this?

Aditaa (talk) 11:42, 27 July 2012

It's in SVN unfortunately. I've changed the links.

Krenair (talk | contribs) 12:37, 27 July 2012

Thanks

Aditaa (talk) 12:38, 27 July 2012
 
 

Broken since MediaWiki 1.17.0?

It appears that starting with MediaWiki 1.17.0, DumpHTML no longer downloads the relevant CSS/JavaScript bits. This appears to be a result of the ResourceLoader module, which serves them via load.php. This change is documented in the MediaWiki 1.17.0 release notes.

Even tools like "wget -r" won't solve this problem. So given this change, what exactly are people doing to create HTML archives of wiki pages for offline or intranet use?
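For what it's worth, plain "wget -r" does not fetch the load.php stylesheets because recursion alone does not grab page requisites. Something along these lines might get further, though I have not verified it against a ResourceLoader wiki:

 wget --mirror --page-requisites --convert-links --adjust-extension http://wiki.example.org/index.php/Main_Page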

—The preceding unsigned comment was added by an unknown user on an unknown date. -- 23:00, 9 January 2012 (UTC) 10:38, 10 January 2012

That would be very interesting for me too!!

62.214.112.35 15:39, 3 February 2012
 

Did you or anyone else find a solution to this problem yet? I'm running into it as well.

206.86.87.3 17:23, 5 June 2012
 

Unicode diacritic characters in dumped HTML

My characters, mostly diacritics, seem to be changed in the dumped HTML files.

Examples:

  • Saṃyutta Nikāya -> Sa峁儁utta Nik膩ya
  • … -> 鈥�
  • Soṇadaṇḍa Sutta -> So峁嘺da峁囜笉a

Does anyone have a similar issue?

Thanks.

Benzwu 10:37, 10 January 2012

Bug with German umlauts in filenames of images

When there is an umlaut in the filename of an image, the image is saved in the dump, but under a wrong name, so the link in the HTML does not work. Does somebody know how to fix this?

--212.114.205.190 12:51, 29 September 2011 (UTC) 10:37, 10 January 2012

How to provide login data

I have MediaWiki 1.16.0 with the Lockdown extension installed, so a username and password are needed to view the wiki. It is a Windows system, so I used the modified version of DumpHTML that produces the hashed filenames. How can I provide a username and password to DumpHTML?

The first time I ran it, DumpHTML asked me to provide a --group parameter, and I did; DumpHTML then produced a lot of output. But I can't log in on the index page of the static wiki. Furthermore, some (many, most) pages and their paths are missing from the static wiki. The page "login required" ("Anmeldung erforderlich" in German) exists many, many times: every time a page is shown it is this one, each under a different path and filename. I tried to give 'read' permission to * in LocalSettings.php, but then the extension produces a static wiki without any style or pictures, and even many of the linked pages and paths do not exist.

—The preceding unsigned comment was added by an unknown user on an unknown date. 10:35, 10 January 2012

Get the version for MW 1.16 - there is a --group parameter. Use the group "user" (sysop doesn't work for me).

--212.114.205.190 11:48, 29 September 2011 (UTC) 10:36, 10 January 2012
 

Minor: pages vs. page titles

"If you intend to use the wikidump on a CD/DVD or on a Windows filesystem, and if your wiki pages or files had non-ASCII characters, which is likely, then you probably need to change the link references,"

Should this be "if your wiki page titles or filenames had non-ASCII characters"? If so, it would be much clearer.

—The preceding unsigned comment was added by an unknown user on an unknown date. 10:34, 10 January 2012

PostgreSQL and MediaWiki 1.12.0

If the error is

DB connection error: No database connection

It may be a problem logging into the database. Looking into the PostgreSQL logs revealed that I had to adapt pg_hba.conf.
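The entry that had to exist has this general shape (names and addresses are illustrative; see the PostgreSQL documentation for the exact fields):

 # TYPE  DATABASE  USER      ADDRESS        METHOD
 host    wikidb    wikiuser  127.0.0.1/32   md5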

--Albert25 11:45, 18 January 2011 (UTC) 10:33, 10 January 2012

This is the error message I get when I try to execute the dumpHTML.php file on my local machine. Does anybody know a fix for that?

—The preceding unsigned comment was added by an unknown user on an unknown date. 10:34, 10 January 2012
 

I have a similar problem which I can't solve:

DB connection error: A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond. (localhost)
MediaWiki 1.13.3, Win7, MoWeS web server; I just installed PHP 5.3 without web server support to execute dumpHTML.php.
I can trace the problem to includes/db/LoadBalancer.php, function reallyOpenConnection: the call $db = new $class( $host, $user, $password, $dbname, 1, $flags ); just times out.
The database class is DatabaseMysql (I can't find the source); host, database name, user and password are correct.
Any help?
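One configuration guess that might be worth trying (purely an assumption, not a verified fix: on Windows, "localhost" sometimes resolves differently from a literal address) is pinning the database server in LocalSettings.php:

 $wgDBserver = "127.0.0.1";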
—The preceding unsigned comment was added by an unknown user on an unknown date. 10:34, 10 January 2012