Spammers target wikis because many of them are misconfigured. They use robots to submit page updates and new pages. I could understand it if they were adding advertising, redirecting to other sites or adding Trojans, but they’re not. The page updates are nonsense like:
AQqGJP , [url=http://cqvbgfwzwypk.com/]cqvbgfwzwypk[/url], [link=http://pfjbydaloorn.com/]pfjbydaloorn[/link], http://bkxerafkfcgr.com/
or random sentences like:
Thank God! Someone with barnis speaks!
It serves no purpose other than giving me more work to do. Mind you, on this occasion I got a blog article out of it, so it’s not all bad!
In spite of having my configuration set the way I want it, they have somehow managed to bypass my settings; there is probably a bug in MediaWiki, so the next job is to install the latest version.
The central table inside MediaWiki is mw_revision. It holds all the versions of each page. The history of a document is a timestamp-ordered list of revision records.
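To orient ourselves, the relationships used throughout this article are: mw_page.page_latest points at the current row in mw_revision, and mw_revision.rev_text_id points at the wikitext stored in mw_text.old_id. A quick sanity-check query along those lines (a sketch only, using the same table and column names as the rest of this article):
-- Show each page together with its current revision and the start of its text.
SELECT page_id, page_title, rev_id, rev_timestamp,
       SUBSTR(old_text, 1, 40) snippet
FROM mw_page, mw_revision, mw_text
WHERE page_latest = rev_id
AND rev_text_id = old_id
ORDER BY page_id;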
MediaWiki uses hard-coded numbers in the page_namespace column of the mw_page table, so we will create a new table to hold the namespace names. This will make it easier to see which namespace our articles are in:
DROP TABLE IF EXISTS cats;
CREATE TABLE cats
(
id INT PRIMARY KEY
,name VARCHAR(20)
);
INSERT INTO cats(id, name) VALUES
(0, '')
,(-1, 'Special:')
,(-2, 'Media:')
,(1, 'Talk:')
,(2, 'User:')
,(3, 'User_talk:')
,(4, 'Project:')
,(5, 'Project_talk:')
,(6, 'File:')
,(7, 'File_talk:')
,(8, 'MediaWiki:')
,(9, 'MediaWiki_talk:')
,(10, 'Template:')
,(11, 'Template_talk:')
,(12, 'Help:')
,(13, 'Help_talk:')
,(14, 'Category:')
,(15, 'Category_talk:')
;
If we look at one of our pages:
SELECT page_id, CONCAT(name,page_title) full_name
FROM mw_page, cats
WHERE id = page_namespace
AND page_title LIKE '%How_to_re%';
We can see that there are 2 pages: the original and an extra page with the same name but in a different namespace.
+---------+-----------------------------------------+
| page_id | full_name |
+---------+-----------------------------------------+
| 9 | How_to_report_problems_effectively |
| 79 | Talk:How_to_report_problems_effectively |
+---------+-----------------------------------------+
We will investigate each page in turn.
First, page 9, “How to report problems effectively”. Here is the SQL to look at the revision history of the page:
SELECT @page:=9;
SELECT rev_id, rev_page, rev_text_id,
SUBSTR(rev_comment,1,15) comment, rev_user,
rev_user_text, rev_timestamp, rev_len, rev_parent_id
FROM mw_revision
WHERE rev_page=@page
ORDER BY rev_timestamp;
We can see from the results below that the last 3 entries are spam.
+--------+----------+-------------+-----------------+----------+-----------------+----------------+---------+---------------+
| rev_id | rev_page | rev_text_id | comment | rev_user | rev_user_text | rev_timestamp | rev_len | rev_parent_id |
+--------+----------+-------------+-----------------+----------+-----------------+----------------+---------+---------------+
| 10 | 9 | 10 | New page: In a | 1 | Davidnewcomb | 20071210142814 | 4057 | 0 |
| 11 | 9 | 11 | | 1 | Davidnewcomb | 20071210142854 | 4022 | 10 |
| 13 | 9 | 13 | | 1 | Davidnewcomb | 20071211110321 | 4584 | 11 |
| 25 | 9 | 25 | | 1 | Davidnewcomb | 20071216172547 | 4603 | 13 |
| 79 | 9 | 79 | | 1 | Davidnewcomb | 20071220185710 | 6291 | 25 |
| 80 | 9 | 80 | | 1 | Davidnewcomb | 20071220192557 | 6938 | 79 |
| 770 | 9 | 770 | | 1 | Davidnewcomb | 20100322103154 | 6939 | 80 |
| 833 | 9 | 833 | wIojbsxYCEMAiHX | 0 | 195.251.148.31 | 20110608224510 | 45 | 770 |
| 841 | 9 | 841 | FAjVBqBMj | 0 | 50.19.100.18 | 20110609091850 | 59 | 833 |
| 854 | 9 | 854 | yrZXYJobVab | 0 | 201.248.150.100 | 20110609141813 | 135 | 841 |
+--------+----------+-------------+-----------------+----------+-----------------+----------------+---------+---------------+
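Before deleting anything, it is worth eyeballing the text of those revisions to confirm they really are junk. Something along these lines (using the rev_text_id values from the table above):
-- Peek at the stored wikitext for the suspect revisions.
SELECT old_id, SUBSTR(old_text, 1, 60) snippet
FROM mw_text
WHERE old_id IN (833, 841, 854);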
We must delete the text records associated with those revisions:
DELETE FROM mw_text
WHERE old_id IN (833, 841, 854);
Next we need to delete the revision entries.
DELETE FROM mw_revision
WHERE rev_id IN (833, 841, 854);
Now that the page is clean, we must patch up the mw_page table to make the page point to the latest remaining revision. First, record the timestamp of the most recent revision. Then, using the timestamp and the page number as an index, collect the latest revision number and the latest revision’s length.
SELECT @most_recent := MAX(rev_timestamp)
FROM mw_revision
WHERE rev_page = @page;
SELECT @latest := rev_id, @latest_len := rev_len
FROM mw_revision
WHERE rev_page = @page
AND rev_timestamp = @most_recent;
When we have everything we need, update the page table:
UPDATE mw_page SET
page_latest = @latest,
page_len = @latest_len
WHERE page_id = @page;
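A quick check that the page row now agrees with the surviving revisions never hurts; for example:
-- page_latest and page_len should now match the newest remaining revision.
SELECT page_id, page_latest, page_len
FROM mw_page
WHERE page_id = @page;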
Next we’ll check out the other page with the same name:
SELECT @page:=79;
SELECT rev_id, rev_page, rev_text_id, substr(rev_comment,1,15) comment,
rev_user, rev_user_text, rev_timestamp, rev_len, rev_parent_id
FROM mw_revision
WHERE rev_page=@page
ORDER BY rev_timestamp DESC;
We can see there is only one entry and that is a spam entry.
+--------+----------+-------------+---------------+----------+---------------+----------------+---------+---------------+
| rev_id | rev_page | rev_text_id | comment | rev_user | rev_user_text | rev_timestamp | rev_len | rev_parent_id |
+--------+----------+-------------+---------------+----------+---------------+----------------+---------+---------------+
| 825 | 79 | 825 | QorQCGouPrXjN | 0 | 74.119.98.5 | 20110608164848 | 26 | 0 |
+--------+----------+-------------+---------------+----------+---------------+----------------+---------+---------------+
We must delete the text record associated with that revision, and then the revision entry itself:
DELETE FROM mw_text WHERE old_id IN (825);
DELETE FROM mw_revision WHERE rev_id IN (825);
This is a page we want to get rid of altogether, so we will just delete its entry from the page table.
DELETE FROM mw_page WHERE page_id = @page;
If you are going to be doing a lot of this, then here are some procedures to help. This SQL will produce a report to help you identify pages that have been attacked.
SELECT rev_id, rev_page,
    SUBSTR(CONCAT(name, page_title), 1, 10) full_name,
    SUBSTR(rev_comment, 1, 10) comment,
    rev_user_text,
    REPLACE(SUBSTR(old_text, 1, 50), '\n', '') txt
FROM mw_revision, mw_text, mw_page, cats
WHERE old_id = rev_text_id
AND rev_page = page_id
AND id = page_namespace
ORDER BY rev_page, rev_timestamp;
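In the listings above, the spam revisions all have rev_user set to 0 (anonymous edits, with an IP address in rev_user_text). If that pattern holds on your wiki, a rough first pass is to count anonymous edits per page and work from the worst offenders down; a minimal sketch:
-- Count anonymous (IP) edits per page as a rough spam indicator.
SELECT rev_page, COUNT(*) anon_edits
FROM mw_revision
WHERE rev_user = 0
GROUP BY rev_page
ORDER BY anon_edits DESC;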
This procedure helps to delete all the artefacts for a page.
DELIMITER //
CREATE PROCEDURE delete_page (IN p_id INT)
BEGIN
DELETE FROM mw_text WHERE old_id IN
(SELECT rev_text_id
FROM mw_revision
WHERE rev_page = p_id);
DELETE FROM mw_revision WHERE rev_page = p_id;
DELETE FROM mw_page WHERE page_id = p_id;
END //
DELIMITER ;
It can be called by:
CALL delete_page(123);
This procedure, by contrast, deletes all the revisions after a given point. To find the identifier of the last good version, use the web interface and go to the page that has been altered. Click the history link for the page and hover over the revision’s timestamp link; the number you want is in the oldid part of the URL, as in the example below. In this case we don’t need the page id, because a revision can only be part of one page. You can compare the selected revisions to make sure you have the correct one.
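For reference, a MediaWiki permalink typically carries the revision number in its query string; taking revision 770 of the page above purely as an illustration (the hostname is a placeholder), the URL looks something like:
http://your.wiki.example/index.php?title=How_to_report_problems_effectively&oldid=770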
DELIMITER //
CREATE PROCEDURE delete_after_rev
(IN p_last_good_rev INT)
BEGIN
SELECT @page:= rev_page
FROM mw_revision
WHERE rev_id = p_last_good_rev;
DELETE FROM mw_text WHERE old_id IN
(SELECT rev_text_id FROM mw_revision
WHERE rev_page = @page AND rev_id > p_last_good_rev);
DELETE FROM mw_revision
WHERE rev_page = @page AND rev_id > p_last_good_rev;
SELECT @most_recent := MAX(rev_timestamp)
FROM mw_revision WHERE rev_page = @page;
SELECT @latest := rev_id, @latest_len := rev_len
FROM mw_revision
WHERE rev_page = @page AND rev_timestamp = @most_recent;
UPDATE mw_page SET
page_latest = @latest,
page_len = @latest_len
WHERE page_id = @page;
END //
DELIMITER ;
It can be called by:
CALL delete_after_rev(1234);
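After a round of deletions, it is worth checking that no page row has been left pointing at a revision that no longer exists. A sanity check along these lines (assuming the same table and column names used above) should return no rows:
-- Find pages whose page_latest no longer matches any revision.
SELECT page_id, page_title, page_latest
FROM mw_page
LEFT JOIN mw_revision ON rev_id = page_latest
WHERE rev_id IS NULL;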
When a page is overwritten by the spammer, it severs all the Category and Section links. A new version of the page is generated and cached. The side effects of this manifest themselves as incomplete Category pages, pages not appearing and searches not returning the correct results. Visit the Special:LonelyPages page to display the orphan pages; this special page can act as a guide to help you track down all the pages that have been tampered with.
To remove the page cache run this SQL:
TRUNCATE mw_objectcache;
As pages are requested they will be rebuilt and cached so there will be a short lag as this happens.
To rebuild the search indexes, drop to the command line and run the rebuildall.php maintenance script. There is an .htaccess file to prevent you from running this script via the web. I tried removing it but the script wouldn’t run, so it must be run from the command line.
cd <wiki-root>
php maintenance/rebuildall.php
My wiki is less than a hundred pages, so the rebuild only took a second. The output is here for your reference:
PHP Warning: xxx
** Rebuilding fulltext search index (if you abort this will break searching; run this script again to fix):
Dropping index...
Rebuilding index fields for 83 pages...
0
Rebuild the index...
Done.
** Rebuilding recentchanges table:
Loading from page and revision tables...
$wgRCMaxAge=7862400 (91 days)
Updating links and size differences...
Loading from user, page, and logging tables...
Flagging bot account edits...
Flagging auto-patrolled edits...
Deleting feed timestamps.
Done.
** Rebuilding links tables -- this can take a long time. It should be safe to abort via ctrl+C if you get bored.
Refreshing redirects table.
Starting from page_id 1 of 83.
Refreshing links table.
Starting from page_id 1 of 83.
Retrieving illegal entries from pagelinks... 0..0
Retrieving illegal entries from imagelinks... 0..0
Retrieving illegal entries from categorylinks... 0..0
Retrieving illegal entries from templatelinks... 0..0
Retrieving illegal entries from externallinks... 0..1
Done.
If you don’t have access to the command line, then running a web spider across your site would force the pages to rebuild and update their search caches, but it would take much longer.
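If you go down that route, something as simple as a recursive wget pointed at your own site will request every linked page. This is only a sketch; the hostname is a placeholder and you should be gentle with the request rate:
# Crawl the wiki so that each page gets re-parsed and re-cached.
wget --spider --recursive --level=inf --wait=1 http://your.wiki.example/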
When you are all done, you ought to clean up the procedures and tables that we have created because they are no longer needed:
DROP TABLE cats;
DROP PROCEDURE delete_page;
DROP PROCEDURE delete_after_rev;
Happy repairing!
Thanks to David Aldridge for giving me a starting point.