Educating the world

Our blog has over 10,000 readers a month

Connection balancing across NLB using IIS and MaxKeepAliveRequests

September 21st, 2011

I have been doing a lot of work lately with Network Load Balancer (NLB) which is Microsoft’s clustering solution and Microsoft Internet Information Services (IIS).

We have written a video transcoding application which sits under a RESTful front end provided by IIS. The transcoding application is CPU bound, that is, the CPU is the first place to bottleneck and prevent the computer from doing more work. The heavy CPU is caused by video transcoding. This involves reading a unit of video from a video server, converting it to another format and squirting it out to a client. Transcoding video is a pipeline process which means there are huge performance advantages in processing a series of consecutive video units in a read-ahead fashion.

A normal web server could handle 2 or 3 orders of magnitude more requests than ours. As a result we found that it was more difficult to load balance across an NLB cluster because the number of new incoming connections was relatively small.

The application suite has been designed to be stateless in order to allow it to fit into a cluster architecture. We want to be able to scale outward more easily so in order to support more clients we can just add more boxes.

Our experiments have shown that 1 PC can support about 10 simultaneous clients before the system’s performance degrades to unusable levels. For each new PC we add to the cluster, we can get another 8-10 clients.

We would like to keep each client talking to the same cluster node for a short period so that we can get the benefit of pipe-lining requests, while at the same time we need to make sure that clients can move between cluster nodes in order to keep the load evenly balanced across the cluster.

There are several configuration options across NLB, IIS and some custom code that needs to be configured in order to build a suitable solution.

Under IIS, HTTP KeepAlive allows a client to connect once, then make as many requests down the connection pipe as it likes before the client closes the connection. The server will hang on to each client until they go away. If KeepAlive is switched off the connection will be closed at the end of each request which may add significant overheads to dealing with clients that a geologically distant. HTTP KeepAlive works on layer 5 of the OSI model.

NLB has a similar option called Affinity. The Affinity can be either sticky or non-sticky (there are other states but for the purposes of this article they can all be condensed into these two). Stickiness ensures that the same client is always directed to the same cluster node. NLB works on layer 4 of the OSI model.

The simplest solution is to switch NLB Affinity to non-sticky and set HTTP KeepAlive to false. Each incoming request that arrives at the cluster will be directed to a choice of machines, make its request, get the data and then tear down everything and start again for the next request. With this set up we will not be able to take any advantage of the pipe-lining efficiency that could be had and as a result the platform will be able to support fewer clients overall.

Each one of these technologies has advantages and disadvantages. The advantage of using stickiness with NLB is that you can ensure that all requests for a client, for the lifetime of the client or that cluster node will be directed to the same place. That will be good for pipe-lining but bad for load balancing. The advantages and disadvantages for HTTP KeepAlive are similar except here you are at the mercy of what the client decides to do.

In experiments we have shown that if one of the nodes in the cluster goes down the NLB will notice and rebalance; diverting incoming traffic to another node in the cluster. The HTTP KeepAlive clients will simply reconnect to the next allocated node in the cluster and stay there for the rest of their lives. This means that when a downed node comes back up, it balances with the rest of the cluster to make sure the request distribution is correct. NLB will not sever existing connections so all the existing clients will stay where they are. Only new incoming connections will be allocated to the newly added cluster node. So what we find is that after a cluster node failure the rest of the nodes take up the slack and end up working extra hard, but when the failed node re-enters the cluster it sits there doing nothing.

If you were dealing with thousands of small requests it would be a different story; it probably wouldn’t matter so much because new clients are coming and going all the time.

What we need is a combination of KeepAlive and not KeepAlive on a non-sticky platform. Apache has a configuration option called MaxKeepAliveRequests. This option severs the connection to the client after this many requests (the default is 100). With this option we can have 100 consecutive requests over the same connection to enjoy the benefits of pipe-lining the requests and yet we are giving the system/platform a chance to balance itself on a regular basis.

IIS has no concept of limiting the number of requests a connection can service, which probably goes some way to explaining why IIS only has 15.73% of the web server market. I posted a question on ServerFault but didn’t get a satisfactory response. The one reply I did get was from some one saying that if my application was truly stateless I needed to switch off KeepAlive altogether and take the penalty for the re-connection. While the application is stateless there are advantages to be had from batching requests together. An answer of it can’t be done or is not supported is, in my opinion not an answer. What they actually mean is that it is not supported yet. In I.T. almost everything *is* possible as long as you know what to do.

IIS7 has a new pipeline module architecture that allows you to inject code into the processing of a request at any one of about 12 different stages. The run line passes through each module at each requested stage in order to modify the request’s response.

When the module is loaded in, it reads the MaxKeepAliveRequests number from the web.config. For each request that comes in the module will remember the remote host, remote port and how many requests have been serviced by that combination. When the request is in its final stage we’ll check to see if the number of serviced requests is bigger than MaxKeepAliveRequests. If it is then we can inject a Connection: close into the response. This will make its way through IIS, safely closing the connection on it’s way out.

Surprisingly there was a great deal of confusion on MSDN documentation, blogs and forums surrounding how to force a close after a request. I found that HttpResponse.Close() can chop the end off the reply, HttpApplication.CompleteRequest() didn’t work because the request’s run line was already inside the EndRequest section of the pipeline. So I went back to the specification and in RFC2616: Section 8 - Connections it talks about injecting Connection: close into the response header so that after the response is sent out the server closes the connection. The closure forces the client to reconnect. I tried this using a telnet client (and not a web browser) and can reveal that it is the server that closes the connection and not the client deciding.

I had thought about using the Session to store the request count but I didn’t think it would help. If a proxy server is talking to your cluster then it may be interleaving requests from several sources with different session identifiers. We are interested in the transport layer, and not the session layer. We must use values from the transport layer to differentiate the clients in order to spread the load.

Simply compile up this C# and add it to your IIS integrated process pipe line.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Diagnostics;
using System.Collections.Specialized;
using System.IO;
namespace WebApplication1
    public class MaxKeepAliveRequestsModule : IHttpModule
        int maxRequests = 0;
        Dictionary<string, KeepAliveClient> record = new Dictionary<string, KeepAliveClient>();
        public MaxKeepAliveRequestsModule()
            Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.con");
        public int MaxRequests
            get { return maxRequests; }
            set { maxRequests = value; }
        public void Init(HttpApplication context)
            Debug.WriteLine("Debug : creating MaxKeepAliveRequestsModule");
            string mrStr = System.Web.Configuration.WebConfigurationManager.AppSettings["MaxKeepAliveRequests"];
            maxRequests = validateMaxKeepAliveRequestsValue(mrStr);
            context.EndRequest += new EventHandler(OnEndRequest);
        private int validateMaxKeepAliveRequestsValue(string val)
            if (val == null || val.Length == 0)
                throw new ArgumentException("appSettings.MaxKeepAliveRequests is empty");
            int mr = Convert.ToInt32(val);
            if (mr < 1)
                throw new ArgumentException("appSettings.MaxKeepAliveRequests must be greater than zero: " + mr);
            return mr;
        public void Dispose()
            Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.Dispose");
        public void OnEndRequest(Object source, EventArgs e)
            Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.OnEndRequest");
            HttpApplication app = (HttpApplication) source;
            HttpRequest request = app.Context.Request;
            HttpResponse response = app.Context.Response;
            // Tried to use socket as the key, but don't seem to back access to it from here
            // Stream k = response.OutputStream;
            NameValueCollection serverVariables = request.ServerVariables;
            string k = serverVariables["REMOTE_HOST"] + ":" + serverVariables["REMOTE_PORT"];
            if (record.ContainsKey(k))
                KeepAliveClient c = record[k];
                Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.OnEndRequest: hit");
                if (c.Hits > maxRequests)
                    Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.OnEndRequest:max requests reached for " + k + "(" + c.Hits + "), force close connection to client");
                    // works, but may chop the end of the response
                    // response.Close();
                    // doesn't appear to work
                    // app.CompleteRequest();
                    response.Headers["Connection"] = "close";
                Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.OnEndRequest: miss");
                record.Add(k, new KeepAliveClient(k));
        private void cleanOldKeepAliveRecords()
            foreach (KeepAliveClient cc in record.Values.ToList())
                if (cc.isExpired())
                    Debug.WriteLine("Debug : MaxKeepAliveRequestsModule.cleanOldKeepAliveRecords: key=" + cc.Key);
    class KeepAliveClient
        private static TimeSpan TIMEOUT = new TimeSpan(1, 0, 0); // hour
        private DateTime now;
        private int hits;
        private string key;
        public KeepAliveClient(string key)
            this.key = key;
            now = DateTime.Now;
            hits = 1;
        public int Hits
            get { return hits; }
        public string Key
            get { return key; }
        public void touch()
            now = DateTime.Now;
        public bool isExpired()
            return now + TIMEOUT < DateTime.Now;

You’ll need to add the configuration option to the web.config


    <add key="MaxKeepAliveRequests" value="100"/>

Useful B2Evolution SQL

September 20th, 2011

While B2Evolution’s search and back office interfaces are excellent they cater for the general use case. Search is a good example of this as there are many things to search for but a limited amount of search criteria in the interface. There are also tasks you might want to perform that would normally mean spending hours going through the web interface.

SQL is the way to go. It allows you to update your blogs structure underneath in a couple of well crafted statements. So I thought I’d share them with you.

Deleting a category with many dependencies
I came across a problem with deleting categories. If there are lots of posts using the category you want to delete, then you have to go into each post by hand and change the dependency on the category.

Finding articles without any tags
I wrote a related article plug-in that relies on tags so I wanted to find out which of my live blog articles don’t have enough tags. This piece of SQL gives you a list of blog article titles followed by the number of tags that are associated with each article. I have added a filter to ignore articles that have more than a set number of tags.

SELECT @mintags := 2;
SELECT post_ID, post_title, count(itag_tag_ID) ntags FROM evo_items__item
LEFT JOIN evo_items__itemtag ON itag_itm_ID = post_ID
WHERE post_status = ‘published’
HAVING ntags < @mintags

You can pull out the post_id and use it in this SQL that displays the full set of tags for an article.

SELECT @postid := 30;
SELECT itag_itm_ID, tag_name
FROM evo_items__itemtag, evo_items__tag
WHERE itag_itm_ID = @postid
AND itag_tag_ID = tag_id

Viewing comments and deleting spam

SELECT comment_ID, comment_date, comment_author, comment_author_IP, SUBSTRING(TRIM(comment_content),1,150)
FROM evo_comments
WHERE comment_status="draft"
ORDER BY comment_author_IP

Then delete them all:

DELETE FROM evo_comments
WHERE comment_status="draft"

Or delete by IP address:

DELETE FROM evo_comments
WHERE comment_status="draft"
AND comment_author_IP IN (’xxx,xxx,xxx,xxx’, ‘yyy,yyy,yyy,yyy’)

Come back Eclipse all is forgiven

August 19th, 2011

I have complained about Eclipse IDE in the past because it can be a little tricky to set things up especially when it comes to web development and JEE. That all stops now! Recently, I have been using C-sharp (c#) with Visual Studio 10 and IIS 7 and have found it really buggy with in comprehensible documentation.

The C# API documentation is really annoying. Take a look at System.Threading.Timer for example. In the list of constructors if gives the type of the object but not what it is for! Generally it’s the other way around, if you know what it is for you can infer the type yourselves. In contrast take a look at the equivalent in Java. The constructor list of java.util.Timer gives you the type, what it is for and a very clear definition of what its purpose is. The MSDN documentation has you hopping all over the place trying to find out what various options are for and how to use them. If you can’t find what it is for or how to use it then you can’t even use the source because there isn’t any.

The nice thing about open technology is that everyone is thrown in together. We all know and understand that there isn’t much documentation and the support is generally provided at the expense of someone’s free time. The side effect of this is that there are loads of blogs and publicly available articles which detail how to do things. This is the opposite of what I have experienced while using Microsoft technologies. It could be the fact that Microsoft have written loads of documentation that makes people want to push through and understand it rather than thinking “this is rubbish, I’ll look for something else on the internet". Microsoft’s documentation is just that: documentation. You have to start from the beginning and plough through it until you get to the end. There is almost nothing on how to do a particular job or a Frequently Asked Question. Sure there are forums but you can spend your life in there and there are so many questions from thicky people who haven’t done *any* reading and don’t know or understand any of the concepts.

I had started this article with the title Debugging web applications under IIS 7 using Visual Studio 10 but have already changed the subject enough to make the article about something else. So I will finish off with a couple of Eclipse features that I’m really missing in Visual Studio 10 and maybe a few gripes with VS10.

  1. I’ve been trying to run an IIS module in debug mode. All the documentation says that the development environment comes with its own web server that is comparable with the normal IIS Web Server. Well try as I might I couldn’t get this to work. The options that looked like what I needed were greyed out in the web application properties and when I did finally get the debugger to launch it moaned about not being able to find the configuration files. The only way I could get the debugger to hit my breakpoint was to ditch the built in web server, publish to a different folder and set up IIS to have an application under a virtual folder.
  2. But it didn’t stop there! There is a conflict between IIS and Visual Studio because when you close Visual Studio down it forgets the setup so you have do it all again each time you start. VS10 opens your project and sees you are using a the local web server and have a project url. It checks the IIS configuration and says that there is already a virtual directory. The project loads in but most of the artefacts are missing. You have to delete the virtual directory in the IIS manager, reload the project, get VS10 to create the virtual folder again, go back to IIS manager and change the location of the directory it maps onto.
  3. In Eclipse, your can store you projects anywhere. We use backed up network drives (and Clearcase) to hold our projects. Visual studio doesn’t seem to like projects on remote file systems. You have to set up all sorts of additional trust arrangements with different hosts which is a massive pain.
  4. If the worse comes to the worse and you can’t find a problem you can download all the source code for all the libraries and step through every line of code. Not true with IIS - there is no code.
  5. You have to explicitly add files to the project. You can’t say everything in the folder is in my project, refresh and pick up the new files I’ve added. If you remove a file from the directory, you can’t delete it from the project. VS10 sees that it can’t delete the file and so won’t let you delete it from the project. Hitting refresh has no effect.
  6. Probably because you have to explicitly add all the files to the project there are loads of configuration files that have to be maintained.
  7. Can’t run the web application in situ.
  8. Visual Studio tries to manage all the artifices of the project so that when you change a bit of it it updates all the dependencies, but it doesn’t. There are loads of places where it doesn’t and they are not immediately obvious. You’ll only find then if you are trying to run in debug mode or something.
  9. Visual Studio never seems to tell you anything. Every time you do something in Eclipse it talks to you through the output window errors are easy to see.
  10. Switching on debug inside IIS or Visual Studio or the publisher is almost impossible.
  11. IIS and Visual Studio are only concerned with using other Microsoft technologies. So if you think that Microsoft SQLServer is a colossal waste of time and effort and you want to use MySQL instead then there is no help with setting that up.
  12. The debugger is, frankly, rubbish in comparison to Eclipse. Eclipse gives you instance id and all sorts of extra typing information which is linked into the “go to definition of” menus.
  13. Visual Studio wouldn’t let me write code and hot deploy, so you are constantly stopping editing, building, publishing and starting.
  14. Context menu while over the source is rubbish too. Eclipse seems to have a greater understanding of the make up of the language so it lets you jump to base classes, type definitions just by context clicking on a variable, class or a definition.
  15. When the command completion helper pops up, it only contains entries that are provided by the “using” statements you have used. So it’s not much help. Eclipse gives you a selection of commands that are available in related packages. When you select on of those commands Eclipse adds the import to the top of the file. With Visual Studio you have to hunt around for the interface or class you want, then find out which package it belongs to, then add the “using” statement to the top of the files and only then can you get it to auto-complete. Useless really!
  16. It is not possible to resize the auto-completion pop-up dropdown box. If there are loads of things to look at then it can take a while to search through using such a small viewport.
  17. When you hover over a call in the source, the tool tips pops up and then disappears after a second, usually before I’ve finished reading it. Eclipse lets you hoover over the tool tip and as a result to will change to something that stays open (until you click away from it). The most useful thing about this feature is that while the tool tip is in this semi-permanent state you can select the text contained within it. This is really useful when you are trying to get the assembly or class path to a class for a using statement. With VS10 you have to remember as much as you can and keep going back to re-read the tool tip.
  18. When you press control and space to pop up the auto-completion, Visual Studio draws a box around the option it thinks you want but it doesn’t highlight it. It usually gets the correct option, but because it isn’t highlighted you have to press an up or down arrow key to select it then you can press return to choose it. In Eclipse it highlights it so you just do a control+space then hit return. I can’t seem to get used to this useless extra step so I’m constantly adding carriage returns to my code which is breaking my flow.
  19. Each time there is network slowness then the whole of Visual Studio locks up. I think it rechecks every minute or so because it is ok for about a second then it will lock up for another minute. If you try and kill it, it ignores you. Whereas if you log out, it shuts down straight away. I’ve been so close to hitting the reset button. I wouldn’t mind but most of the time I’m not actually doing anything. Visual Studio must be constantly checking something in the background. Eclipse would be a bit sluggish but nowhere near as bad as Visual Studio - it’s absolutely unusable.
  20. When you are running something, both IDE’s debug messages are directed to the Output panel at the bottom of the screen. In Eclipse this panel is editable and in Visual Studio it is not. The most useful part of this feature is that it allows you to add a couple of blank lines to help break up all the text. When you are re-running an area of code the debug messages repeat and it is quite difficult on the eye to track through all that text looking for the beginning of your test. If you add a couple of blank lines then it is obvious where the start and end are.
  21. If you have lots of tabs in Visual Studio and you want to close some of them. You must click the tab, move to the other side of the screen to click the ‘X’ button, then back to where all the tabs are, then back to the ‘X’ button. Alternatively you can click the tab you want, context-click on the tab, move to second option ‘Close’. If you are using Eclipse, just click the ‘X’ on the tab - no dragging the mouse across the screen and back or phaffing with right-clicks.
  22. In Visual Studio when you try to view/update the project or solution properties, the window that appears is tiny. Almost every option goes off the end of the panel and most annoyingly the panel is not expandable. You find you are constantly going back and fore expanding one thing to see what the values should be. Eclipse lets you expand the options panel as large as you want. Eclipse even has a little search box that lets you search the option pages for the option you are looking for.
  23. Don’t even get me started on trying to compile or run a Visual Studio project from a network share. It is so buggy that it is unusable. The simple solution is to copy your files from the share to the local hard drive and work on it from there. Ironically there is a Clearcase integration plug-in for Visual Studio, but you can’t use it on dynamic views because Visual Studio can’t cope with network shares. There is a load of settings related to running or compiling projects on network shares but they don’t make any difference. The Microsoft Connect forums are full of bug reports about this where the last entry is “Microsoft are looking into it". Snapshot views reside on the local filesystem so they work fine. Eclipse will let you put your projects anywhere.
  24. Half of the internet is taken up with Visual Studio questions relating to “No native symbols in symbol file” and “Cannot find or open the PDB file” and most of them go unanswered. When you get the same problem under Eclipse, you just right-click on the warning popup and select specify location, go to the location where the symbols file is and load it in. Job done. Eclipse doesn’t care that the symbol file might not be next to the executable file. We can save literally thousands of hours.
  25. This is the one that grates me the most. In Visual Studio, when you create files using the right click on project name add, then Visual Studio fills the project file with all kinds of flags relating to the type of that file. So if you create the file separately and drop it in or change the extension of an existing file to a different type which might have flags then the file won’t behave like the other files of that type in the project. So I create a .cs file and rename it to .ashx.cs then add the missing .ashx file so that it looks like the other handlers but when I publish my new files don’t get published like the others do. There’s no way to find out what is missing, because all the icons look the same and the code is the same. So you spend all your time trying to figure out why one file is not working like all the others and the only thing you can do is create a new project and cut and paste all the source code. In Eclipse you right-click the project and select “Refresh". The Eclipse project can be recreated from all the artefacts in the project where as Visual Studio can not. Now I have to find out why my Session_Start() is not being called even though everything appears the same as the last project I did. Useless!

Sorry Eclipse all is forgiven.

Just as a foot note Mirco$oft have produced a free Visual Studio 10 plugin called “Productivity Power Tools” which goes some way to provide a feature set similar to Eclipse. It’s miles off the mark but it’s a start.

Fixing problems caused by updating Plesk

August 16th, 2011

Permanently deleting MediaWiki pages and revisions

August 3rd, 2011

Spammers target wiki’s because many of them are mis-configured. They use robot’s to submit page updates and page additions. I could understand it if they were adding advertising, redirecting to other sites or adding Trojans but they’re not. The page updates are nonsense like:

AQqGJP , [url=]cqvbgfwzwypk[/url], [link=]pfjbydaloorn[/link],

or random sentences like:

Thank God! Someone with barnis speaks!

It serves no purpose other than giving me more work to do. Mind you on this occasion I got a blog article out of it, so it’s not all bad!

In spite of having my configuration set the way I want it they have some how managed to by pass them; there is probably a bug in MediaWiki, so the next job is to install the latest version.

The central table inside MediaWiki is mw_revision. It holds all the versions of each page. The history of a document is a timestamped ordered list of revision records.

MediaWiki uses hard coded numbers to represent the namespace column value in the mw_page table, so we will create a new table to hold the values. These will make it easier to see which namespace our articles are in:

,name VARCHAR(20)
INSERT INTO cats(id, name) VALUES

If we look at one of our pages:

SELECT page_id, CONCAT(name,page_title) full_name
FROM mw_page, cats
WHERE id = page_namespace
AND page_title like ‘%How_to_re%’;

We can see that there are 2 pages: the original and an extra page with the same name but in a different namespace.

| page_id | full_name                               |
|       9 | How_to_report_problems_effectively      |
|      79 | Talk:How_to_report_problems_effectively |

We will investigate each page in turn.

First page 9 “How to report problems effectively". Here is the SQL to look at the revision history of the page:

SELECT @page:=9;
SELECT rev_id, rev_page, rev_text_id,
SUBSTR(rev_comment,1,15) comment, rev_user,
rev_user_text, rev_timestamp, rev_len, rev_parent_id
FROM mw_revision
WHERE rev_page=@page
ORDER BY rev_timestamp;

We can see from the results below that the last 3 entries are spam.

| rev_id | rev_page | rev_text_id | comment         | rev_user | rev_user_text   | rev_timestamp  | rev_len | rev_parent_id |
|     10 |        9 |          10 | New page:  In a |        1 | Davidnewcomb    | 20071210142814 |    4057 |             0 |
|     11 |        9 |          11 |                 |        1 | Davidnewcomb    | 20071210142854 |    4022 |            10 |
|     13 |        9 |          13 |                 |        1 | Davidnewcomb    | 20071211110321 |    4584 |            11 |
|     25 |        9 |          25 |                 |        1 | Davidnewcomb    | 20071216172547 |    4603 |            13 |
|     79 |        9 |          79 |                 |        1 | Davidnewcomb    | 20071220185710 |    6291 |            25 |
|     80 |        9 |          80 |                 |        1 | Davidnewcomb    | 20071220192557 |    6938 |            79 |
|    770 |        9 |         770 |                 |        1 | Davidnewcomb    | 20100322103154 |    6939 |            80 |
|    833 |        9 |         833 | wIojbsxYCEMAiHX |        0 |  | 20110608224510 |      45 |           770 |
|    841 |        9 |         841 | FAjVBqBMj       |        0 |    | 20110609091850 |      59 |           833 |
|    854 |        9 |         854 | yrZXYJobVab     |        0 | | 20110609141813 |     135 |           841 |

We must delete the text records associated with those revisions:

WHERE old_id IN (833, 841, 854);

Next we need to delete the revision entries.

DELETE FROM mw_revision
WHERE rev_id IN (833, 841, 854);

Now that the page is clean we must solder up the pages table to make that page point to the latest version. Firstly record the timestamp of the most recent revision. Using the timestamp and the page number as an index collect the latest revision number and the latest revision’s length.

SELECT @most_recent := MAX(rev_timestamp)
FROM mw_revision
WHERE rev_page = @page;

SELECT @latest := rev_id, @latest_len := rev_len
FROM mw_revision
WHERE rev_timestamp = @most_recent;

When we have everything we need, update the page table:

UPDATE mw_page SET
page_latest = @latest,
page_len = @latest_len
WHERE page_id = @page;

Next we’ll check out the other page with the same name:

SELECT @page:=79;
SELECT rev_id, rev_page, rev_text_id, substr(rev_comment,1,15) comment,
rev_user, rev_user_text, rev_timestamp, rev_len, rev_parent_id
FROM mw_revision
WHERE rev_page=@page
ORDER BY rev_timestamp DESC;

We can see there is only one entry and that is a spam entry.

| rev_id | rev_page | rev_text_id | comment       | rev_user | rev_user_text | rev_timestamp  | rev_len | rev_parent_id |
|    825 |       79 |         825 | QorQCGouPrXjN |        0 |   | 20110608164848 |      26 |             0 |

We must delete the text records associated with those revisions:

DELETE FROM mw_text WHERE old_id IN (825);
DELETE FROM mw_revision WHERE rev_id IN (825);

This is a page we want to get rid of so we will just delete the entry from the page table.

DELETE FROM mw_page WHERE page_id = @page;

If you are going to be doing lots of this then here are some procedures to help. This SQL will produce a report to help you identify pages that have been attacked.

SELECT rev_id, rev_page, SUBSTR(CONCAT(name, page_title), 1, 10) full_name, substr(rev_comment, 1, 10) comment, rev_user_text, replace(substr(old_text, 1, 50), “\n","") txt
FROM mw_revision, mw_text, mw_page, cats
WHERE old_id = rev_text_id
AND rev_page = page_id
AND id = page_namespace
ORDER BY rev_page, rev_timestamp;

This procedure helps to delete all the artefacts for a page.

CREATE PROCEDURE delete_page (IN p_id INT)
DELETE FROM mw_text WHERE old_id IN
(SELECT rev_text_id
FROM mw_revision
WHERE rev_page = p_id);

DELETE FROM mw_revision WHERE rev_page = p_id;
DELETE FROM mw_page WHERE page_id = p_id;
END //

It can be called by:

CALL delete_page(123);

Where as this SQL deletes all the revisions after a point. To find the identifier of the last good version, use the web interface and go to the page that has been altered. Click the history link for the page. Hover over the revision’s “addition time” link. The number you want is in the oldid part of the URL. In this case, we don’t need the page id because a revision can only be part of one page. You can compare the selected revisions to make sure you have the correct one.

CREATE PROCEDURE delete_after_rev
(IN p_last_good_rev INT)

SELECT @page:= rev_page
FROM mw_revision
WHERE rev_id = p_last_good_rev;

DELETE FROM mw_text WHERE old_id IN
(SELECT rev_text_id FROM mw_revision
WHERE rev_page = @page AND rev_id > p_last_good_rev);

DELETE FROM mw_revision
WHERE rev_page = @page AND rev_id > p_last_good_rev;

SELECT @most_recent := MAX(rev_timestamp)
FROM mw_revision WHERE rev_page = @page;

SELECT @latest := rev_id, @latest_len := rev_len
FROM mw_revision
WHERE rev_timestamp = @most_recent;

UPDATE mw_page SET
page_latest = @latest,
page_len = @latest_len
WHERE page_id = @page;

END //

It can be called by:

CALL delete_after_rev(1234);

When the page is over written by the spammer it severs all the Category and Section links. A new page is generated and cached. The side effects of this will manifest themselves as incomplete Category pages, pages not appearing and your searches not returning the correct results. Click the Special:LonelyPages page to display the Orphan pages; this special page can act as a guide to help you track down all the pages that have been tampered with.

To remove the page cache run this SQL:

TRUNCATE mw_objectcache;

As pages are requested they will be rebuilt and cached so there will be a short lag as this happens.

To rebuild the search indexes drop to the command line and run the rebuildall command. There is an .htaccess to prevent you from running this file via the web. I tried removing this but the program wouldn’t run, so it must be run from the command line.

cd <wiki-root>
php maintenance/rebuildall.php

My wiki is less that a hundred pages so the rebuild only took a second. The output is here for your reference:

PHP Warning:  xxx
** Rebuilding fulltext search index (if you abort this will break searching; run this script again to fix):
Dropping index...
Rebuilding index fields for 83 pages...

Rebuild the index...

** Rebuilding recentchanges table:
Loading from page and revision tables...
$wgRCMaxAge=7862400 (91 days)
Updating links and size differences...
Loading from user, page, and logging tables...
Flagging bot account edits...
Flagging auto-patrolled edits...
Deleting feed timestamps.

** Rebuilding links tables -- this can take a long time. It should be safe to abort via ctrl+C if you get bored.
Refreshing redirects table.
Starting from page_id 1 of 83.
Refreshing links table.
Starting from page_id 1 of 83.
Retrieving illegal entries from pagelinks... 0..0
Retrieving illegal entries from imagelinks... 0..0
Retrieving illegal entries from categorylinks... 0..0
Retrieving illegal entries from templatelinks... 0..0
Retrieving illegal entries from externallinks... 0..1

If you don’t have access to the command line then running a web spider across your site would force the pages to rebuild and update their search caches but it would take much longer.

When you are all done, you ought to clean up the procedures and tables that we have created because they are no longer needed:

DROP PROCEDURE delete_page;
DROP PROCEDURE delete_after_rev;

Happy repairing !

Thanks to David Aldridge for giving me a starting point.