Saturday, 6 August 2016

Speeding up Dokuwiki

I'm a big fan of Dokuwiki.
  • its simple
  • has a great ecosystem of plugins
  • has great performance
But some time ago I decided there was room for improvement so I wrote a very simple framework (itself implemented as a Dokuwiki plugin). I've just uploaded this at Github.

Specifically this allows for:
  • Much faster page loading using PJAX
  • Pure javascript/CSS extensions - no PHP required
  • Prevents Javascript injection by page editors
The PJAX page loading requires small changes to the template to exclude all but the page specific content (i.e. navigation elements and the rendered markup) when a PJAX request is made. Instead you just return a well-formed HTML fragment when the request is flagged as coming from PJAX. There is an example template here. While the template this is based on is already rather complex, the actual changes to this, or any existing template, are only a few lines of code - see the diff in the README.

This saves me around 450 milliseconds per page in loading time:

The savings come from not having to parse the CSS and Javascript on the browser. The serverside content generation time is not noticeably affected.

But even if you are not using Dokuwiki you can get the same benefits using PJAX on your CMS of choice.

A strict Content Security Policy provides great protection against XSS attacks. But the question then arises how to get run-time generated data routed to the right bit of code. Jokuwiki solves by this embedding JSON in data-* attributes including the entry point for execution.

Wednesday, 22 June 2016

Faster and more Scalable Session handling

Chapter 18 (18.13 specifically) looks at PHP session handling, which can be a major bottleneck. I suggested there were options for reducing the impact, top of which was to use a faster substrate for the session data, but no matter how fast the storage it won't help with the fact that (by default) control over concurrency is implemented by locking the data file.

While I provided an example in the book of propagating authentication and authorization information securely via the URL, removing the need to open the session in the linked page, sometimes you need access to the full session data.

Recently I wrote a drop in replacement for the default handler which is completely compatible with the default handler (you can mix and match the methods in the same application) but which does not lock the session data file. It struck me that there were lots of things the session handler was doing and which a custom handler might do. Rather than create every possible combination of storage / representation / replication / concurrency control, I adapted my handler API to allow multiple handlers to be stacked to create a custom combination.

The code (including an implementation of the non-blocking handler) is available on PHPClasses.

One thing I omitted to mention in the book is that when session_start() is called it sets the Cache-control and Expires headers to prevent caching:

Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

If you want your page to be cacheable, then there is a simple 2 step process:

  1. Check - are you really, REALLY sure you want the content to be cacheable and use sessions? If you are just implementing access control, then the content *may* be stored on the users disk.
  2. Add an appropriate set of headers after session_start();
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 3600));
header('Cache-Control: max-age=3600'); 
header('Varies: Cookie'); // ...but using HTTPS would be better 

Monday, 16 March 2015

Accurate capacity planning with Apache - protecting your performance

While most operating systems support some sort of virtual memory, if the system starts paging memory out to disk, performance will take a nose dive. But performance will typically be heavily degraded even before it runs out of memory as the applications start stealing memory used for I/O caching. Hence setting an appropriate value for ServerLimit in Apache (or the equivalent for any multi-threaded/multi-process server) is good practice. For the remainder of the document I will be specifically focussing on Linux, but the theory and practice apply to all flavours of Unix and MSWindows too.

Tracking resource usage of the system as a whole is also good practice – but beyond the scope of what I'll be talking about today.

The immediate problem is determining what an appropriate limit is.

For pre-fork Apache 2.x, the number of processes is constrained by the serverLimit setting

For most systems the limit will be driven primarily by the amount of memory available. But trying to workout how much memory a process uses is actually surprisingly difficult. The executable code is memory mapped files – these are typically readonly and shared between processes.

Running 'strace /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf' causes over 4000 files to be “loaded” on my local Linux machine. Actually few of the are read from disk – they are shared object files already in memory which the kernel then presents at an address accessible to the httpd process. Code is typically loaded into such shared, read only pages. Linux has a further way of conserving memory. When it needs to copy memory which might be written to, the copy is deferred until a process attempts to write to the memory.
The net result is that the actual footprint on the physical memory is much, much less than the size of the address space that the process has access to.

Different URLs will have different footprints, and even different clients can affect the memory usage. Here is a typical distribution of memory usage per httpd process:

This is further complicated by the fact that our webserver might be doing other things – running PHP, MySQL and a mailserver being obvious cases – which may or may not be linked to the volume of HTTP traffic being processed.

In short, trying to synthetically work out how much memory you will need to support (say) 200 concurrent requests is not practical.

The most effective solution is to start with an optimistic guess for serverLimit, and set MaxSpareServers to around 5% of this value. Note that after the data capture exercise, you should up MaxSpareServers to around 10% of serverLimit +3. Then measure how much memory is unused. To do that you'll need to set up a simple script running periodically as a daemon or from cron, capturing the output of the 'free' command and the number of httpd processes.

Here I've plotted the total memory used (less buffers and cache) against the number of httpd processes:

This system has 1Gb of memory. Without any apache instances running, the usage would be less than the projected 290Mb – but that is outwith the bounds we expect to be operating in. From 2 httpd processes upwards, the average size and variation in size for each httpd process is very consistent – but since the variation in size is consistent that means the size of the total usage envelope will expand as the number of processes increases. The dashed red line is 2 standard deviations above the average usage, and hence there is a 97.5% probability that memory usage will be below the dashed line.
I want to have around 200kb available for the VFS, so here, my ServerLimit is around 175.

Of course the story doesn't end there. How do you protect the server and manage the traffic effectively as it approaches the serverLimit? How do you reduce the memory usage per httpd process to get more capacity? How do you turn around requests faster and therefore reduce concurrency? And how do you know how much memory to set aside for the VFS?

For help with finding the answers, the code run here and more information on capacity and performance tuning Linux, Apache, MySQL and the book!

If you would like to learn more about how Linux Memory Management then this (731 page) document is a very good guide:

Monday, 2 March 2015

Making stuff faster with curl_multi_exec() and friends

Running stuff in parallel is a great way to solve some performance problems. My post on long running processes in PHP on my other blog continues to receive a lot of traffic - but a limitation of this approach (and any method which involves forking) is that it is hard to collate the results.
In the book I recommended using the curl_multi_ functions as a way of splitting a task across multiple processing units although I did not provide a detailled example.
I recently had cause to write a new bit of functionality which was an ideal for the curl_multi_ approach. Specifically I needed to implement a rolling data quality check checking that a few million email addresses had valid MX domain records. The script implementing this would be spending most of its time waiting for a response from the DNS system. While it did not have to be as fast as humanly possibly, the 50 hours it took to check the addresses one at a time was just a bit too long - I needed to run the checks in parallel.
While the PHP Curl extension does resolve names in order to make HTTP calls - it does not expose this as a result, and the targets were not HTTP servers, so I wrapped the getmxrr() function in a simple PHP script running at http://localhost.
To refresh my memory on the parameters past and the values returned I went and had a look at the PHP documentation.
The example of how to use the function in the curl_multi_exec() page is somewhat Byzantine:

do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);

OMG - 3 seperate calls to curl_multi_  functions in 3 loops!

It doesn't exactly make it obvious what's going on here. It turns out that the guy who wrote it has since posted an explanation.

There are certain advantages to what the developer is trying to do here, but transparency is not one of them.

The example code in curl_multi_add_handle() is clearer, but somewhat flawed:

do {
} while($running > 0);

To understand what's really happening here, you need to bear in mind that the curl_multi_exec() is intended for implementing asynchronous fetching of pages - i.e. it does not block until it completes. In other words the above will run in a tight loop burning up CPU cycles while waiting for the responses to come in. Indeed it may actually delay the processing of the responses!
Now curl_multi_exec has a lot of work to do. For each instance registered it needs to resolve the vhost name, carry out a TCP handshake, possibly an SSL negotiation, send the HTTP request then wait for a response. Interestingly, when testing against localhost, it does nothing visible on the first invocation while it seems to at least as far as sending the HTTP requests on the second iteration of the loop, regardless of the number of requests. That means that the request has been dispatched to the receiving end, and we can now use our PHP thread to do something interesting / useful while we wait for a response, for example pushing the HEAD of your HTML out to the browser so it can start fetching CSS and (deferred) Javascript (see section 18.11.1 in the book).
Of course, even if I were to confirm that the TCP handshake runs in the second loop, and find out where any SSL handshake took place, there's no guarantee that this won't change in future. We don't know exactly how many iterations it takes to dispatch a request, and timing will be important too.
But it might be why the person who wrote the example code above split the functionality across the 2 consecutive loops - to do something useful in between. However on my local PHP install, on the first iteration through the loop it returns a 0 and CURLM_CALL_MULTI_PERFORM is -1. So the first loop will only run once, and won't send the requests (I tested by adding a long sleep after the call).

Hence I suggest that a better pattern for using curl_multi_exec() is:

do {
        curl_multi_exec($mh, $active);
        if ($active) usleep(20000);
} while ($active > 0);

The usleep is important! This stops the process from hogging the CPU and potentially blocking other things (it could even delay processing of the response!).

We can actually use the time spent waiting for the requests to be processed to do something more useful:


for ($x=0; $x<=3 && $active; $x++) {
        curl_multi_exec($mh, $active);
        // we wait for a bit to allow stuff TCP handshakes to complete and so forth...


do {
        curl_multi_exec($mh, $active);
        if ($active) usleep(20000);
} while ($active > 0);

Here the executions of curl_multi_exec() are split into 2 loops. From experimentation it seems it takes up to 4 iterations to properly despatch all the requests - then there is a delay waiting for the request to cross the network and be serviced - this is where we can do some work locally. The second loop then reaps the responses.

The curl_multi_select function can also be called with a timeout - this makes the function block, but allows the script to wake up early if there's any work to do...


for ($x=0; $x<=4 && $active; $x++) {
        curl_multi_exec($mh, $active);
        // we wait for a bit to allow stuff TCP handshakes to complete and so forth...      

        curl_mutli_select($mh, 0.02) 


do {

        // wait for everything to finish...
        curl_multi_exec($mh, $active);
        if ($active) {

            curl_mutli_select($mh, 0.05);
        // until all the results are in or a timeout occurs
} while ($active > 0 && (MAX_RUNTIME<microtime(true)=$started;);

One further caveat is that curl_multi_exec() does not check for the number of connections to a single host - so be careful if you are trying to send a large number of requests to the same host (see also 6.12.1 in the book).

Did it work? Yes, for up to 30 concurrent requests to localhost, the throughput increased linearly.

Thursday, 1 January 2015

File Formats in Wild, Wild West - a review of eReaders for (not very) complex layouts

Having looked around at various ebook publishers, some common features emerged. Most supported the same set of distribution channels. All accepted MSWord manuscripts. Most could distribute in Kindle (AZW/Mobi, AZW3), Epub and PDF formats. And yet all had what seemed like Draconian formatting standards. Obviously I wasn't expecting any issues with a PDF as a final format, but I did expect that the other, HTML based formats, would have no trouble with a fairly restricted set of set of HTML with limited CSS – specifically:
  • H1, H2, H3 and H4 headings
  • Bold, italic and underlined normal text
  • A monospace text for code samples
  • PNG images (anchored as paragraphs)
In addition, having explicit page breaks at the top of each chapter was a nice to have.

While Calibre seemed to cope with a reasonable conversion of the original OpenOffice manuscript, there was a lot of artefacts in the generated epub, and problems with image sizing. Rather than tinker directly with the epub or OpenOffice file, I exported the file as html from OpenOffice and set to cleaning it up. I was able to script most of this, but it still required hand editing to produce sensible, well-formed HTML (and to clean up the non-visible artefacts from OpenOffice, such as <I>unnecessarily</I><I> interrupted</I> <I>tags</i>). This approach had the desired outcome; converting the HTML to epub in Calibre eliminated all the conversion artefacts.

I now had something I could proof read in an e-reader.

In my research, I saw very few technical ebooks. I soon found out the reason why. And the reason for the restrictive style guides. It seems that very few readers are capable of visually representing the layout defined in the file – despite the file format explicitly supporting the HTML elements. In short, they did not properly implement the file format they claimed. Indeed missing it by a wide margin in many cases.

Here is a sample of the programs I tried on my Lenovo Yoga 10 (Android):
name Monospace
sans font? serif font? borders
on divs?
backgrounds notes
Moon+ eReader no no yes no sort of unexpected page breaks injected
Aldiko Reader no no yes no no unexpected reformatting of paragraphs
FB reader no yes no no no
eBook reader
by Vadim Lopatin
yes no yes no no
eBook reader
no no yes yes no
eReader Prestigious no no yes no no random blocks of whitespace,
particularly adjacent to imgs
epub reader
yes yes yes yes yes no margins, poor kerning and difficult navigation
UB reader yes yes yes yes yes does not ALIGN=CENTER
eBook reader
yes yes yes yes yes Does not pre-render prev/next pages
making turning pages slow and disconcerting.
Like UB reader, does not ALIGN=CENTER
Gitden Reader yes yes yes yes yes seems to be making its own mind up about image sizes
but making sensible choices. Otherwise strong
compatability and mathml support

(I was unable to open books stored on the local filesystem using Scribd or Kindle)
Overall, Gitden reader stood out as the most capable eReader for rendering layouts, with UB reader in a close second.
Although all were able to visually represent different header levels, the problems with fonts, borders and backgrounds meant that I have had to use all three techniques to have a reasonable chance that code samples will be visually different from body text in the rendered document.

Wednesday, 24 December 2014

Welcome to the LAMP End-to-End Blog

Welcome to the LAMP End-to-End Blog

I'm starting this blog as a more dynamic face for a set of eBooks. The first in the series will be LAMP Performance End-to-End. 

Watch this space for details of the publication date and supplementary articles.