Monday 16 March 2015

Accurate capacity planning with Apache - protecting your performance

While most operating systems support some sort of virtual memory, if the system starts paging memory out to disk, performance will take a nose dive. But performance will typically be heavily degraded even before the system runs out of memory, as applications start stealing memory used for I/O caching. Hence setting an appropriate value for ServerLimit in Apache (or the equivalent for any multi-threaded/multi-process server) is good practice. For the remainder of this document I will be focussing specifically on Linux, but the theory and practice apply to all flavours of Unix and to MS Windows too.

Tracking resource usage of the system as a whole is also good practice – but beyond the scope of what I'll be talking about today.

The immediate problem is determining what an appropriate limit is.

For pre-fork Apache 2.x, the number of processes is constrained by the ServerLimit setting.

For most systems the limit will be driven primarily by the amount of memory available. But trying to work out how much memory a process uses is actually surprisingly difficult. The executable code comes from memory-mapped files – these are typically read-only and shared between processes.

Running 'strace /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf' causes over 4000 files to be “loaded” on my local Linux machine. Actually, few of these are read from disk – they are shared object files already in memory, which the kernel simply presents at an address accessible to the httpd process. Code is typically loaded into such shared, read-only pages. Linux has a further way of conserving memory: copy-on-write. When it needs to copy memory which might be written to, the copy is deferred until a process actually attempts to write to it.
The net result is that the actual footprint on the physical memory is much, much less than the size of the address space that the process has access to.
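
You can see the difference for yourself by comparing a process's address-space size with the amount of physical memory it actually occupies. Below is a minimal sketch (my own illustration, assuming a Linux /proc filesystem) which sums the Size, Rss and Pss fields from /proc/<pid>/smaps; Pss shares the cost of shared pages out between the processes using them, so it is the closest single number to a "real" per-process footprint.

<?php
// Sketch: compare a process's address-space size with its resident and
// proportional footprints using /proc/<pid>/smaps (Linux only).
// Usage: php smaps_summary.php <pid>   (defaults to this PHP process)

$pid   = isset($argv[1]) ? (int)$argv[1] : getmypid();
$lines = @file("/proc/$pid/smaps");
if ($lines === false) {
    die("Cannot read /proc/$pid/smaps - check the pid and your permissions\n");
}

$totals = array('Size' => 0, 'Rss' => 0, 'Pss' => 0);
foreach ($lines as $line) {
    // the lines of interest look like "Rss:       1234 kB"
    if (preg_match('/^(Size|Rss|Pss):\s+(\d+) kB/', $line, $m)) {
        $totals[$m[1]] += (int)$m[2];
    }
}

printf("Address space (Size): %8d kB\n", $totals['Size']);
printf("Resident      (Rss):  %8d kB\n", $totals['Rss']);
printf("Proportional  (Pss):  %8d kB\n", $totals['Pss']);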

Different URLs will have different footprints, and even different clients can affect the memory usage. Here is a typical distribution of memory usage per httpd process:

This is further complicated by the fact that our webserver might be doing other things – running PHP, MySQL and a mailserver being obvious cases – which may or may not be linked to the volume of HTTP traffic being processed.

In short, trying to synthetically work out how much memory you will need to support (say) 200 concurrent requests is not practical.

The most effective solution is to start with an optimistic guess for ServerLimit, and set MaxSpareServers to around 5% of this value. (After the data-capture exercise you should increase MaxSpareServers to around 10% of ServerLimit + 3.) Then measure how much memory is unused. To do that you'll need a simple script, running periodically as a daemon or from cron, capturing the output of the 'free' command and the number of httpd processes.
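
A minimal sketch of the sort of capture script I mean is below; it appends one sample per run to a CSV file. The process name ('httpd'), the log path and the use of 'free -m' and 'pgrep' are assumptions – adjust them for your own system (for example the process is 'apache2' on Debian-family distributions).

<?php
// Data-capture sketch, intended to be run from cron every minute or so.
// Records: timestamp, number of httpd processes, and memory used excluding
// buffers/cache (taken from the "-/+ buffers/cache" line of 'free -m').

$nproc = (int)trim(shell_exec('pgrep -c httpd'));

// e.g. "-/+ buffers/cache:        412        611"
$line    = trim(shell_exec("free -m | grep 'buffers/cache'"));
$parts   = preg_split('/\s+/', $line);
$used_mb = isset($parts[2]) ? (int)$parts[2] : -1;

$fp = fopen('/var/log/httpd_memory.csv', 'a');
fputcsv($fp, array(date('Y-m-d H:i:s'), $nproc, $used_mb));
fclose($fp);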

Here I've plotted the total memory used (less buffers and cache) against the number of httpd processes:


This system has 1Gb of memory. Without any Apache instances running, the usage would be less than the projected 290Mb – but that is outwith the bounds we expect to be operating in. From 2 httpd processes upwards, both the average size and the variation in size of each httpd process are very consistent – but because each process contributes its own variation, the envelope of total usage widens as the number of processes increases. The dashed red line is 2 standard deviations above the average usage, and hence there is roughly a 97.5% probability that memory usage will fall below the dashed line.
I want to have around 200Mb available for the VFS, so here my ServerLimit is around 175.
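
If you prefer to calculate the limit rather than read it off the chart, the arithmetic is: fit a straight line to memory-used versus process-count, add two standard deviations of the residuals, and solve for the process count at which that projection reaches (total memory minus the amount you want to keep free for the VFS). A rough sketch is below – the sample figures are purely illustrative, not the data behind the chart above.

// Sketch: derive ServerLimit from (httpd process count, memory used) samples.
// The sample data, 1024Mb total memory and 200Mb VFS reserve are illustrative only.

$samples = array(                   // array(process_count, memory_used_Mb)
    array(10, 322), array(25, 360), array(50, 445),
    array(75, 512), array(100, 595), array(120, 648),
);

$n = count($samples);
$sx = $sy = $sxx = $sxy = 0;
foreach ($samples as $s) {
    list($x, $y) = $s;
    $sx += $x; $sy += $y; $sxx += $x * $x; $sxy += $x * $y;
}

// least-squares fit: memory_used = $slope * processes + $intercept
$slope     = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);
$intercept = ($sy - $slope * $sx) / $n;

// standard deviation of the residuals around the fitted line
$ss = 0;
foreach ($samples as $s) {
    list($x, $y) = $s;
    $ss += pow($y - ($slope * $x + $intercept), 2);
}
$sd = sqrt($ss / $n);

$total_mb    = 1024;    // physical memory
$vfs_reserve = 200;     // memory to keep free for the VFS

// highest process count where (fit + 2 standard deviations) still fits
$server_limit = floor(($total_mb - $vfs_reserve - $intercept - 2 * $sd) / $slope);
printf("Suggested ServerLimit: %d\n", $server_limit);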

Of course the story doesn't end there. How do you protect the server and manage the traffic effectively as it approaches the ServerLimit? How do you reduce the memory usage per httpd process to get more capacity? How do you turn around requests faster and therefore reduce concurrency? And how do you know how much memory to set aside for the VFS?

For help with finding the answers, the code used here, and more information on capacity and performance tuning for Linux, Apache, MySQL and PHP... buy the book!

If you would like to learn more about how Linux memory management works, then this (731 page) document is a very good guide:



Monday 2 March 2015

Making stuff faster with curl_multi_exec() and friends

Running stuff in parallel is a great way to solve some performance problems. My post on long-running processes in PHP on my other blog continues to receive a lot of traffic - but a limitation of that approach (and of any method which involves forking) is that it is hard to collate the results.
In the book I recommended using the curl_multi_ functions as a way of splitting a task across multiple processing units, although I did not provide a detailed example.
I recently had cause to write a new bit of functionality which was an ideal fit for the curl_multi_ approach. Specifically, I needed to implement a rolling data-quality check, verifying that a few million email addresses had valid MX domain records. The script implementing this would spend most of its time waiting for a response from the DNS system. While it did not have to be as fast as humanly possible, the 50 hours it took to check the addresses one at a time was just a bit too long - I needed to run the checks in parallel.
While the PHP Curl extension does resolve names in order to make HTTP calls, it does not expose the result of that lookup, and the targets were not HTTP servers anyway, so I wrapped the getmxrr() function in a simple PHP script running at http://localhost.
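The wrapper doesn't need to be anything clever – something along these lines will do (the script name, the 'domain' parameter and the JSON output are my choices for illustration, not the exact code I used):

<?php
// mx_check.php - trivial wrapper exposing getmxrr() over HTTP, served from
// http://localhost/mx_check.php?domain=example.com
// (script name, parameter name and output format are illustrative)

header('Content-Type: application/json');

$domain = isset($_GET['domain']) ? trim($_GET['domain']) : '';
if ($domain === '') {
    echo json_encode(array('error' => 'no domain supplied'));
    exit;
}

$hosts  = array();
$has_mx = getmxrr($domain, $hosts);

echo json_encode(array(
    'domain' => $domain,
    'has_mx' => $has_mx,
    'hosts'  => $hosts,
));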
To refresh my memory on the parameters passed and the values returned, I went and had a look at the PHP documentation.
The example of how to use the function on the curl_multi_exec() page is somewhat Byzantine:



do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}

OMG - 3 separate calls to curl_multi_ functions in 3 loops!

It doesn't exactly make it obvious what's going on here. It turns out that the guy who wrote it has since posted an explanation.

There are certain advantages to what the developer is trying to do here, but transparency is not one of them.

The example code on the curl_multi_add_handle() page is clearer, but somewhat flawed:


do {
    curl_multi_exec($mh,$running);
} while($running > 0);
 

To understand what's really happening here, you need to bear in mind that curl_multi_exec() is intended for implementing asynchronous fetching of pages - i.e. it does not block until it completes. In other words the above will run in a tight loop, burning up CPU cycles while waiting for the responses to come in. Indeed it may actually delay the processing of the responses!
Now curl_multi_exec() has a lot of work to do. For each instance registered it needs to resolve the host name, carry out a TCP handshake, possibly an SSL negotiation, send the HTTP request and then wait for a response. Interestingly, when testing against localhost, it does nothing visible on the first invocation, while it seems to get at least as far as sending the HTTP requests on the second iteration of the loop, regardless of the number of requests. That means that the request has been dispatched to the receiving end, and we can now use our PHP thread to do something interesting / useful while we wait for a response - for example pushing the HEAD of your HTML out to the browser so it can start fetching CSS and (deferred) Javascript (see section 18.11.1 in the book).
Of course, even if I were to confirm that the TCP handshake happens on the second iteration, and find out where any SSL handshake takes place, there's no guarantee that this won't change in future. We don't know exactly how many iterations it takes to dispatch a request, and timing will be important too.
But it might be why the person who wrote the example code above split the functionality across two consecutive loops - to do something useful in between. However, on my local PHP install the first call returns 0 while CURLM_CALL_MULTI_PERFORM is -1, so the first loop only runs once and doesn't send the requests (I tested by adding a long sleep after the call).

Hence I suggest that a better pattern for using curl_multi_exec() is:

do {
        curl_multi_exec($mh, $active);
        if ($active) usleep(20000);
} while ($active > 0);


The usleep is important! This stops the process from hogging the CPU and potentially blocking other things (it could even delay processing of the response!).
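
For completeness: in all of these snippets, $mh is a curl multi handle and $requests is an array of the individual easy handles added to it. The setup looks something like this (a sketch only, assuming the hypothetical mx_check.php wrapper above):

// one easy handle per address to check, all registered against the multi handle
$domains  = array('example.com', 'example.org', 'example.net');
$mh       = curl_multi_init();
$requests = array();

foreach ($domains as $domain) {
    $ch = curl_init('http://localhost/mx_check.php?domain=' . urlencode($domain));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // collect the body rather than echoing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $requests[$domain] = $ch;
}

// ...run one of the curl_multi_exec() loops discussed here, then collate:
foreach ($requests as $domain => $ch) {
    $result = json_decode(curl_multi_getcontent($ch), true);
    // ...record whether $domain has a valid MX record...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);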

We can actually use the time spent waiting for the requests to be processed to do something more useful:

$active=count($requests);

for ($x=0; $x<=3 && $active; $x++) {
        curl_multi_exec($mh, $active);
        // we wait for a bit to allow stuff like TCP handshakes to complete and so forth...
        usleep(10000);
}

do_something_useful();

do {
        curl_multi_exec($mh, $active);
        if ($active) usleep(20000);
} while ($active > 0);


Here the executions of curl_multi_exec() are split into 2 loops. From experimentation it seems it takes up to 4 iterations to properly despatch all the requests - then there is a delay waiting for the request to cross the network and be serviced - this is where we can do some work locally. The second loop then reaps the responses.

The curl_multi_select function can also be called with a timeout - this makes the function block, but allows the script to wake up early if there's any work to do...

// maximum time (in seconds) we are prepared to wait for all the responses (example value)
define('MAX_RUNTIME', 30);

$active=count($requests);
$started=microtime(true);

for ($x=0; $x<=4 && $active; $x++) {
        curl_multi_exec($mh, $active);
        // we wait for a bit to allow stuff like TCP handshakes to complete and so forth...
        curl_multi_select($mh, 0.02);
}

do_something_useful();

do {
        // wait for everything to finish...
        curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 0.05);
            use_some_spare_cpu_cycles_here();
        }
        // until all the results are in or a timeout occurs
} while ($active > 0 && (microtime(true) - $started < MAX_RUNTIME));

One further caveat is that curl_multi_exec() does not check for the number of connections to a single host - so be careful if you are trying to send a large number of requests to the same host (see also 6.12.1 in the book).
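
If that is a concern, one simple (if blunt) way to cap the concurrency is to feed the work through in fixed-size batches rather than registering every handle at once. A sketch, assuming an $all_domains array and reusing the hypothetical setup above (the batch size of 30 reflects the result below):

foreach (array_chunk($all_domains, 30) as $batch) {
    $mh       = curl_multi_init();
    $requests = array();
    foreach ($batch as $domain) {
        $ch = curl_init('http://localhost/mx_check.php?domain=' . urlencode($domain));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $requests[$domain] = $ch;
    }

    do {                                     // wait for this batch to complete
        curl_multi_exec($mh, $active);
        if ($active) curl_multi_select($mh, 0.05);
    } while ($active > 0);

    foreach ($requests as $domain => $ch) {  // collate and tidy up
        // ...process curl_multi_getcontent($ch) for $domain...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}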

Did it work? Yes, for up to 30 concurrent requests to localhost, the throughput increased linearly.