Thursday 28 March 2024

Re-thinking OOM Killer

Swap is a performance killer for web applications. Once a component in your stack goes into swap it takes a very long time to recover. If you don't have the capability to move the traffic elsewhere quickly, restarting the service is often the best approach. Given the price of RAM, there's little reason not to add more as a long term fix. Indeed in recent years I have stopped provisioning swap at all on hosts. It felt wrong at first. I felt the same way the first time I built a computer without a floppy drive. While I've never had a reason to regret abandoning floppy disks, I am now questioning whether swap might matter after all.

Enter OOM Killer.

Linux has a facility called memory overcommit[1]. It is predicated on executables asking for more memory than they will actually use, and the complications of actually tracking memory usage on a modern operating system. I've talked about the latter before[2]. Put simply, the OS pretends it has more memory (RAM and swap) than is actually available. It is enabled on most (all?) Linux distributions. And most of the time it really does have a positive impact. But in extremis it can cause a lot of headaches. OOM Killer starts terminating executables when the OS realises that it can't satisfy all the promises it made to them about memory. And on a server host, that usually means terminating the one job the host is expected to do. It does not ask nicely.

If you go researching on the internet, you will find a LOT of articles recommending that you start playing with oom_score_adj e.g. this article on Baeldung.com [3]. OMG NO! This influences which process the OOM Killer will target first. It does not prevent the situation from arising. There might be an argument for maintaining admin access during an OOM event but if that's dependent on, for example, forking sshd, starting a new session, spinning up a shell then oom_score_adj is not going to help. The mechanism by which the OOM Killer chooses a victim is complex, so even if this were a valid approach, selecting the right values can only be sensibly done on a trial and error basis.

The approach I had been using up to now was 2-fold.

Attempt to tune the application to stay within designated boundaries
Set the memory overcommit limit to a fixed amount (I use the ratio, but this can also be set in kilobytes) and try to tune that ratio to the correct amount. The limit only applies when vm.overcommit_memory=2, and the kernel documentation defines it as swap plus overcommit_ratio percent of RAM, so a value of 20 allows commitments of up to swap + 20% of RAM. I have found 20 a good starting point on a host which has exhibited OOM Killer behaviour.

(but simply adding more RAM should always be considered!)
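
To make the arithmetic concrete, here is a minimal sketch of the commit limit calculation as I read it from the kernel's overcommit-accounting documentation[1]. The function name and the example host sizes are my own inventions for illustration, and the kernel works in pages rather than kilobytes, so treat the result as an approximation of the CommitLimit line in /proc/meminfo.

```python
def commit_limit_kb(ram_kb: int, swap_kb: int, overcommit_ratio: int,
                    hugetlb_kb: int = 0) -> int:
    """Approximate CommitLimit (kB) when vm.overcommit_memory=2.

    Follows the documented formula: swap plus overcommit_ratio percent
    of RAM (excluding memory reserved for huge pages).
    """
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


if __name__ == "__main__":
    # Hypothetical host: 64 GiB RAM, 16 GiB swap.
    ram_kb = 64 * 1024 * 1024
    swap_kb = 16 * 1024 * 1024
    for ratio in (20, 50, 100):
        limit_gib = commit_limit_kb(ram_kb, swap_kb, ratio) / (1024 * 1024)
        print(f"overcommit_ratio={ratio}: CommitLimit is about {limit_gib:.1f} GiB")
```

Comparing the result against Committed_AS in /proc/meminfo gives a rough idea of how much headroom a given ratio leaves.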

The problem with this is that it is still hit and miss. Either the ratio is too low or it's too high. And when it's too high, you only find out when OOM Killer does its thing.

Finally I get to the point....

Swap is bad. But the system can tolerate a little bit of swap usage before performance takes a nose dive. Further, allowing the system to start swapping (just a little) means we can actually see what the peak memory usage was! We have a basis for predicting how much overcommit we should actually allow.

While the gap is narrowing, SSD storage is still 5-10 times cheaper than RAM. Further, in most corporate environments, adding or removing storage is a much more minor exercise than changing RAM.

So my revised strategy for responding to OOM Killer is:

Ensure the monitoring is set to alert when the swap usage increases above base level
Provision swap – around 50% of RAM size
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=10

If you're seeing the system start to use swap and you can't slim down the application config then it's time to buy more RAM. If it looks like your system is not filling up its RAM, then increase the overcommit_ratio. Repeat until you start tickling the swap, then back off.
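
The monitoring step could be sketched along these lines. This is only an illustration: it assumes the usual SwapTotal and SwapFree fields in /proc/meminfo, and the function names, the baseline and the tolerance are placeholders I made up; a real check would live in whatever monitoring system you already run.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kb(meminfo: dict) -> int:
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def swap_alert(meminfo: dict, baseline_kb: int, tolerance_kb: int = 0) -> bool:
    """True when swap usage has risen above the agreed base level."""
    return swap_used_kb(meminfo) > baseline_kb + tolerance_kb


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    baseline_kb = 0  # hypothetical base level for this host
    if swap_alert(info, baseline_kb):
        print(f"ALERT: {swap_used_kb(info)} kB of swap in use")
    else:
        print("swap usage at or below base level")
```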

Job done.

[1] https://www.kernel.org/doc/html/latest/mm/overcommit-accounting.html

[2] https://lampe2e.blogspot.com/2015/03/accurate-capacity-planning-with-apache.html

[3] https://www.baeldung.com/linux/memory-overcommitment-oom-killer

Tuesday 8 August 2023

Taming the vfs

Found an interesting tool for tracking/controlling the disk cache today - https://hoytech.com/vmtouch/

Apache listen backlog

tl;dr : ss -lti '( sport = :http )'

I've previously advised that keeping connections in the ListenBacklog is cheaper than keeping them in swap. But until Apache starts refusing connections, there's not much visibility of the state. Today, I found a great article by Ryan Frantz which takes a bit of a deeper dive into how this works and how to measure the usage.
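
For anyone wanting to graph this rather than eyeball it, here is a rough sketch that parses the ss output. It relies on the fact that, for sockets in the LISTEN state, Recv-Q shows the current accept queue length and Send-Q shows the configured backlog limit. The function name and the returned dictionary keys are my own, not part of any tool.

```python
import subprocess


def listen_queues(ss_output: str) -> list:
    """Extract accept-queue usage for each LISTEN socket in `ss -lt` output."""
    rows = []
    for line in ss_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue  # header line, or an extra detail line from -i
        i = fields.index("LISTEN")
        if len(fields) < i + 4:
            continue
        try:
            queued, limit = int(fields[i + 1]), int(fields[i + 2])
        except ValueError:
            continue
        rows.append({
            "local": fields[i + 3],   # local address:port
            "queued": queued,         # Recv-Q: connections waiting for accept()
            "limit": limit,           # Send-Q: backlog limit
            "fill": queued / limit if limit else 0.0,
        })
    return rows


if __name__ == "__main__":
    out = subprocess.run(["ss", "-lt", "( sport = :http )"],
                         capture_output=True, text=True, check=True).stdout
    for row in listen_queues(out):
        print(f"{row['local']}: {row['queued']}/{row['limit']} ({row['fill']:.0%})")
```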

Friday 14 October 2022

Gunicorn: a blast from the past

Nowadays I don't usually configure any swap on my servers. When a web application starts using swap it is already failing - it's not really providing any contingency / safety net. At least that's what I thought up until today.

I was looking at an ELK cluster built by one of my colleagues, where I saw this:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          63922       50151        5180           1        8590       12907
Swap:         16383       10933        5450

Hmmm. Lots of swap used, but lots of free memory. And it was staying like this.

Checking with vmstat, although there was a lot of stuff in swap, nothing was moving in and out of swap.
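
The same check can be scripted. The sketch below takes two samples of the pswpin and pswpout counters from /proc/vmstat (cumulative pages swapped in and out since boot) and turns them into rates. The function names and the five second interval are arbitrary choices of mine.

```python
import time


def swap_counters(vmstat_text: str) -> tuple:
    """Return (pswpin, pswpout) page counters from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_rates(before: tuple, after: tuple, interval_s: float) -> tuple:
    """Pages per second swapped (in, out) between two samples."""
    return ((after[0] - before[0]) / interval_s,
            (after[1] - before[1]) / interval_s)


def read_vmstat() -> str:
    with open("/proc/vmstat") as fh:
        return fh.read()


if __name__ == "__main__":
    interval_s = 5
    first = swap_counters(read_vmstat())
    time.sleep(interval_s)
    second = swap_counters(read_vmstat())
    rate_in, rate_out = swap_rates(first, second, interval_s)
    print(f"swap in: {rate_in:.1f} pages/s, swap out: {rate_out:.1f} pages/s")
```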

After checking the value for VmSwap in /proc/*/status, it was clear that the footprint in swap was made up entirely of gunicorn processes. Gunicorn, in case you hadn't heard of it, is a Python application server. The number of instances it runs is fixed and defined when the server is started. I've not seen a server like that in 20 years :).
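
Something like the following does the per-process check in one pass. It is a sketch rather than a polished tool: it reads the Name and VmSwap fields from each /proc/<pid>/status file and totals the swap footprint per process name. The function names are my own, and processes without a VmSwap line (kernel threads, for example) are counted as zero.

```python
import glob


def vmswap_kb(status_text: str) -> tuple:
    """Return (process name, VmSwap in kB) from /proc/<pid>/status text."""
    name, swap = None, 0
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmSwap:"):
            swap = int(line.split()[1])
    return name, swap


def swap_by_name(status_texts) -> list:
    """Total swap per process name, largest first."""
    totals = {}
    for text in status_texts:
        name, swap = vmswap_kb(text)
        if name is not None and swap:
            totals[name] = totals.get(name, 0) + swap
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def read_all_status():
    """Yield the contents of every readable /proc/<pid>/status file."""
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as fh:
                yield fh.read()
        except OSError:
            continue  # process exited while we were scanning


if __name__ == "__main__":
    for name, swap in swap_by_name(read_all_status()):
        print(f"{swap:>10} kB  {name}")
```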

On an event based server such as nginx or lighttpd, a new client connection just requires the server process to allocate RAM to handle the request.
With the pre-fork servers I am familiar with, the server will adjust the number of processes to cope with the level of demand within a defined range. Some, like Apache httpd and php-fpm, implement hysteresis - they spin up new instances faster than they reap idle ones - to better cope with spikes in demand.
Thread based servers are (in my experience) a halfway-house between the event based and (variable) pre-fork servers.
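
To illustrate what I mean by hysteresis in a variable pre-fork pool, here is a toy model. It is emphatically not the algorithm used by Apache httpd or php-fpm; every name and default value below is hypothetical. The point is only that the pool grows in larger steps than it shrinks.

```python
def next_worker_count(workers: int, busy: int,
                      min_spare: int = 2, max_spare: int = 6,
                      floor: int = 2, ceiling: int = 50,
                      spawn_step: int = 4, reap_step: int = 1) -> int:
    """One adjustment cycle of a toy pre-fork pool with hysteresis.

    Too few idle workers: add spawn_step at once (react fast to a spike).
    Too many idle workers: remove only reap_step (shrink slowly).
    """
    idle = workers - busy
    if idle < min_spare:
        workers += spawn_step
    elif idle > max_spare:
        workers -= reap_step
    return max(floor, min(ceiling, workers))


if __name__ == "__main__":
    workers = 4
    # Hypothetical load: a spike of busy workers, then quiet.
    for busy in (4, 8, 12, 12, 2, 2, 2, 2):
        workers = next_worker_count(workers, busy)
        print(f"busy={busy:2d} -> pool size {workers}")
```

A fixed-size pool like the gunicorn setup described above corresponds to floor and ceiling being equal, so the idle workers have nowhere to go but swap.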

While the kernel is doing the job of ensuring that these idle processes are not consuming resources which could be better used elsewhere, it is perhaps a little over-zealous here. It will be more expensive to recover these from swap than it would be to fork an instance. But changing to a variable number of processes is not really an option here. If I start seeing performance issues when this application comes under load I'll need to look at keeping these out of swap - which unfortunately comes at the cost of reducing available memory for the overnight batch processing handled on the cluster.

Thursday 6 August 2020

Web standards never sleep

As we wait for ratification/adoption of HTTP3, news arrives of a new image format promising a 50% reduction in size compared with JPEG: AVIF, which is built on the AV1 codec.

There's nothing more I can add to Daniel Aleksandersen's analysis published on ctrl.blog but to say that support is coming soon in Firefox and already available in Chrome.

When it will be available in Edge and Safari still seems a bit of a mystery - they took a lot longer to adopt webp, but since Microsoft's browser now uses a lot of Chrome under the hood, it might appear a lot quicker there.

libre-software.net has a test page with links to additional resources.

AV1 also does video.

Saturday 13 June 2020

HTTP2 and 421 errors

I am in the process of migrating an online B2B enterprise built on design patterns from a 1990s dial-up ISP to a more modern infrastructure. It has taken over 20 years of effort to build a network of services which are difficult to manage and insecure. Replacing this, with no downtime, is something of a challenge. A really important stepping stone to the target architecture is getting every exposed service routed via a proxy. This makes it much, MUCH simpler for us to:

(re)route services within our network
Upgrade services to HTTP2
Provision certificates / manage encryption
Configure browser-side caching
Analyse traffic
...and more

Running a handful of BigIp F5s is unfortunately not an option, so my switchboard is a stack of Ubuntu + nginx.

Up till recently the onboarding exercise had gone really well - but then I started encountering 421 errors. My initial reading kept leading me to old bugs in Chrome and other issues with HTTP2. These are neatly summarized by Kevin as follows:

This is caused by the following sequence of events:

The server and client both support and use HTTP/2.

The client requests a page at foo.example.com.

During TLS negotiation, the server presents a certificate which is valid for both foo.example.com and bar.example.com (and the client accepts it). This could be done with a wildcard certificate or a SAN certificate.

The client reuses the connection to make a request for bar.example.com.

The server is unable or unwilling to support cross-domain connection reuse (for example because you configured their SSL differently and Apache wants to force a TLS renegotiation), and serves HTTP 421.

The client does not automatically retry with a new connection (see for example Chrome bug #546991, now fixed). The relevant RfC says that the client MAY retry, not that it SHOULD or MUST. Failing to retry is not particularly user-friendly, but might be desirable for a debugging tool or HTTP library.

However, I was able to reproduce the bug in the first request from a browser instance I had just started. Also I was able to see the issue across sites using distinct certificates. I was confused.

Eventually I looked at the logs for the origin server (Apache 2.4.26). I hadn't considered that before as I knew it did not support HTTP2. But lo and behold, there in the logs, a 421 error against my request.

[Fri Jun 12 18:44:47.945706 2020] [ssl:error] [pid 21423:tid 140096556701440] AH02032: Hostname foo.example.net provided via SNI and hostname bar.example.com provided via HTTP have no compatible SSL setup
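
If you suspect the same thing is happening to you, a quick way to see which host pairs are affected is to pull the AH02032 entries out of the Apache error log. The sketch below is built around the message format shown in the log line above; the function name is mine and the log path is a placeholder, so adjust it to wherever your error log lives.

```python
import re
from collections import Counter

# Matches the AH02032 message as it appears in the log line above.
AH02032 = re.compile(
    r"AH02032: Hostname (?P<sni>\S+) provided via SNI "
    r"and hostname (?P<http>\S+) provided via HTTP"
)


def mismatched_hosts(log_text: str) -> Counter:
    """Count (SNI hostname, HTTP hostname) pairs reported by AH02032."""
    return Counter((m.group("sni"), m.group("http"))
                   for m in AH02032.finditer(log_text))


if __name__ == "__main__":
    log_path = "/var/log/apache2/error.log"  # placeholder path
    with open(log_path, errors="replace") as fh:
        for (sni, http), count in mismatched_hosts(fh.read()).most_common():
            print(f"{count:6d}  SNI={sni}  Host={http}")
```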

So disabling SSL session re-use on the connection from the nginx proxy to the origin resolved the issue:

proxy_ssl_session_reuse off;

This does mean slightly more overhead between the proxy and the origin, but since both are on the same LAN, it's not really noticeable.

Although it's something of a gray area, I don't think this is a bug with nginx - I had multiple sites in nginx pointing at the same backend URL. It would be interesting to check if the issue occurs when I have multiple unique DNS names for the origin - one for each nginx front end - if it still occurs there, then that is probably a bug.