Advanced Web Server Fail-Over & Load Balancing Check

Basic Fail-Over

By default, A-Team Systems' front end load balancers (using NginX) keep track of which back end servers are functioning by watching the HTTP status code each one returns for user-driven requests.  In general terms, if a back end server is returning 200 (OK) codes with its pages, NginX believes the peer is functional.  If the server returns a 500, 501, 502, or 503, NginX believes it is having problems, marks it down internally, and routes traffic to a different peer.

This works well in theory; in practice, however, several issues arise:

  • If a single broken page returns a 50x error, even if the rest of the site is functional, NginX will mark the peer as down and retry the request on another peer.  Because the page itself is broken, this will again return a 50x error and demote that peer.  This will continue until all back end servers are marked down and then all traffic will return an error to the user because NginX thinks everything is down.  In this way a single broken page or request can bring an entire site down accidentally.
  • The reverse can be true as well, where a few functioning pages (returning 200s) can make NginX think a back end peer is functioning when it really isn't.
  • With this built-in fail-over mechanism it is not possible to monitor what state NginX thinks the back end peers are in, which makes troubleshooting the above difficult when it occurs.
  • It relies on user requests to see if a back end server is up, so a failure is only discovered while a user is waiting for a response.
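For reference, the passive behaviour described above roughly corresponds to a stock NginX configuration like the sketch below.  This is an illustration only, not A-Team Systems' actual configuration: the upstream name and peer addresses are made up, and stock NginX has no `http_501` option for `proxy_next_upstream`, so only 500, 502 and 503 are listed.

```nginx
upstream backend {
    # Hypothetical peers.  max_fails=1 and fail_timeout=10s are the NginX defaults:
    # one failed attempt marks the peer as unavailable for 10 seconds.
    server 10.0.0.11:80 max_fails=1 fail_timeout=10s;
    server 10.0.0.12:80 max_fails=1 fail_timeout=10s;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        # Count these responses as failed attempts and retry on the next peer.
        # A page that always returns a 50x will therefore fail on every peer in turn.
        proxy_next_upstream error timeout http_500 http_502 http_503;
    }
}
```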

Advanced Fail-Over

To address these issues, A-Team Systems has worked with the makers of the nginx_upstream_check_module and the FreeBSD ports team to add it to the FreeBSD NginX port/package.  This in turn means it's available to our clients (and anyone else running FreeBSD).

This module addresses the above issues by:

  • Specifying a single URL to test which is used as the authoritative state of the peer instead of using any request.
  • Performing the peer testing every few seconds instead of waiting for a user to request a document (and delaying a response because a peer is down).
  • Tracking and reporting the state of the peers via a special status page which our monitoring automatically tracks.  If a peer is failing, even intermittently, we're able to see it.
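A minimal sketch of what this can look like with the nginx_upstream_check_module directives is shown below.  The peer addresses, the `/failover-check.php` path, the `/upstream_status` location and the timing values are assumptions chosen for illustration, not A-Team Systems' production settings.

```nginx
upstream backend {
    # Hypothetical peers
    server 10.0.0.11:80;
    server 10.0.0.12:80;

    # Probe every 3 seconds; 2 successes mark a peer up, 5 failures mark it down.
    check interval=3000 rise=2 fall=5 timeout=1000 type=http;

    # The single URL used as the authoritative state of the peer (assumed path).
    check_http_send "GET /failover-check.php HTTP/1.0\r\n\r\n";

    # Only a 2xx response counts as alive.
    check_http_expect_alive http_2xx;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
    }

    # Status page that monitoring can poll (assumed location name).
    location /upstream_status {
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
```

Because the probes run on a timer rather than on user requests, a single broken page no longer takes peers out of rotation; only the test URL does.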

Creating A Fail-Over Test Page

The upstream check module is very simple and uses the HTTP status code to see if a peer is up or down.  It does not have the ability to check the content of a page, so your test page must signal its state in one of two ways:

  • Return an HTTP 200 code if the peer is operational.
  • Return anything else (e.g., HTTP 503) if the peer is not functioning correctly.

Typically you'll want to include the following checks in your script:

  • Database connectivity checks (including MySQL, Redis, etc.).
  • Caching connectivity checks (e.g., Memcache if used).
  • Calls to error_log() to log failure conditions for troubleshooting.

Remember, only include checks for things the site absolutely requires in order to function: if one of them is unavailable to every peer at once, the entire site will go down (since NginX will think there are no operational peers).  As such, this is not the place for complete, nuanced monitoring; just check that the web server is functioning well enough to handle user requests at a basic level.
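As a rough sketch of such a script, the PHP below runs a list of checks and maps the result to a status code.  The helper name `health_status()`, the check name, and the MySQL host, credentials and database are all hypothetical; substitute whatever your site actually depends on.  It logs through error_log() only when a check fails, so healthy runs write nothing.

```php
<?php
// Sketch only: health_status() and the 'mysql' check below are hypothetical names.

// Runs each check in turn and returns the HTTP status code to send.
// A check is any callable returning true on success; exceptions count as failure.
function health_status(array $checks): int
{
    foreach ($checks as $name => $check) {
        try {
            $ok = (bool) $check();
        } catch (Throwable $e) {
            $ok = false;
        }
        if (!$ok) {
            // Log the failing check for troubleshooting, then stop early.
            error_log("fail-over check failed: $name");
            return 503;
        }
    }
    return 200;
}

// Only emit a response when served by a web server, not when run from the CLI.
if (PHP_SAPI !== 'cli') {
    $status = health_status([
        'mysql' => function () {
            // Assumed host, credentials and database name; replace with real values.
            $db = new mysqli('127.0.0.1', 'check_user', 'secret', 'app');
            if ($db->connect_errno) {
                return false;
            }
            return $db->query('SELECT 1') !== false;
        },
    ]);
    http_response_code($status);
    echo $status === 200 ? "OK\n" : "FAIL\n";
}
```

Keeping the decision in one small function makes it easy to add or remove a check later without touching the response handling.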

Things to avoid in your check script:

  • External curl() calls to remote servers.
  • API or other remote calls.  These are typically better error-handled on the requesting page (which will not disable the entire peer if it errors).
  • Anything that depends on a resource outside the server cluster (e.g., a 3rd party API/service, a remote server, etc.).
  • Anything load-intensive, as these checks run every two to three seconds on all peers.
  • Anything that takes more than half a second to complete.
  • Anything that locks a resource.
  • Anything that writes to a file, log file or database.

Returning a 503

With the above in mind, create your script as you see fit.  To return a 503 in PHP:

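// Detect the request protocol, falling back to HTTP/1.0 if it is not set
$protocol = "HTTP/1.0";
if ( isset($_SERVER["SERVER_PROTOCOL"]) && $_SERVER["SERVER_PROTOCOL"] === "HTTP/1.1" ) {
  $protocol = "HTTP/1.1";
}

// Return the error; the third argument forces the 503 response code
header("$protocol 503 Service Unavailable", true, 503);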