Short Version
-------------------------------------
The load balancer is up and running. Some people may have to wait a few hours for DNS to update completely, although this should affect less than 10% of users.
Long Version
--------------------------------------
A long time ago in a galaxy far far away, I set out on a journey to rid the world of dust mites... Hmm, that's not right. Now that I think about it, it was about 9 months ago in Philly, it was more of a project than a journey, and getting a load balancer has nothing to do with dust mites. Ah well, so much for a good intro.
Anyway, in the fall of last year I talked to Tom about getting a load balancer for the site. For those of you who aren't familiar with such a product, I'll give a brief explanation. When a website has multiple webservers, there are several ways to distribute the traffic, or load. The simplest and most basic way to do this is with "round robin DNS". In an RRDNS configuration, the IPs of several webservers are listed under the same DNS name. For example, previously you'd see something similar to the following:
# nslookup newgrounds.com
Name: newgrounds.com
Address: 127.53.32.232
Name: newgrounds.com
Address: 127.69.132.47
Name: newgrounds.com
Address: 127.231.39.49
Name: newgrounds.com
Address: 127.39.130.90
Name: newgrounds.com
Address: 127.74.38.203
Name: newgrounds.com
Address: 127.100.22.58
Name: newgrounds.com
Address: 127.240.28.22
Name: newgrounds.com
Address: 127.190.36.14
When your computer connected to Newgrounds, it would randomly select one of those IPs to connect to. As I said before, this solution is free and very easy to implement; however, it has two major problems. First, if any one of the servers goes down or gets taken down for maintenance, a certain portion of traffic (1/8 in this example) keeps going to a dead server, and updating the DNS records to remove that server can take hours. Second, anytime something is done "randomly", bad things can happen. In this example, every now and then too many people would "randomly" select the same address and overload the server. As a result, even though there was plenty of spare capacity on the other servers, whoever went to the overloaded one would receive an error. Obviously, this configuration is far from ideal.
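If you want to see that second problem for yourself, here's a minimal sketch (not our actual setup; the IPs, capacity, and request counts are made-up numbers) that simulates clients "randomly" picking one of eight A records:

```python
# Simulate RRDNS: each client picks a random A record, so one server
# can end up overloaded even when total capacity is plenty.
# All numbers here are illustrative assumptions, not real traffic data.
import random
from collections import Counter

random.seed(42)

SERVERS = [f"127.0.0.{i}" for i in range(1, 9)]  # 8 hypothetical webserver IPs
CAPACITY = 135    # requests each server can handle per interval (assumed)
REQUESTS = 1000   # ~125 per server if the load were spread evenly

# Each client "randomly" selects one of the A records, as described above.
hits = Counter(random.choice(SERVERS) for _ in range(REQUESTS))

for ip, count in sorted(hits.items()):
    status = "OVERLOADED" if count > CAPACITY else "ok"
    print(f"{ip}: {count} requests ({status})")
```

Run it a few times with different seeds and you'll see the counts swing well above and below the even split, which is exactly how a server with "plenty of spare capacity" elsewhere still ends up throwing errors.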
The alternative to RRDNS is to use a load balancer. What the load balancer does is direct traffic to the appropriate server behind it. So, instead of having multiple IPs allowing people to connect directly to the webservers, you have a single IP which routes traffic to the least loaded server. This configuration has only one major problem: it's basically a single point of failure. That being said, it's generally more reliable than any software-based server and is easily replaced if something bad were to happen, and since it eliminates both of the problems associated with RRDNS configurations, the overall result is a much more robust and capable website. Clearly, this is the way to go.
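To make the idea concrete, here's a rough sketch of the "least loaded" logic in Python. This is my own toy illustration, not the appliance's actual code, and the class, method names, and IPs are all made up:

```python
# Toy least-connections balancer: every request goes to whichever healthy
# backend has the fewest active connections, and downed backends are
# skipped immediately -- the two fixes over RRDNS described above.
class LeastLoadedBalancer:
    def __init__(self, backends):
        self.active = {ip: 0 for ip in backends}     # open connections per backend
        self.alive = {ip: True for ip in backends}   # health status per backend

    def pick(self):
        candidates = [ip for ip in self.active if self.alive[ip]]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return min(candidates, key=lambda ip: self.active[ip])

    def start_request(self):
        ip = self.pick()
        self.active[ip] += 1
        return ip

    def finish_request(self, ip):
        self.active[ip] -= 1

    def mark_down(self, ip):
        # e.g. a failed health check; traffic shifts instantly instead of
        # waiting hours for cached DNS records to expire
        self.alive[ip] = False

lb = LeastLoadedBalancer(["127.0.0.1", "127.0.0.2", "127.0.0.3"])
first = lb.start_request()
lb.mark_down(first)
second = lb.start_request()  # never routed to the downed server
print(first, second)
```

A real appliance does this in hardware with health checks and connection tracking, but the routing decision boils down to the same idea: one public IP in front, and the box behind it always picks a live, lightly loaded server.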
The idea was agreed upon and the wheels were set in motion. I've already written too much, so I'll try to summarize the next 9 months with an analogy. Think of the project as a program where at the end is an "install Load Balancer" command with an endless list of nested if/then/else statements before it. Now, imagine if all of those if/then statements resolved to "true". For example, if the network switch is out of ports, replace the network switch. Or, if the fileserver is overloaded, replace the fileserver. Or my personal favorite: if the salesman misspoke about the capabilities of the initial load balancer, spend months returning the original and re-planning/re-ordering for the better model.
So here we are in the spring of 2005 and I'm finally getting this thing up and running. I'm sorry it's taken so long, and I hope that this eliminates many of the connection problems the user community has had over the years. This was a major step, but don't think we're done improving the site; it's constantly evolving (ordering up a 64-bit DB server very soon). You may still have problems getting to the site from time to time, although these should be at off-peak times and related to maintenance rather than to an overloaded server. Now we just have to wait for the next Numa Numa to come along for a real test :)
Tim out