|What a week.
If you're wondering what happened last week here's a quick rundown:
What we've learned:
- Problem: Failed DNS server and a failure of our backup service to take over as we had anticipated.
Solution: David has worked through everything with our DNS providers and sorted everything out, as well as adding a third level of redundancy courtesy of Albert Pascual.
- Problem: Degraded internet connection feed into one of our matched firewalls. This is out of our control and should never, in theory, happen. But Murphy rules. We have redundancy in our firewall setup to handle just such a case, but the redundancy depends on a connection failing, not simply degrading. In the process of diagnosing this David has discovered that our painfully expensive firewalls may not be able to handle as much load as we would like.
Solution: We're working with our provider and gritting our teeth and pricing replacements for our firewalls.
- Possible problem: Potential Windows Service Pack issues. Some of the problems coincided with a set of recent Windows updates being installed.
Solution: We've uninstalled these which seemed to speed things back to normal but we won't know until next high-tide, probably Monday morning PST.
- Problem: A new bottleneck. We increased the capacity of the database servers to the point where the webservers were no longer waiting on them. This allowed the webservers to push through enough pages to hit their CPU limit. Unfortunately when the servers max out they automatically cycle, which redirects load to another server, causing a chain reaction.
Solution: Adding more servers, relaxing the cycling rules and, of course, optimising code are all being done to alleviate this.
- Problem: Web server hardware failure. A couple of the servers simply rolled over and died. We've grown the site from home-built white boxes and are rapidly replacing everything with HP and Dell servers. Building machines to specific specs allows you versatility and massive savings, but there is a point where speccing, building, testing and maintanence outweigh the costs of buying something built and tested for a specific purpose.
Solution: New brand-name, warranty'd servers. HP servers are butt-ugly, but they do the job.
- Problem: Load balancing issue. I can't, for the life of me, understand why load balancing is such a hassle. Windows load balancing works OK - not great, but OK. Until you try and mix Windows 2000 and Windows 2003 boxes: they don't like to play together. So we've used our firewalls for load balancing and no matter what setting we use - random, round robbin, weighted-least-connections - we get clumping whereby one server gets too many connections while others are starved. This in itself is not the main problem. The main problem is IP affinity, whereby a user is stuck on a single server instead of being shunted around the server farm. So, if you are stuck on a fast server it's great. However, if you're stuck on a slow server then no matter what you do your in for a slow ride. Refresh the page, close your browser, use a different browser - it doesn't matter. You'll always end up on that server until it cycles.
Solution: We've disabled hardware load balancing, standardised on a single platform and reenabled Windows load balancing.
- When it rains it pours
- If you are told "this can't happen", it will. Murphy's Law rules. Keep an eye on services that you aren't in control of and make sure you talk to your providers about how, exactly, they handle failure events. You might get scary answers that challenge your assumptions.
- Hardware fails
- Test service packs
- There is a huge opportunity for someone to come up with a simple, integrated web farm management package that handles monitoring, cycling and load balancing.
CodeProject.com : C++ MVP