If you're wondering what happened last week here's a quick rundown:
Problem: Failed DNS server and a failure of our backup service to take over as we had anticipated.
Solution: David has worked through everything with our DNS providers and sorted everything out, as well as adding a third level of redundancy courtesy of Albert Pascual.
Problem: Degraded internet connection feed into one of our matched firewalls. This is out of our control and should never, in theory, happen. But Murphy rules. We have redundancy in our firewall setup to handle just such a case, but the redundancy depends on a connection failing, not simply degrading. In the process of diagnosing this David has discovered that our painfully expensive firewalls may not be able to handle as much load as we would like.
Solution: We're working with our provider and gritting our teeth and pricing replacements for our firewalls.
Possible problem: Potential Windows Service Pack issues. Some of the problems coincided with a set of recent Windows updates being installed.
Solution: We've uninstalled these which seemed to speed things back to normal but we won't know until next high-tide, probably Monday morning PST.
Problem: A new bottleneck. We increased the capacity of the database servers to the point where the webservers were no longer waiting on them. This allowed the webservers to push through enough pages to hit their CPU limit. Unfortunately when the servers max out they automatically cycle, which redirects load to another server, causing a chain reaction.
Solution: Adding more servers, relaxing the cycling rules and, of course, optimising code are all being done to alleviate this.
Problem: Web server hardware failure. A couple of the servers simply rolled over and died. We've grown the site from home-built white boxes and are rapidly replacing everything with HP and Dell servers. Building machines to specific specs allows you versatility and massive savings, but there is a point where speccing, building, testing and maintanence outweigh the costs of buying something built and tested for a specific purpose.
Solution: New brand-name, warranty'd servers. HP servers are butt-ugly, but they do the job.
Problem: Load balancing issue. I can't, for the life of me, understand why load balancing is such a hassle. Windows load balancing works OK - not great, but OK. Until you try and mix Windows 2000 and Windows 2003 boxes: they don't like to play together. So we've used our firewalls for load balancing and no matter what setting we use - random, round robbin, weighted-least-connections - we get clumping whereby one server gets too many connections while others are starved. This in itself is not the main problem. The main problem is IP affinity, whereby a user is stuck on a single server instead of being shunted around the server farm. So, if you are stuck on a fast server it's great. However, if you're stuck on a slow server then no matter what you do your in for a slow ride. Refresh the page, close your browser, use a different browser - it doesn't matter. You'll always end up on that server until it cycles.
Solution: We've disabled hardware load balancing, standardised on a single platform and reenabled Windows load balancing.
What we've learned:
When it rains it pours
If you are told "this can't happen", it will. Murphy's Law rules. Keep an eye on services that you aren't in control of and make sure you talk to your providers about how, exactly, they handle failure events. You might get scary answers that challenge your assumptions.
Test service packs
There is a huge opportunity for someone to come up with a simple, integrated web farm management package that handles monitoring, cycling and load balancing.
As many are aware we've had a run of bad luck with some of the servers. Frequent "500" server errors which cause the blood pressure to rise. A 500 error can be raised either by a bug in the code behind a site, or by IIS / ASP.X crashing. The ones that we've been experiencing are mostly the later: ASP falls over and can't get back up and from then on all it will serve are 500 errors.
The question is: why? Why would ASP fall over? Partly because it's not robust, sometimes because the load on it is so high it simply can't take it, and sometimes, as in our case, because of faulty hardware.
So in answer to "why were we having so many 500's last week" the answer is "because our hardware failed". A HDD was failing intermittently on a single server and that was enough to cause pain for a percentage of our page views. It failed enough to cause problems, but not consistently enough for for the red flags to be raised. In hindsight we should have stopped combing through code, IIS settings, and load balancing to try and find the problem and checked the hardware (since all machines have the same software setup) but as we all know there's nothing quite as accurate as hindsight.
Upgrading to bigger, better, "more stable" systems should be easy. It should be something that, once the work of physically making the changes has been done, just works. After all, the new systems are, by definition, bigger, better, and more stable.
But life isn't like that.
We're in the process of upgrading our servers from Windows 2000 to Windows 2003. Yes, it would have been nice to have done this move sooner but a number of issues have delayed us, including cost of upgrading hardware, cost of new licences, and the most finicky of them all: problems with Windows 2003.
Not problems as in "Windows 2003 doesn't work". Windows 2003 is more stable, IIS6 is way better, and security is tighter. Unfortunately code and components that had run fine in 2000 can fail in 2003. One of the issues we had was an ISAPI filter that died because of the changes made to the way websites are identified within IIS6. Another issue we had was some .NET 1.1 code that worked as smooth as the proverbial baby's bottom on 2000 suddenly rolled over and died in 2003. Not immediately, not with any useful error messages - it just sent IIS into a spin and bought down the site.
So we're being a little tentative with our migration plans. Where we can we will move away from ISAPI filters and use HttpModules. We've abandoned the tried and true .NET 1.1 and are doing all development in .NET 2.0 beta 2 (if tried and true doesn't work then why not just go crazy and use bleeding edge technology, right?). We are moving one server at a time over to the new OS and codebase and some machines are happy, some are not. The ones who aren't typically are older and so are being retired and new machines purchased to replace them.
So today sees the deployment of the first batch of new 2.0 beta 2 code. There's a chance you've already unwittingly stress tested it for us (though you'd be hard pressed to spot it). If things continue as well as they have so far then we'll move over the rest, piece by piece.
So why the new features? Why upgrade old code instead of moving forward with fresh infrastructure?
Because it's fun.
Actually the answer is a little more complicated than that. We are continually planning and introducing upgrades to both software and hardware to keep up with the increased load. As we upgrade one piece, another piece becomes a bottleneck, so our planning usually runs along the lines of "what's the minimum that will make worthwhile a difference, and what other pieces will be affected".
The current series of upgrades will involve some fairly dramatic purchases and will also involve partnering with other companies in order to bring in specific expertise. In order to do this, though, there's the age old problem of finding a credit card big enough to lump it all on, so much of the work in the past few months has been boring, painful but critical financial planning and costing.
The upgrades come in several parts: hardware - big, loud, humming lumps of beige; software - in the form of upgrading OS infrastructure and the code base; and appliances - dedicated boxes that will provide specific services such as caching.
The planning of the hardware and appliances are pretty much done. It's just a matter of manpower and dollars to get it all implemented. The software side of things is a little different.
A lot of work has been done in investigating different methodologies, different frameworks, different features. My TODO list not only lists over 150 Things That Must Be Done, but there are also around 70 Things I Would Love To Do wishlist items.
Planning, testing, thinking and designing are important, but eventually you just want a taste of how things could be. To that end, some of the ideas that I've been thinking about for a while such as the changes to the discussion boards have been retro-fitted to the old code as part of a working prototype. It's far more fun to test on a live audience and get valuable feedback now than it is to wait till it's all done and unveil in one big lump.
Send me feedback, send me ideas, and don't be shy if you hate the changes. I might not agree, but I always listen.
So again we find ourselves in the hosting facility, screwdriver and large mallet in hand. Usually when we get an alert that the site is unresponsive it's due to something simple: a server having an aneurism, a switch collapsing due to fatigue, the Province of Ontario having a blackout. Today we rock up to the hosting centre wondering what the God of Intermittent Network Connections has dealt us today and we find half the flooring of the centre ripped up, cabinets open with miles of wires spilling out, and lots and lots of keen but slightly haggard network admins with crimping pliers running around doing...stuff. We're not sure what. We just know that for the last 15 minutes no one can get to the site.
The problem, it seems, isn't them (Wide eyed innocent looks all round, pliers hidden behind backs). But it's not us either. We can ping outside our cage into their equipment. They can go from their equipment to the outside world. But somewhere there's a break and no one can find it.
And then it fixes itself. But we can't see that it's fixed itself because our KVM switch (that thing that allows one keyboard/video monitor/mouse to hookup to multiple machines) has quietly expired.
So everything is again working perfectly. Redundant power, redundant firewalls, redundant connections, redundant servers. Not that we can actually see anything happening while in the cage.
Over the past few weeks performance has been the major focus of attention. The work was two-fold: upgrading hardware to ensure we're not limited in that regard, and reworking the code and database to ensure we use that hardware with the greatest efficiency. Our operations guy slipped in a nice new shiny database server that should have enough capacity to see us good for a while. The code upgrades were a bit more involved.
I've received a lot of helpful suggestions from many readers on their experiences, the things that worked for them and major gotchas. Sensible things like removing unnecessary queries, caching as much as possible, using firehose cursors, providing lock hints to minimise locking issues and making sure indices are correct. I would like to give special thanks to Andy Brummer for taking the time to analyse our SQL profile traces and make recommendations. This has been an on-going iterative process and the results have been impressive.
The initial step was to perform a SQL trace to see what queries were the most expensive and which could be reworked or removed. It may seem obvious but there's no use in removing or reworking something if it's not a problem. Likewise, if we find a query that is a big problem but an absolute necessity then at least we have a starting point from which to measure subsequent improvements.
After our initial tests I reworked the infrastructure as follows:
Removal of features that are high load
Viewing the messages posted by a given member, the birthday list, the number of new messages in each forum since last visit, the 'last page' button in the forums (fora?*), and removal of the link to the members list page in the lounge all contributed to a reduction in load. Some of these features were of interest to only a very few people so I felt justified in nuking them. Other features, such as the number of messages since the last visit are useful, but expensive. In the end a simple compromise was made whereby fora with new messages are listed in bold. It's not so much how many new messages have been posted, but whether that number is greater than zero.
As well, removing links to high load features discourages abuse of these features and reduces the number of site crawlers that inadvertantly hit a page. As a resource site, we are spidered and indexed frequently.
Each forum can now have its top N messages cached at the webserver level, and I can turn on and off caching for individual fora and change the number of messages cached on a per-forum level. Caching only works when the user has selected 'messages in the last 3 months' (since setting the date to anything else changes the messages to display, and the caching isn't smart enough for that). This setting is the default setting so spiders and crawlers will experience (and contribute) little load.
The caching is a generic webclass and as such it can be applied to other features. The number of members and number of articles on the homepage have also been cached alongside the other homepage items.
Lock hints were added to the article search query. Most of the data we need to display is not time-critical so if it's displayed 10 seconds out of date no one will notice. By adding local hints to queries we were able to reduce the number of lock waits.
At this point we had a nice boost. Many of the worst offenders had simply been removed. We had essentially torn out the bigger weeds from the garden and could now concentrate on more insidious creepy-crawlies. More traces were done and the results used to fine-tune:
A server-side cursor was hiding in the search page. This was converted to firehose and resulted in a a nice boost. Actually not so much a boost, but the removal of a slow point.
Important indices on many tables were lacking, even after careful and repeated use of the SQL Server Index Tuner. This has significantly reduced the number of massive table scans and so has reduced load spikes.
Reworking queries and adding indices go hand in hand. In my soul I'm a C++ programmer and sometimes forget that I'm not programming against a chip with a math-coprocessor, but against a SQL interpreter that has to first seek out the data on which to operate before doing the actual operations. A great example was a search for a specific forum type. ForumType is a bitmask and I wanted forum types 16 and 32. Hence:
SELECT * FROM Forums WHERE (ForumType & 48) > 0
However, this means every row in the Forums table needs to have the (ForumType & 48) operation applied, meaning a full table scan. Replace this with
SELECT * FROM Forums WHERE (ForumType & 48) > 0 and (ForumType >= 16)
And now you can take advantage of an index on ForumType and reduce the number of rows to which the (ForumType & 48) need be applied. This can result in a significant performance improvement.
Still to do are a number of important steps:
Reduce page and site features based on the visitor. If it's a search engine spider then expose the articles and information in a simple manner that reduces database load
Improve spider and stripper recognition. Recognised site strippers are already banished but most pretend they are IE and slip through.
Rework the table structure to minimise joins. This will involve careful denormalisation of some tables, while reducing flexibility in others. The point being that the flexibilty built into the current system isn't being used so it won't be missed, nor will the current performance hit caused by the current schema.
Reduce the physical size of tables by archiving old data or vertically partioning data into separate tables.
Offload some of the logging work to a separate database server to free up the database server for more immediate tasks.
*Yes, this is the correct pluralisation. Sounds a little weird, but who am I to argue
Chris Maunder wrote: SELECT * FROM Forums WHERE (ForumType & 48) > 0 and (ForumType >= 16)
This is why i'm sooo bad at SQL...
Chris Maunder wrote: # Improve spider and stripper recognition. Recognised site strippers are already banished but most pretend they are IE and slip through.
So... just out of curiousity... what kind of behavior is acceptable, or more to the point, what kind of behavior is likely to get banned? I'm guessing (based on experience) that reloading The Lounge every 3-5 minutes is ok... but is a script that pulls it down every 10 seconds pushing it? How about every 10 minutes? Is this already documented somewhere in big, bold, red text? And, not that i'm planning on wallpapering my house with an entire printout of CodeProject or anything like that, but which pages would you especially rather not be hammered on?
I'm not the Jack of Diamonds... I'm not the six of spades.
I don't know what you thought; I'm not your astronaut...
I have some questions and suggestions, just ignore them if you have already herd them.
I was just wondering, what version of SQL you are running? I assume it is MSSQL 2000. If that is the case, I strongly suggest you take a look at MSSQL 2005. It has transactional based replication that can be used with load balancing. As you probably know, this will allow you to just drop in another SQL server to increase performance. This replication is a lot better then 2000s. While I do not work on the DB end personally, I know we have had a lot of success with 2005 so far. It is a very stable Beta. Also, have you considered moving all [downloadable] files to a separate server?
We will absolutely move to SQL 2005 as soon as Microsoft releases it to manufacture. I didn't realise it handled replication with load balancing - finally! But above and beyond that there are general perf increases we'll get for free, and features such as being able to specify the first and last row to retrieve that will make code logic cleaner and reduce data crunching and transfer.
We've been looking at how to open up bandwidth for downloads while ensuring we don't overload the site itself. One thing at a time.
Thanks Chris and to the rest of the CP team, for the continual performance improvments. I haven't contributed much here yet, but I do know I'd go nuts if I couldn't relax in the lounge after [or during] a tough day at the grindstone.
After what seems like weeks of testing, meetings, research, fire-fighting, brawls, quick fixes, and major reworkings we're getting to the business end of site upgrades.
We made the decision to stop trying to keep up with site expansion with incremental upgrades and instead let things ride for a few months while we undertook a major overhaul of our systems. This includes:
1. Upgrading all servers to be running the most efficient platform (in this case Win 2003)
2. Optimising load balancing, session management and network topology
3. Upgrading the physical hardware where necessary
4. Optimising the database schema and queries
5. Moving the ASP legacy code to ASP.NET
6. Ensuring that application logic is efficient
7. Testing. Go to 6.
Before you do anything you need to know where you are. Test. You may find something so blindingly obvious, so simple, that a month of upgrades turns into a day of tweaking or bug fixing.
Step 1 is important, but high load sites ran on Win NT and Windows 2000 for years. Windows 2003 does make things better in terms of process recycling and HTTP compression but we have our own in-house health monitoring processes that can kick a system when its failing, and we use Port 80 Software's HttpZip for compression.
Step 2 is a matter of juggling: you want load to be spread across server but you need to ensure session state is maintained and that in doing so you don't introduce unnecessary chatter across the network. We used to use Windows load balancing and enforce IP affinity so that when a user hit the site they were guaranteed to always hit the same server in any given session. This meant Session state could be managed using the ASP session object. In practice, however, we saw that users tended to cluster on one or two servers leaving the remaining running under capacity. We turned off IP affinity so that requests were spread across different servers within a single session, and rewrote the session handling to use SQL server. With a corresponding session object for ASP.NET we can now work concurrently with the ASP state, meaning we can mix and match ASP and ASP.NET pages with zero fuss. (remind me to post the classes as an article).
The next step is to remove WLBS from the equation and use our firewall for load balancing, freeing up network chatter. Once that is done then step 5 (moving to ASP.NET) will allow us to move seamlessly over to ASP.NET session state removing SQL load and giving a useful perf boost.
Step 3, upgrading the hardware, is a constant battle. We want to scale out but have resigned ourselves to scaling up as much as practical and then relying on the remaining steps to see us through. Hardware is cheap, but it's not a simple equation. There's definitely a sweet spot where two boxes are cheaper and more powerful than 1 single box. But that's double the cost of licences plus man hours in reworking your system to support clustering or replication or partitioning.
For our part we're looking at good, solid hardware for the SQL box. 64 bit processrrs so we can take advantage of new technology, fast SCSI drives with separation of data and logs, and lots of RAM. But again, RAM is cheap. A version SQL server that can handle lots of RAM isn't.
Step 4 is, to me, the second most important step. If your data is stored inefficiently, or if you ask the server for more than you need, or ask in a circuotous round-about fashion that just bogs the server down with needless work then everything's slow. Allow the database to find its data properly, ask it to return as little data as possible, take advantage of connection pooling and keep an eye on locks. And: don't get hung about normalising everything.
Let me give you an example: We have a messages table and a membership table. The messages table has a field that holds the member's ID, and the member's table holds the name and email of the member. To get messages we could do this:
SELECT Messages.Subject, Members.Name <br />
FROM Messages <br />
INNER JOIN Members ON Members.ID = Messages.MemberID
But we have 1.6 million members, and over a million messages posted. That's a big, bad join. So instead we denormalise and store the Member's name, at the time of posting, in the messages table to get:
SELECT Messages.Subject, Messages.Name FROM Messages
The data is old, but it doesn't matter. At the very worst we can have a background process go through and update the Messages.Name field based on changes to Member.Name.
The golden rule is that data is always old. Denormalise and cache where you can get away with it.
Step 5, upgrading to ASP.NET, is more for the convenience of being able to plug in new features easier, as well as getting the benefits of ADO.NET over ADO. It also allows us to rework many of the basic algorithms we're using in order to use and access data more efficiently and better manage caching.
ASP.NET runs faster than ASP, but we're not webserver limited, we're database limited.
It's step 6 that is the most important. It doesn't matter how fast your hardware is, or how clean your database schema, or how much optimisation you do. If, fundamentally, you're approaching the problem from the wrong angle then this could overshadow all other efficiencies made, time spent and money burned.
The best way to make a database run faster is to stop asking it dumb questions. Do you really need all that data? Do you really need to expose functionality that may be nice, but is horrendously expensive? Do you really need to keep asking for the same data or can you cache it on the webserver? Do you need to get data one piece at a time or can you save up your queries and get the response back in one hit?
A great example of the benefits of this is caching for the forums. We estimate the read to write ration in the forums to be around 100:1 to 1000:1, so over the last 2 days I implemented quick n' easy ASP caching system (again, remind me to post this) that allows a forum to cache the first N messages, only clearing this cache after 10 minutes or after a post is added or deleted. Previously at extremely high load it could take over 10 seconds (or result in a timeout) to view the first page of a forum. With the change it now took 0.1 seconds.
Make sure the manner in which you approach a problem is sensible and that you haven't overlooked something obvious. Make sure you allow the systems handling your applications to have the best chance possible at running efficiently. Where possible and practical, take advantage of systems such as inbuilt connection pooling, efficient load balancing, session state, compression, web caching (both client and server), data caching (both web server and database server).
If you do this then you're most of the way there. You can get easy perf boosts if you have the opportunity to upgrade server hardware and use the latest server software. Once you've done the obvious then test and profile and then dig in and attack the bits that are broken.
Chris Maunder wrote: The golden rule is that data is always old.
It generally takes people a very long time to realise this. A lot of the features of classic ADO, that people complain are missing in ADO.NET, such as client-side cursors and dynamic cursors are attempts to get 'the latest' data. Unfortunately the new data isn't what was there when you performed the query, so you can sometimes get results that actually don't match your criteria.
Data is old. The challenge is to decide how old data is allowed to be. If you can cache data and batch up requests to refresh the cache, you can generally improve your query performance.
Not a lot. Well, not a lot that you would be able to tell.
Lots of profiling, combined with lots of wailing and gnashing of teeth over perf issues.
Lots of navel gazing and planning for next year
Lots of boring business type stuff like arguing with billing agencies, doing budgets and forecasts, timelines, apportioning roles and responsibilities and gazing longingly down the TODO list at some really cool things I want to do
Fighting with source control. This will be the death of me
Lots of documentation. LOTS. You only realise how much you've neglected something when you confront it and start dealing with it. Code documentation, planning documentation, reviews, responses, research. Phew.
Legal. A topic I've had a lot of email about lately and hopefully one I can put to rest very soon. Learning legalese is like learning a really annoying language that has no macros, no shortcuts, and no compiler error checking.
Backend utilities to help us do things better.
Some work with Microsoft. It's a bit of a tug of war, and long term, but we're hoping the end results will be worthwhile for everyone.