Thanks, Chris, and the rest of the CP team, for the continual performance improvements. I haven't contributed much here yet, but I do know I'd go nuts if I couldn't relax in the lounge after [or during] a tough day at the grindstone.
After what seems like weeks of testing, meetings, research, fire-fighting, brawls, quick fixes, and major reworkings we're getting to the business end of site upgrades.
We made the decision to stop trying to keep up with site expansion with incremental upgrades and instead let things ride for a few months while we undertook a major overhaul of our systems. This includes:
1. Upgrading all servers to be running the most efficient platform (in this case Win 2003)
2. Optimising load balancing, session management and network topology
3. Upgrading the physical hardware where necessary
4. Optimising the database schema and queries
5. Moving the ASP legacy code to ASP.NET
6. Ensuring that application logic is efficient
7. Testing. Go to 6.
Before you do anything you need to know where you are. Test. You may find something so blindingly obvious, so simple, that a month of upgrades turns into a day of tweaking or bug fixing.
Step 1 is important, but high load sites ran on Win NT and Windows 2000 for years. Windows 2003 does make things better in terms of process recycling and HTTP compression, but we have our own in-house health monitoring processes that can kick a system when it's failing, and we use Port 80 Software's HttpZip for compression.
Step 2 is a matter of juggling: you want load to be spread across servers, but you need to ensure session state is maintained and that in doing so you don't introduce unnecessary chatter across the network. We used to use Windows load balancing and enforce IP affinity, so that when a user hit the site they were guaranteed to always hit the same server in any given session. This meant session state could be managed using the ASP Session object. In practice, however, we saw that users tended to cluster on one or two servers, leaving the remaining ones running under capacity. We turned off IP affinity so that requests were spread across different servers within a single session, and rewrote the session handling to use SQL Server. With a corresponding session object for ASP.NET we can now work concurrently with the ASP state, meaning we can mix and match ASP and ASP.NET pages with zero fuss. (Remind me to post the classes as an article.)
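The actual classes are ASP/ASP.NET, but the core idea of a database-backed session store that any server in the farm can hit is simple to sketch. Here's an illustrative Python version with SQLite standing in for SQL Server (table and function names are my own invention, not the classes mentioned above):

```python
import json
import sqlite3
import time

# In-memory SQLite stands in for the shared SQL Server instance.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Sessions (
    SessionID TEXT PRIMARY KEY,  -- cookie value shared by ASP and ASP.NET
    Data      TEXT,              -- serialised name/value pairs
    Touched   REAL               -- last-access time, for expiry sweeps
)""")

def session_set(session_id, values):
    # Any web server in the farm can write; no IP affinity needed.
    db.execute("REPLACE INTO Sessions VALUES (?, ?, ?)",
               (session_id, json.dumps(values), time.time()))

def session_get(session_id):
    row = db.execute("SELECT Data FROM Sessions WHERE SessionID = ?",
                     (session_id,)).fetchone()
    return json.loads(row[0]) if row else {}

# A request on server A writes; a later request on server B reads.
session_set("abc123", {"MemberID": 42, "Theme": "green"})
print(session_get("abc123")["MemberID"])
```

Because the state lives in the database rather than in either web framework's in-process session object, ASP and ASP.NET pages can share it freely.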
The next step is to remove WLBS from the equation and use our firewall for load balancing, cutting down on network chatter. Once that is done then step 5 (moving to ASP.NET) will allow us to move seamlessly over to ASP.NET session state, removing SQL load and giving a useful perf boost.
Step 3, upgrading the hardware, is a constant battle. We want to scale out but have resigned ourselves to scaling up as much as practical and then relying on the remaining steps to see us through. Hardware is cheap, but it's not a simple equation. There's definitely a sweet spot where two boxes are cheaper and more powerful than a single box. But that's double the cost of licences, plus man hours in reworking your system to support clustering or replication or partitioning.
For our part we're looking at good, solid hardware for the SQL box: 64-bit processors so we can take advantage of new technology, fast SCSI drives with separation of data and logs, and lots of RAM. But again, RAM is cheap. A version of SQL Server that can handle lots of RAM isn't.
Step 4 is, to me, the second most important step. If your data is stored inefficiently, or if you ask the server for more than you need, or ask in a circuitous, round-about fashion that just bogs the server down with needless work, then everything's slow. Allow the database to find its data properly, ask it to return as little data as possible, take advantage of connection pooling and keep an eye on locks. And: don't get hung up on normalising everything.
Let me give you an example: we have a messages table and a members table. The messages table has a field that holds the member's ID, and the members table holds the name and email of the member. To get messages we could do this:
SELECT Messages.Subject, Members.Name
FROM Messages
INNER JOIN Members ON Members.ID = Messages.MemberID
But we have 1.6 million members, and over a million messages posted. That's a big, bad join. So instead we denormalise and store the Member's name, at the time of posting, in the messages table to get:
SELECT Messages.Subject, Messages.Name FROM Messages
The data is old, but it doesn't matter. At the very worst we can have a background process go through and update the Messages.Name field based on changes to Members.Name.
The golden rule is that data is always old. Denormalise and cache where you can get away with it.
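The denormalised layout and the background catch-up pass can be sketched end to end. This is an illustrative Python/SQLite toy (the Messages/Members table names come from the example above; the data and the rest are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Members  (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Messages (ID INTEGER PRIMARY KEY, MemberID INTEGER,
                           Subject TEXT, Name TEXT);  -- denormalised copy
    INSERT INTO Members  VALUES (1, 'Chris');
    INSERT INTO Messages VALUES (10, 1, 'Perf tips', 'Chris');
""")

# Fast path: reading messages touches one table, no join.
rows = db.execute("SELECT Subject, Name FROM Messages").fetchall()

# A member renames themselves; the denormalised copy is now stale...
db.execute("UPDATE Members SET Name = 'Christian' WHERE ID = 1")

# ...until the background process runs and syncs the stale rows.
db.execute("""UPDATE Messages
              SET Name = (SELECT Name FROM Members
                          WHERE Members.ID = Messages.MemberID)
              WHERE Name <> (SELECT Name FROM Members
                             WHERE Members.ID = Messages.MemberID)""")
print(db.execute("SELECT Name FROM Messages WHERE ID = 10").fetchone()[0])
```

Every read between the rename and the background pass sees the old name, and that's the trade: millions of cheap reads in exchange for briefly stale display data.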
Step 5, upgrading to ASP.NET, is more for the convenience of being able to plug in new features more easily, as well as getting the benefits of ADO.NET over ADO. It also allows us to rework many of the basic algorithms we're using in order to use and access data more efficiently and better manage caching.
ASP.NET runs faster than ASP, but we're not webserver limited, we're database limited.
It's step 6 that is the most important. It doesn't matter how fast your hardware is, or how clean your database schema, or how much optimisation you do. If, fundamentally, you're approaching the problem from the wrong angle then this could overshadow all other efficiencies made, time spent and money burned.
The best way to make a database run faster is to stop asking it dumb questions. Do you really need all that data? Do you really need to expose functionality that may be nice, but is horrendously expensive? Do you really need to keep asking for the same data or can you cache it on the webserver? Do you need to get data one piece at a time or can you save up your queries and get the response back in one hit?
A great example of the benefits of this is caching for the forums. We estimate the read-to-write ratio in the forums to be around 100:1 to 1000:1, so over the last 2 days I implemented a quick n' easy ASP caching system (again, remind me to post this) that allows a forum to cache the first N messages, only clearing this cache after 10 minutes or after a post is added or deleted. Previously at extremely high load it could take over 10 seconds (or result in a timeout) to view the first page of a forum. With the change it now takes 0.1 seconds.
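The real system is ASP, but the scheme itself is tiny: cache the first page per forum, serve from the cache while it's fresh, and throw it away after ten minutes or on any write. A minimal Python sketch of the same idea (all names here are illustrative, not the actual ASP classes):

```python
import time

CACHE_SECONDS = 600          # ten minutes
_cache = {}                  # forum_id -> (timestamp, first N messages)

def get_first_page(forum_id, fetch):
    """Return the cached first page, refetching if stale.
    `fetch` is whatever actually queries the database."""
    entry = _cache.get(forum_id)
    if entry and time.time() - entry[0] < CACHE_SECONDS:
        return entry[1]                        # cache hit: no DB query
    messages = fetch(forum_id)                 # cache miss: one DB query
    _cache[forum_id] = (time.time(), messages)
    return messages

def invalidate(forum_id):
    """Call whenever a message is posted or deleted in the forum."""
    _cache.pop(forum_id, None)

# Simulate three page views with one post in between.
calls = []
def fake_fetch(forum_id):
    calls.append(forum_id)                     # stands in for the real query
    return ["msg1", "msg2"]

get_first_page(1, fake_fetch)   # miss: hits the "database"
get_first_page(1, fake_fetch)   # hit: served from cache
invalidate(1)                   # a new post clears the cache
get_first_page(1, fake_fetch)   # miss again
print(len(calls))               # database hits for three page views
```

With a 100:1 read-to-write ratio, almost every view lands on the cache-hit path, which is exactly why the first page went from 10+ seconds to 0.1.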
Make sure the manner in which you approach a problem is sensible and that you haven't overlooked something obvious. Make sure you allow the systems handling your applications to have the best chance possible at running efficiently. Where possible and practical, take advantage of systems such as inbuilt connection pooling, efficient load balancing, session state, compression, web caching (both client and server), data caching (both web server and database server).
If you do this then you're most of the way there. You can get easy perf boosts if you have the opportunity to upgrade server hardware and use the latest server software. Once you've done the obvious then test and profile and then dig in and attack the bits that are broken.
Chris Maunder wrote: The golden rule is that data is always old.
It generally takes people a very long time to realise this. A lot of the features of classic ADO that people complain are missing in ADO.NET, such as client-side cursors and dynamic cursors, are attempts to get 'the latest' data. Unfortunately the new data isn't what was there when you performed the query, so you can sometimes get results that actually don't match your criteria.
Data is old. The challenge is to decide how old data is allowed to be. If you can cache data and batch up requests to refresh the cache, you can generally improve your query performance.
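Batching the refresh is the other half of that advice: rather than one round trip per stale item, collect the expired keys and refresh them all in a single query. An illustrative sketch with SQLite (the cache, the data and all names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Members (ID INTEGER PRIMARY KEY, Name TEXT);
    INSERT INTO Members VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
""")

cache = {}           # member id -> name, allowed to be minutes old
stale = [1, 3]       # ids whose cached entries have expired

# One query refreshes all stale entries instead of one query each.
marks = ",".join("?" * len(stale))
for member_id, name in db.execute(
        f"SELECT ID, Name FROM Members WHERE ID IN ({marks})", stale):
    cache[member_id] = name

print(sorted(cache.items()))
```

How long the `stale` list is allowed to grow before you flush it is exactly the "how old can the data be" decision.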
Not a lot. Well, not a lot that you would be able to tell.
Lots of profiling, combined with lots of wailing and gnashing of teeth over perf issues.
Lots of navel gazing and planning for next year
Lots of boring business type stuff like arguing with billing agencies, doing budgets and forecasts, timelines, apportioning roles and responsibilities and gazing longingly down the TODO list at some really cool things I want to do
Fighting with source control. This will be the death of me
Lots of documentation. LOTS. You only realise how much you've neglected something when you confront it and start dealing with it. Code documentation, planning documentation, reviews, responses, research. Phew.
Legal. A topic I've had a lot of email about lately and hopefully one I can put to rest very soon. Learning legalese is like learning a really annoying language that has no macros, no shortcuts, and no compiler error checking.
Backend utilities to help us do things better.
Some work with Microsoft. It's a bit of a tug of war, and long term, but we're hoping the end results will be worthwhile for everyone.
Unfortunately we're at the point where a ton of behind the scenes work still needs to be done. Processing bounced emails, improving database schemas, analysing logs and cleaning up and improving content attribution.
Our choices today, and for the next few weeks:
1) Make the site go faster
2) Do Really Cool Stuff that will make you go "ooh! aah!"
3) Make things work better for us here trying to run things
Clearly we want to do all 3, but at the moment our priorities have to lean on the side of being sensible. There is some leeway in the order we do (1), though, so I'm going to put a little more work into (3) in order to roll out bits of (2) while ensuring the focus is firmly on (1).
If I were the project lead, then the obvious choice would be number 3.
Any changes you can make to improve the development team's ability to develop will have a large impact on the other two options.
Without seeing the actual feature request list, making a choice between doing 1 and 2 after 3 has been completed is a little harder. Personally, the site already has all the features I need and probably meets the requirements of a vast majority of the users. So making the site go faster would seem to be a more important task. The faster the site runs, the more people will be able to use the site, the more ad impressions will be generated, the more money you can make for implementing option 2.
So today's fun was in trying to debug some ASP.NET. It's fairly simple stuff - I have a project that creates a middle-tier library, and a site in a separate project that uses that library. Everything's under the one solution. I have a reference in my website to the library's assembly and ensure that all the usual stuff like "Local copy" is set to true and have debug on.
But no matter what I do, VS.NET refuses to copy the pdb file over to the /bin directory of the website. I can copy manually but even then the debugger isn't picking up the symbols and so I can't set breakpoints or step through code.
I've tried everything, read every newsgroup, every blog, and every KB article I could find. I did it all. The only way out of this is to delete the reference then re-add it, and it works. Until I recompile the component, i.e. every 2 minutes.
The other fun bit is that running the debug build of the application works perfectly. But if I run via the debugger it gets caught in an endless loop. I step through the code and see that it's jumping from one instruction to a completely unrelated function for no reason. Something is seriously busted.
But on a brighter note Paul's logfile analyser is essentially done bar some fit and finishing, meaning we have one of the pieces in place to countback article downloads and include download figures.
Also, Nish's work on our improved syntax colouring component based on Troy Marchand's gem seems to be doing the trick.
No matter how good software companies make IDEs, or how safe and easy to program the underlying platform, we're always going to have to deal with device manufacturers and their applications that never, ever work properly.
I've got an Ericsson 610 and it's a little dodgy when it comes to syncing up with Outlook. It works, kinda, but it's never been the most painless process. At the moment I have 2 copies of all my contacts on my phone and I've hit the point where if I fiddle any more it's going to explode.
I've just upgraded (and I say that in the loosest possible manner) to a Nokia 6230. This thing has everything: MMC card, FM radio, MP3 player, video camera, bluetooth. But it will not, for the life of me, connect to my laptop to sync via bluetooth.
I've worked around the dodgy Belkin PCMCIA bluetooth card and arm wrestled it into submission. As long as I stay two feet away from the laptop while it's running, and as long as I don't stare at it directly, it usually won't cause a blue screen.
I've installed the latest Nokia connectivity software. The one that looks like it's flash driven. The one where you click a button for the connectivity dialog and that dialog appears underneath the main window (where'd ya go? I'm gonna get ya! Iiiiiii'm gonna get ya!). But even though I've paired the phone and laptop, and can connect to the phone through XP and see the files on the phone, and everything seems to be fine, the Nokia software mournfully concedes "Cannot use this connection type. Check that all needed hardware, software and drivers are available".
Hardware: laptop in corner sulking. Phone next to me, being painfully cheery but a little useless.
Software: Downloaded and installed all the latest go fast bits for the laptop. The phone continues to be cheery but a little useless.
Drivers: See "software" above.
So yet again, a less than spectacular hardware interface experience. Man oh man...
So today I had 3 SQL servers die on me. Two servers set up in a testing environment in our office, and one backup server that is sitting idle in the hosting facility. And we don't know why they died...
The mystery started when Clinton wanted to formalise our new testing, staging and deployment process, which, not surprisingly, required a test rig. We have one set up but haven't used it for a month due to development being at that stage where we're in between doing patches on the current system, revisiting groundwork on the new system, and cataloguing perf issues on the live system.
So they've been sitting there idle until I tried to resync the test SQL servers with a copy of the latest production DB so we could have real world data with which to test. SQL1 had reported issues a couple of days ago, but nothing too worrying. SQL2 was hale and hearty (AFAIK) so I moved to it first. It was dead. Stone cold don't-even-think-about-trying-to-boot-me dead. Weird. So back to SQL1. A blank, gray cold screen of rigor mortis was all that was to be seen. Dead, too, but in a blank next-world staring kind of way. The bodies have been removed and the authorities informed.
And then the hosting facility. Like some sordid B-grade mystery I logged into the network there to fire up SQL #3 in order to partition out some data access and spread the love. I mean load. It looked fine. It was a walking, talking SQL box but with a few little nervous twitches that I put down to too much sleep (on its part) and not enough caffeine (on my part). I installed, I patched, I created the tables, defined the stored procedures and added the logins. All well so far, but then no sooner do I walk out of the metaphorical room than we have another dead body littering the parlour floor. Not even a monogrammed glove or hieroglyph'd card to give a clue as to the perpetrator's identity.
Have you upgraded to three yet? You need to. Really. I'm still trying to find a way to fit a fourth on my desk. I think the phones can go - I have too many anyway and still can't tell which one is ringing.