I don't want toys or new shiny things. I don't want Visual Studio Orcas or Windows Vista Server. I don't even want service packs.
I want versions of the Windows applications that I must use in my day to day work to be optimised. To be made faster. To not use so many system resources, so much memory, so much disk accessing and so many network roundtrips that they are rendered unusable while your machine digests their request.
SQL Server management console is currently the poster child (for me) of an application developed by someone who had the server in the room next door to them. Try bringing up a context menu on a server across a slow connection.
IE is the embodiment of inefficiency bought on by a "we'll use the same function for everything" mentality. Click on a new tab and you get a "Connecting..." prompt and enough time for a quick lie-down. Just bring me up a new window. Don't connect to anything. Don't load anything. Just the window please. Firefox does this task instantly.
Outlook is a great example where they forgot, in places, that there was a real live human trying to use the application. I typically read emails in the preview pane and then right-click to mark as read (or leave it be if I need to follow up). Sometimes I make the horrible mistake of right-click and selecting the option underneath, namely "Categories...". This can lock up outlook for minutes. Where's the multithreading? Where's the "cancel this incredibly slow operation you didn't mean to do" button?
Also: I get thousands and thousands of spam a day and they get kicked into my spam folder at the exchange server level. If I try and empty this folder I can't do anything else with Outlook. It's blocked.
Please: PLEASE do some housecleaning on Outlook for us poor souls who rely on it to handle thousands of emails.
I'm not even going to discuss Visual Source Safe. Microsoft: please just buy a new solution that works. Ionforge Evolution works like greased lightning for me across 14,000km (Australia to Toronto). Visual Source safe is unusable when outside the LAN.
Visual Studio 2005 is the final splinter under my thumb. Refactoring is not a feature anymore, it's a necessity. The refactoring in Visual Studio is so incredibly inefficient that it's essentially unusable. When you do find you have to use it it's always with that "Oh God no" feeling.
Similarly, Building a solution. Builds are a really, really old technology and the problem of efficiently building applications was addresses - I dunno - over 20 years ago? Why then is it that everytime I do a Build, Visual Studio takes such an inordinate amount of time to determine which projects should be rebuilt and which can be completely ignored.
We could all go on and on and on. I'm not angry, I'm just disappointed. What happened to pride in applications? To being careful with resources? To being clever with the way you do things so you save cycles or, even better, network roundtrips, unnecessary function calls and minimise disk access.
So here's what I really want: I want a new "VP Usability" job position at Microsoft. That person would make sure the applications had a consistent user interface, would ensure applications were simple to use, played nice with your computer, and were snappy. He would be given a 2GHz Core Duo laptop with 1Gb RAM and a 5400rpm HDD. He would attend all meetings with that and his swearing and cursing recorded diligently. He would berate sloppy, corner cutting programming and reward, with riches beyond their wildest dreams, exceptional programming. He would have style and class and absolutely zero patience.
And his first job would be to fire the entire WMP team.
I think we are seeing this in many applications. Astronaut architected "platforms" masquerading as applications. Outlook is not so much an email tool as an information broker platform these days. I'd prefer if they made it an email tool again. Firefox is OK now but the next version is going to have all sorts of "application platform" aspirations and I fear it will revert to the bloat that was Mozilla.
Great, it can do X, Y AND Z but guess what, it doesn't do A very well anymore and A is 90% of its purpose. It integrates with G? Great! But I don't have G. It makes me coffee? Fabulous! Except this other dedicated coffee maker makes better coffee so I'll stick with that, thanks.
There is a legacy of "features/capabilities == competitive" thinking. People think it is difficult to release a focused application that does one thing well. They fear the initial volley of blog/magazine/pundit remarks along the lines of "Those idiots forgot Feature K!" which does happen but, with good tools, is often followed by a dedicated group of users that are worth more than the billion use-once-never-again users.
Flash in the pan > long term effect.
And as much as I prefer Mac OS X to Windows these days, it is not perfect either.
It is difficult battling the Feature Demons though. You can be as rational as gravity but you'll come out battered by feature requests you know are superfluous and ultimately damaging to your users productivity. Too many are afraid of making a stand, of forging a path. Yet we have seen that those who do focus win out in the long run.
I don't think firing people is necessary but Microsoft in particular have lost vision, have lost a champion at the front to drive them forward. Apple are doing well off of Jobs and I'd like to see Microsoft, which has a much tougher challenge than Apple, find that person and vision again.
And with that, Paul closed his browser, sipped his herbal tea, fixed the flower in his hair, and smiled brightly at the multitude of cute, furry animals flocking around the grassy hillside where he sat coding Ruby on his Mac...
Clearly something's been happening behind the scenes so I figured a quick rundown of the main bits would make for interesting lunch time rading. Or something.
So here's a quick rundown of where we are:
After more than a year of upgrading and troubleshooting hardware we are at a stage where hardware is solid, our network topology sensible and our load bearable. We haven't bought a new server in months yet site performance is consistent. We are at a position where there is no need to push upgrades or increase capacity which means it is a perfect time to push upgrades and increase capacity. And because we now have the luxury of time we've gone back to basics and started from the ground up.
Every database, table and field in our SQL layer has been examined and refactored. Several hundred changes later and we now have a schema that brings together all the disparate modules that have been developed for CodeProject over time into something unified and extensible yet still modular. Strict attention was paid to keys, indexes, referential integrity, normalisation - even sensible names that reduce the chances of misinterpretations.
The next (and current) stage is taking a birds eye view of what the data layers are meant to provide the business layers, make realistic estimates on table sizes, result sizes and frequency of calling, and from there rework the queries and indexes to be optimised, overall, for our patterns of use. Some long running complicated queries are left alone since they are called once every 6 hours, whereas other queries are optimised and reworked in a slightly convoluted in order to eek uot more performance on something that is called constantly. In conjunction with all this judicious use of caching can do miracles.
Which then leads to the business layer which at the moment is confined to internal pages and applications. The data layer, base web and base standalone application frameworks are done. The CodeProject you see today is part dynamic ASP and part pre-cached data generated by backend services using the new CodeProject .NET framework. View the latest articles and the RSS feeds of the latest articles and you will, unfortunately, still see a difference in what one produces over the other.
For the most part current work on the business layer is confined to those parts required by our utilities. Having the ability to use this code in the web application as well means when it comes time to complete the various modules that make up CodeProject, 80% of the code will already be there. It'll be lego programming at its best.
The update to the presentation layer has been something we've been putting off for, well, ever, but with the speed of development increasing we found time to get this part done. The whole redesign has been discussed ad neauseum but on the whole we're pretty happy with it and happy with the feedback. It ties together the mishmash of different styles, gives us more room to move and, most importantly, works with the HTML generated by the new back-end systems.
As to what's next? We're not sure. We've got our laundry list of items to work through but there's definitely a spring in our step after such a long period of simply doing maintanence. New features in the forums and new features for our members are just too much fun to ignore so we may even delay some parts of our main work to take a side trip and throw new stuff out into the wild. With so much work being done on the backend it's hard not to want to let some of it pop its head into view a little prematurely.
There must be something fundamentally unique about the placement of Toronto because the number of bizarre, almost biblical plagues and curses that hit this town seems to me to be way above the statistical norm.
SARS, blackouts, more SARS, West Nile Virus, roads washing away, intersections caving in under sink holes, Bird Flu, the regular procession and recession of glaciers, winter smog, summer fog, debilitating cold, insanity producing humidity and I'm sure, any day now, rabid carnivorous squirrel zombies.
OK, so I made up the bit about glaciers in Toronto but it's so bizarre to me that not a day goes by without something, somewhere, being terribly wrong around here.
The latest? Alergies.
Sure, many people get the sniffles every spring or Autumn but it seems everywhere I look Torontonians have collectively turned into a drooling army of puffy eyed, coarse throated snifflers because the nervous and opportunistic vegetation and fungi have taken a peek at the sky and said "Make the most of it guys - we've got about 2 weeks before we're all snap frozen".
6 years on and Toronto still never ceases to surprise and baffle me.
If you're wondering what happened last week here's a quick rundown:
Problem: Failed DNS server and a failure of our backup service to take over as we had anticipated.
Solution: David has worked through everything with our DNS providers and sorted everything out, as well as adding a third level of redundancy courtesy of Albert Pascual.
Problem: Degraded internet connection feed into one of our matched firewalls. This is out of our control and should never, in theory, happen. But Murphy rules. We have redundancy in our firewall setup to handle just such a case, but the redundancy depends on a connection failing, not simply degrading. In the process of diagnosing this David has discovered that our painfully expensive firewalls may not be able to handle as much load as we would like.
Solution: We're working with our provider and gritting our teeth and pricing replacements for our firewalls.
Possible problem: Potential Windows Service Pack issues. Some of the problems coincided with a set of recent Windows updates being installed.
Solution: We've uninstalled these which seemed to speed things back to normal but we won't know until next high-tide, probably Monday morning PST.
Problem: A new bottleneck. We increased the capacity of the database servers to the point where the webservers were no longer waiting on them. This allowed the webservers to push through enough pages to hit their CPU limit. Unfortunately when the servers max out they automatically cycle, which redirects load to another server, causing a chain reaction.
Solution: Adding more servers, relaxing the cycling rules and, of course, optimising code are all being done to alleviate this.
Problem: Web server hardware failure. A couple of the servers simply rolled over and died. We've grown the site from home-built white boxes and are rapidly replacing everything with HP and Dell servers. Building machines to specific specs allows you versatility and massive savings, but there is a point where speccing, building, testing and maintanence outweigh the costs of buying something built and tested for a specific purpose.
Solution: New brand-name, warranty'd servers. HP servers are butt-ugly, but they do the job.
Problem: Load balancing issue. I can't, for the life of me, understand why load balancing is such a hassle. Windows load balancing works OK - not great, but OK. Until you try and mix Windows 2000 and Windows 2003 boxes: they don't like to play together. So we've used our firewalls for load balancing and no matter what setting we use - random, round robbin, weighted-least-connections - we get clumping whereby one server gets too many connections while others are starved. This in itself is not the main problem. The main problem is IP affinity, whereby a user is stuck on a single server instead of being shunted around the server farm. So, if you are stuck on a fast server it's great. However, if you're stuck on a slow server then no matter what you do your in for a slow ride. Refresh the page, close your browser, use a different browser - it doesn't matter. You'll always end up on that server until it cycles.
Solution: We've disabled hardware load balancing, standardised on a single platform and reenabled Windows load balancing.
What we've learned:
When it rains it pours
If you are told "this can't happen", it will. Murphy's Law rules. Keep an eye on services that you aren't in control of and make sure you talk to your providers about how, exactly, they handle failure events. You might get scary answers that challenge your assumptions.
Test service packs
There is a huge opportunity for someone to come up with a simple, integrated web farm management package that handles monitoring, cycling and load balancing.
As many are aware we've had a run of bad luck with some of the servers. Frequent "500" server errors which cause the blood pressure to rise. A 500 error can be raised either by a bug in the code behind a site, or by IIS / ASP.X crashing. The ones that we've been experiencing are mostly the later: ASP falls over and can't get back up and from then on all it will serve are 500 errors.
The question is: why? Why would ASP fall over? Partly because it's not robust, sometimes because the load on it is so high it simply can't take it, and sometimes, as in our case, because of faulty hardware.
So in answer to "why were we having so many 500's last week" the answer is "because our hardware failed". A HDD was failing intermittently on a single server and that was enough to cause pain for a percentage of our page views. It failed enough to cause problems, but not consistently enough for for the red flags to be raised. In hindsight we should have stopped combing through code, IIS settings, and load balancing to try and find the problem and checked the hardware (since all machines have the same software setup) but as we all know there's nothing quite as accurate as hindsight.
Upgrading to bigger, better, "more stable" systems should be easy. It should be something that, once the work of physically making the changes has been done, just works. After all, the new systems are, by definition, bigger, better, and more stable.
But life isn't like that.
We're in the process of upgrading our servers from Windows 2000 to Windows 2003. Yes, it would have been nice to have done this move sooner but a number of issues have delayed us, including cost of upgrading hardware, cost of new licences, and the most finicky of them all: problems with Windows 2003.
Not problems as in "Windows 2003 doesn't work". Windows 2003 is more stable, IIS6 is way better, and security is tighter. Unfortunately code and components that had run fine in 2000 can fail in 2003. One of the issues we had was an ISAPI filter that died because of the changes made to the way websites are identified within IIS6. Another issue we had was some .NET 1.1 code that worked as smooth as the proverbial baby's bottom on 2000 suddenly rolled over and died in 2003. Not immediately, not with any useful error messages - it just sent IIS into a spin and bought down the site.
So we're being a little tentative with our migration plans. Where we can we will move away from ISAPI filters and use HttpModules. We've abandoned the tried and true .NET 1.1 and are doing all development in .NET 2.0 beta 2 (if tried and true doesn't work then why not just go crazy and use bleeding edge technology, right?). We are moving one server at a time over to the new OS and codebase and some machines are happy, some are not. The ones who aren't typically are older and so are being retired and new machines purchased to replace them.
So today sees the deployment of the first batch of new 2.0 beta 2 code. There's a chance you've already unwittingly stress tested it for us (though you'd be hard pressed to spot it). If things continue as well as they have so far then we'll move over the rest, piece by piece.
So why the new features? Why upgrade old code instead of moving forward with fresh infrastructure?
Because it's fun.
Actually the answer is a little more complicated than that. We are continually planning and introducing upgrades to both software and hardware to keep up with the increased load. As we upgrade one piece, another piece becomes a bottleneck, so our planning usually runs along the lines of "what's the minimum that will make worthwhile a difference, and what other pieces will be affected".
The current series of upgrades will involve some fairly dramatic purchases and will also involve partnering with other companies in order to bring in specific expertise. In order to do this, though, there's the age old problem of finding a credit card big enough to lump it all on, so much of the work in the past few months has been boring, painful but critical financial planning and costing.
The upgrades come in several parts: hardware - big, loud, humming lumps of beige; software - in the form of upgrading OS infrastructure and the code base; and appliances - dedicated boxes that will provide specific services such as caching.
The planning of the hardware and appliances are pretty much done. It's just a matter of manpower and dollars to get it all implemented. The software side of things is a little different.
A lot of work has been done in investigating different methodologies, different frameworks, different features. My TODO list not only lists over 150 Things That Must Be Done, but there are also around 70 Things I Would Love To Do wishlist items.
Planning, testing, thinking and designing are important, but eventually you just want a taste of how things could be. To that end, some of the ideas that I've been thinking about for a while such as the changes to the discussion boards have been retro-fitted to the old code as part of a working prototype. It's far more fun to test on a live audience and get valuable feedback now than it is to wait till it's all done and unveil in one big lump.
Send me feedback, send me ideas, and don't be shy if you hate the changes. I might not agree, but I always listen.
So again we find ourselves in the hosting facility, screwdriver and large mallet in hand. Usually when we get an alert that the site is unresponsive it's due to something simple: a server having an aneurism, a switch collapsing due to fatigue, the Province of Ontario having a blackout. Today we rock up to the hosting centre wondering what the God of Intermittent Network Connections has dealt us today and we find half the flooring of the centre ripped up, cabinets open with miles of wires spilling out, and lots and lots of keen but slightly haggard network admins with crimping pliers running around doing...stuff. We're not sure what. We just know that for the last 15 minutes no one can get to the site.
The problem, it seems, isn't them (Wide eyed innocent looks all round, pliers hidden behind backs). But it's not us either. We can ping outside our cage into their equipment. They can go from their equipment to the outside world. But somewhere there's a break and no one can find it.
And then it fixes itself. But we can't see that it's fixed itself because our KVM switch (that thing that allows one keyboard/video monitor/mouse to hookup to multiple machines) has quietly expired.
So everything is again working perfectly. Redundant power, redundant firewalls, redundant connections, redundant servers. Not that we can actually see anything happening while in the cage.
Over the past few weeks performance has been the major focus of attention. The work was two-fold: upgrading hardware to ensure we're not limited in that regard, and reworking the code and database to ensure we use that hardware with the greatest efficiency. Our operations guy slipped in a nice new shiny database server that should have enough capacity to see us good for a while. The code upgrades were a bit more involved.
I've received a lot of helpful suggestions from many readers on their experiences, the things that worked for them and major gotchas. Sensible things like removing unnecessary queries, caching as much as possible, using firehose cursors, providing lock hints to minimise locking issues and making sure indices are correct. I would like to give special thanks to Andy Brummer for taking the time to analyse our SQL profile traces and make recommendations. This has been an on-going iterative process and the results have been impressive.
The initial step was to perform a SQL trace to see what queries were the most expensive and which could be reworked or removed. It may seem obvious but there's no use in removing or reworking something if it's not a problem. Likewise, if we find a query that is a big problem but an absolute necessity then at least we have a starting point from which to measure subsequent improvements.
After our initial tests I reworked the infrastructure as follows:
Removal of features that are high load
Viewing the messages posted by a given member, the birthday list, the number of new messages in each forum since last visit, the 'last page' button in the forums (fora?*), and removal of the link to the members list page in the lounge all contributed to a reduction in load. Some of these features were of interest to only a very few people so I felt justified in nuking them. Other features, such as the number of messages since the last visit are useful, but expensive. In the end a simple compromise was made whereby fora with new messages are listed in bold. It's not so much how many new messages have been posted, but whether that number is greater than zero.
As well, removing links to high load features discourages abuse of these features and reduces the number of site crawlers that inadvertantly hit a page. As a resource site, we are spidered and indexed frequently.
Each forum can now have its top N messages cached at the webserver level, and I can turn on and off caching for individual fora and change the number of messages cached on a per-forum level. Caching only works when the user has selected 'messages in the last 3 months' (since setting the date to anything else changes the messages to display, and the caching isn't smart enough for that). This setting is the default setting so spiders and crawlers will experience (and contribute) little load.
The caching is a generic webclass and as such it can be applied to other features. The number of members and number of articles on the homepage have also been cached alongside the other homepage items.
Lock hints were added to the article search query. Most of the data we need to display is not time-critical so if it's displayed 10 seconds out of date no one will notice. By adding local hints to queries we were able to reduce the number of lock waits.
At this point we had a nice boost. Many of the worst offenders had simply been removed. We had essentially torn out the bigger weeds from the garden and could now concentrate on more insidious creepy-crawlies. More traces were done and the results used to fine-tune:
A server-side cursor was hiding in the search page. This was converted to firehose and resulted in a a nice boost. Actually not so much a boost, but the removal of a slow point.
Important indices on many tables were lacking, even after careful and repeated use of the SQL Server Index Tuner. This has significantly reduced the number of massive table scans and so has reduced load spikes.
Reworking queries and adding indices go hand in hand. In my soul I'm a C++ programmer and sometimes forget that I'm not programming against a chip with a math-coprocessor, but against a SQL interpreter that has to first seek out the data on which to operate before doing the actual operations. A great example was a search for a specific forum type. ForumType is a bitmask and I wanted forum types 16 and 32. Hence:
SELECT * FROM Forums WHERE (ForumType & 48) > 0
However, this means every row in the Forums table needs to have the (ForumType & 48) operation applied, meaning a full table scan. Replace this with
SELECT * FROM Forums WHERE (ForumType & 48) > 0 and (ForumType >= 16)
And now you can take advantage of an index on ForumType and reduce the number of rows to which the (ForumType & 48) need be applied. This can result in a significant performance improvement.
Still to do are a number of important steps:
Reduce page and site features based on the visitor. If it's a search engine spider then expose the articles and information in a simple manner that reduces database load
Improve spider and stripper recognition. Recognised site strippers are already banished but most pretend they are IE and slip through.
Rework the table structure to minimise joins. This will involve careful denormalisation of some tables, while reducing flexibility in others. The point being that the flexibilty built into the current system isn't being used so it won't be missed, nor will the current performance hit caused by the current schema.
Reduce the physical size of tables by archiving old data or vertically partioning data into separate tables.
Offload some of the logging work to a separate database server to free up the database server for more immediate tasks.
*Yes, this is the correct pluralisation. Sounds a little weird, but who am I to argue
Chris Maunder wrote: SELECT * FROM Forums WHERE (ForumType & 48) > 0 and (ForumType >= 16)
This is why i'm sooo bad at SQL...
Chris Maunder wrote: # Improve spider and stripper recognition. Recognised site strippers are already banished but most pretend they are IE and slip through.
So... just out of curiousity... what kind of behavior is acceptable, or more to the point, what kind of behavior is likely to get banned? I'm guessing (based on experience) that reloading The Lounge every 3-5 minutes is ok... but is a script that pulls it down every 10 seconds pushing it? How about every 10 minutes? Is this already documented somewhere in big, bold, red text? And, not that i'm planning on wallpapering my house with an entire printout of CodeProject or anything like that, but which pages would you especially rather not be hammered on?
I'm not the Jack of Diamonds... I'm not the six of spades.
I don't know what you thought; I'm not your astronaut...
I have some questions and suggestions, just ignore them if you have already herd them.
I was just wondering, what version of SQL you are running? I assume it is MSSQL 2000. If that is the case, I strongly suggest you take a look at MSSQL 2005. It has transactional based replication that can be used with load balancing. As you probably know, this will allow you to just drop in another SQL server to increase performance. This replication is a lot better then 2000s. While I do not work on the DB end personally, I know we have had a lot of success with 2005 so far. It is a very stable Beta. Also, have you considered moving all [downloadable] files to a separate server?
We will absolutely move to SQL 2005 as soon as Microsoft releases it to manufacture. I didn't realise it handled replication with load balancing - finally! But above and beyond that there are general perf increases we'll get for free, and features such as being able to specify the first and last row to retrieve that will make code logic cleaner and reduce data crunching and transfer.
We've been looking at how to open up bandwidth for downloads while ensuring we don't overload the site itself. One thing at a time.
Last Visit: 2-Apr-20 12:47 Last Update: 2-Apr-20 12:47