12,063,909 members (56,635 online)
‘Ingeniously simple tools’, the box promised. Yeah, right – I’d heard that one before. Half-heartedly, I installed yet another tool: ANTS Performance Profiler from Red Gate. As the progress bar filled up, I wondered if I’d be able to find a solution to my problem, before time – or my sanity – ran out, or my CEO's patience.
My story begins three weeks earlier, in Cape Town. As the IT manager and senior developer for a financial institution, the buck stops with me when there are code problems. We have a small development team, and as is the nature of small teams, we find ourselves wearing many hats.
Although the team is small, the company relies heavily on our technology and especially on our ability to make rapid changes. This gives us an edge over our competitors who normally have longer turnaround times with their technology changes.
Within our environment, we have several SQL Server databases; we have 3rd party software that manages our debtors’ book, accounts, and collection strategy; on top of that, we have an in-house ASP.NET application that our 80+ users work with on a daily basis. The application performs all manner of tasks, from CRM, to account maintenance, journal management, and accounting. Each month, peak operation is around 100,000 database transactions a day.
Our most mission-critical application is in daily use, and if it ever goes wrong, heads will roll. It’s a console application written about eight years ago in .NET Framework 2. Several people have worked on the code, and over time, quick fixes have been made to address changing business needs.
The application uploads a collection file to a bureau that deducts instalment amounts from debtors’ banks accounts and pays over to our bank account. If payment doesn’t get taken from an account, the chance of getting that money from the debtor decreases by 25% for every day not collected, resulting in cash flow problems as well as increased work and pressure for the Collections team. In a nutshell: don't mess this application up.
The software wakes up every morning at 10am and collects debtors’ information from various databases. It then compiles several files, connects to an external bureau via web services, and uploads the files. The app then polls for a return file and a whole bunch of activity takes place. The final step is a human process that must authorise the collection files. All of this must happen before the 10:30 cut off. Miss the cut off, and, well… you know what hits the fan.
One morning, the application collapsed with an obscure error indicating a database timeout. And then it happened again. And again. For weeks, the app would periodically crash, and we were powerless to do anything – as it talks to three databases, we weren't even sure where the timeout was occurring. The crashes started taking their toll on my team: in order to make the tight deadline, we usually had to upload the files by email, and then manually enter the data into the database.
We had to start the process earlier and earlier to ensure we could make the cut-off, which in turn put pressure on the Collections team, as they only had a few hours to make any adjustments and load accounts for the daily run. The collections manager had a daily blue-purple fit and ordered his team to come in earlier to cope.
As an IT manager who’s been with the company over 20 years, it is a terrible thing to sit in an emergency management meeting, in front of my leaders and peers, and have no choice but to say those words I detest most in the world: "I don’t know."
To top it all, the largest and most critical month end run was fast approaching. This run affects people's lives, as our collectors work on commission. To think of those people not getting paid properly, because I did not know, weighed heavily on my shoulders. I needed to find an answer.
Like demons, we attacked and analysed every single stored procedure and SQL statement, and we found nothing. There were no bottlenecks in the database: the database loop was working just fine.
But then where in the application was the database timing out? We went through the code until our eyes watered, but we could not identify where the bottleneck was. Eventually, after a Google search, I downloaded some tools to help identify bottlenecks. Most were too complex to set up, some needed code changes, and some should have been implemented eight years ago. One seemed to offer a bit more hope – Red Gate’s .NET profiler called ANTS Performance Profiler.
I installed it and quickly loaded the project. So far, so good – easy to use and no hiccups. I ran the project and reviewed the analysis. The interface was awesome, and had a whole bunch of information that for once made sense. Nice plus: it gave me performance data for both my .NET code and SQL Server queries, all in one single profiling session. Neat!
What gave me a jolt of excitement was seeing something I hadn’t seen anywhere else: a spike in the wall-clock CPU time, one minute into the process.
A little further investigation, and bam – there it was. The high hit counts and time outs led me straight to the real culprit: basTools.
Inside the basTools module were some standard basic tools we use in every project. It hadn’t crossed any of our minds to check if they might cause a timeout problem in SQL Server.
The problem took all of two seconds to fix, and to this day I sometimes bang my head against the wall, re-living the moment.
And what was the problem? When an account has an alert on it, an automatic email gets sent to an account manager. We had discontinued one of the product types, and recently re-instated it. This triggered an alert that called an old "discontinued" mail routine. Inside the mail routine was – wait for it – a hard coded IP address for a proxy server. The IP address belonged to a server that no longer existed. The application was trying to resolve the proxy inside a SQL loop, failing, and eventually giving up, reporting a timeout.
We pointed the IP address to a new server (this time, storing the setting in the config file) and ran the project successfully. To date, touch wood, we haven’t had a single timeout.
In conclusion, ANTS Performance Profiler saved the day as well as my hide, and my team came out looking like heroes. The error message looked as if the timeout was coming from the database. We never once suspected faulty code. The .NET profiler pinpointed the culprit in just a few minutes (a piece of legacy code with a relative distant relationship with the database) and we were able to improve application performance almost immediately.
ANTS Performance Profiler is indeed an ingeniously simple tool. A tool that does what it promises to do, and does it well.
What I don’t like about ANTS Performance Profiler though, is that we didn't install the darn thing years ago.