Avoiding the Pitfalls of 3rd Party Components in a Large Data Processing Application

Redgate Software

1.00/5 (1 vote)

Oct 4, 2013

CPOL

9 min read

23749

How a Software Engineer at an Australian defence contractor tracked down a memory leak and improved performance of a complex application handling high-speed data visualization.

Avoiding the pitfalls of 3rd party components in a large data processing application

As a software engineer at an Australian defense contractor, I work on a wide variety of projects on a number of platforms – from C++/ASM signal processing deployed on underwater processing nodes to C#/WPF high-speed data visualisation for real-time shipboard systems.

The current project my team and I are working on involves updating our core acoustic analysis tool written in Borland C++ in 2008 to utilise C#/WPF. The application reads multi-channel acoustic data recorded underwater, applies frequency signature correction curves and visualises it in a LOFAR.

The application works in near-real time, allowing the operator to analyse a channel of acoustic data immediately after a recording segment is completed. The faster the analysis process can be completed, the faster the processed results can be given to the customer.

Each channel of acoustic data can be up to 1GB in size. The transition to newer development architecture has allowed us to move to 64-bit, and cache large amounts of data in memory at various processing stages. This vastly improves the responsiveness of the application, and allows for processing workflows not previously available.

The initial development stage focused on the core processing window (pictured above), which is also the part of the application that performs all of the heavy computational work.

During testing on some real world data sets, I noticed that, when zooming in on the LOFAR or changing any processing parameters requiring a redraw of the LOFAR, the memory footprint of the application would increase. After many zoom/processing changes, the application would also begin to slow considerably.

The generation of the LOFAR is performance intensive, and involves FFT processing the acoustic data, applying calibration transfer functions and then decimating for display in the UI bitmap. The decimation step is key to ensuring that no visual information is lost in the translation from the raw FFT resolution data (up to 24000x48000 pixels) to the displayed WriteableBitmap resolution (typically two thirds of total desktop resolution).

The FFT processing and decimation routines perform signal processing on large arrays (both managed and unmanaged) using a third-party C/assembler signal processing library. In addition to this, several moderate-sized unmanaged arrays are created and destroyed along the way.

I assumed that this was where the problem was occurring – that we were not freeing our unmanaged memory each time. However, some simple debugging showed this was not the case. The good thing about operating with unmanaged memory in C# is that you control and know when and where it is being allocated/freed!

As I’m relatively new to .NET and managed memory, I then assumed that I was dealing with a garbage collection issue: all of the unmanaged memory was being cleaned up, I could see that all our large cached arrays were being appropriately reset, and references to them were not stored anywhere. However, my lovely GC.Collect() hack had minimal effect on the problem.

The only other potential issue I could think of was inadvertent lingering references, such as subscribing to Events and not unsubscribing, or storing references to objects in lists and not removing them when no longer required. Thus began a manual search through the codebase…

Cue lots of ‘Find All’ within Visual Studio, looking for "+=", "Add", "AddRange", etc. The manual search was exhaustive, time-consuming, and didn’t turn up any obvious problems. I was frustrated and annoyed that I’d spent so much time on this without being any closer to finding a solution, at which point I thought, "Bugger it – maybe I’ll try a memory profiling tool and see what it comes up with."

A Google search gave me several options for memory profiling tools. As I’d previously used Red Gate’s .NET Reflector, and found that it was a great tool, I thought I’d give their memory profiler a go.

The ANTS Memory Profiler installed easily. I’d not used a memory profiler before, but I found it to be very intuitive. I simply selected ‘Profile Memory’ from the ANTS menu within Visual Studio.

With the application running, I loaded a section of acoustic data. I took a baseline memory snapshot, then proceeded to zoom in and out of the LOFAR display numerous times, taking other memory snapshots along the way. Whilst the snapshots took a reasonable amount of time to record, I didn’t notice any difference to the application performance whilst profiling. After enough zoom in/out iterations to cause a significant (500MB+) amount of memory leaks, I stopped profiling and loaded the report.

Along the top of the report was a graph detailing the application’s memory usage over time, along with the times at which my memory snapshots were taken.

I selected snapshot 2, which was taken after the initial load of the acoustic data and population of the LOFAR. The ‘Largest classes’ pie chart on the lower left showed that the majority of data was of type Int16[] (acoustic data) and Single[] (FFT data), which was as expected.

I then selected snapshot 6. It showed that the size of the acoustic and FFT data remained the same, which was reassuring. But what was using the additional 400MB of memory between the two snapshots? The ‘Largest classes’ pie chart gave some indication, with classes of type EffectiveValueEntry[], Gripper, Thumb and Canvas alone taking up 200MB of additional memory!

To investigate this further, I set snapshot 5 as my baseline and snapshot 6 as my current. This section encompassed a number of repeated zoom operations, and selecting the ‘Class List’ button showed up a wealth of interesting information.

As I was concerned about the volume of memory used, I sorted by the ‘Size Diff’ column, which showed an unfamiliar object EffectiveValueEntry.

I used a comparison of two checkpoints before and after a series of zooms. To my surprise, this showed that in a short time period, some 3.5 million objects were constructed! Looking closely at the number of individual classes being created, I saw that there was a common denominator of approximately 108,160, with the other objects being a multiple of this number. This indicated that they were all related.

And the class name that jumped out? Visiblox.Charts.Gripper. Visiblox is the third-party charting toolkit that we use for visualising a number of our data sets. So why was the use of this class leaking so much memory, and why were so many instances created in the first place?

I selected the Visiblox.Charts.Gripper entry and used the Instance Categorizer tool. Once loaded, it showed the reference paths between the instances of the Gripper class and all other classes, and it immediately showed the link between the LofarViewModel and the Visiblox.Charts.RectangleAnnotation classes. At this point, I had a fair idea of what could be causing the problem – something I had glanced across earlier during my lengthy (several hours) manual search.

I then selected the Instance Retention Graph, which traced the reference paths visually across method calls for a selected object (one I picked at random) and confirmed my suspicions.

When the SelectionDurationChanged event is fired from the SelectionModel, the topSelectionRectangle (among others) representing the top of the selected region has a new point assigned to it, reflecting the changed selection.

topSelectionRectangle.Points[0] = new DataPoint<double, double>(XRange.Minimum, YRange.Minimum);
topSelectionRectangle.Points[1] = new DataPoint<double, double>(XRange.Maximum, GetTopSelectedYValue());
bottomSelectionRectangle.Points[0] = new DataPoint<double, double>(XRange.Minimum, GetBottomSelectedYValue());
bottomSelectionRectangle.Points[1] = new DataPoint<double, double>(XRange.Maximum, YRange.Maximum);

The data points assigned to the selectionRectangles, which are ultimately bound from the Visiblox chart, are newed. It is probable that the chart registers to a PropertyChanged event or stores a reference to the DataPoint.

To test my theory I replace the above code segment with:

topSelectionRectangle.Points[0].X = XRange.Minimum;
topSelectionRectangle.Points[0].Y = YRange.Minimum;
topSelectionRectangle.Points[1].X = XRange.Maximum;
topSelectionRectangle.Points[1].Y = GetTopSelectedYValue();

bottomSelectionRectangle.Points[0].X = XRange.Minimum;
bottomSelectionRectangle.Points[0].Y = GetBottomSelectedYValue();
bottomSelectionRectangle.Points[1].X = XRange.Maximum;
bottomSelectionRectangle.Points[1].Y = YRange.Maximum;

And, amazingly, the memory leak issue was resolved! However, I was still wondering how 108,160 of these objects could have been created. It was as though the SelectionDurationChanged event was being fired many, many times.

Which, it turns out, it was. And ANTS Memory Profiler had even shown me where without me realising it. In the LofarViewModel, the SelectionDurationChanged event was called every time that a row in the LOFAR was generated. Which, given the number of zoom in/out operations, means it was called approximately 5000 times!

The firing of the event from this location was a bug, inserted early in development as a ‘belts and braces’ approach to ensure the selection area was valid upon redraw. However, the redraw case was handled properly at a later stage, but the offending line was not removed!

At this stage, two obtuse bugs had been found and fixed in less than an hour. And to top it off, the application was now redrawing the LOFAR ten times faster than previously.

So what has this experience taught me?

Be careful with third-party libraries. They may have implied expected behaviour that you may not be aware of. In our case, we were lucky to have a bug that exposed this problem at a much faster rate than normal
In a C#/WPF world where references are passed around willy-nilly, you have to ensure that they are not inadvertently stored anywhere that may hang onto them and prevent objects from being cleaned up in Garbage Collection. This includes WPF bindings, unsubscribing from Event handlers when no longer required and removing items from collections when no longer required.
Monitor your application’s vital statistics and profile your code regularly. I was surprised that no one else noticed or questioned why the application memory use was climbing. Perhaps my particular test scenario was the only one to stress this particular fail case. Either way, keeping an eye on resource utilisation and regular profiling, perhaps before milestone releases, to ensure the application is running lean and not leaking memory is advisable.
A dedicated profiling tool saved time. Despite having a fair idea of the type of issue I was looking for, the fact that it was an interaction with a third-party library meant that I could not manually dig through their code and locate the fault.

I spent four hours manually verifying that every Collection we had was not holding onto references, and that every Event/Message subscribed to was properly unsubscribed. Tools such as Process Explorer were great for saying "Yup, you’ve got a problem", but were unable to point the finger at any particular area. Considering the average hourly cost of development time, you only need ANTS Memory Profiler to save you a few hours in total to get a serious return on investment.

And I’m sure this is not going to be an isolated case. We’re software engineers: writing bugs is part of what we do. We just need to arm ourselves with the right tools to ensure that they are found and fixed in an expedient manner.