|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionThis is an in-depth look at the arrays in the Common Language Runtime and the .NET Framework. This study details the implementation of arrays and describes efficient ways of using them. This is my second article in the UNDOCUMENTED series, following my earlier article on strings. The intent of writing articles in this series is to help me understand how to use the C# efficiently to develop a serious commercial application. I am ex-Microsoft developer of Excel, who has started his own software company, developing applications employing artificial intelligence. Since my previous article on String Undocumented last month, I actually started doing some contract work on the side for Microsoft; I am actually working for the Windows group and can't believe the massive amount of great stuff that's gone on for the next generation of the NET framework. But, because I have signed an NDA, I have limited all my information to publicly available information and won't discuss unreleased products. BackgroundThe Array class is the base class for arrays used by the compiler and the runtime. Along with strings, arrays (including descendent classes) are the only types which are of variable length. The rank of an Array is the number of dimensions in the Array. The lower bound of a dimension of an Array is the starting index of that dimension of the Array; a multidimensional Array can have different bounds for each dimension Internally, the runtime maintains two separate implementations of arrays -- optimized
In version 1.0 of the CLR, each heap-allocated object consists of a 4 byte objectheader and 4 byte pointer to a method table, so each arrays have this initial overhead; in addition, both
MDARRAYS contain these additional internal internally.
Thus every access in a regular array must examine several internal members. For high performance, two approaches for multidimensional arrays are possible: the aforementioned, "jagged" C# arrays or manual calculation within Non zero-based arrays have the same shortcomings as multidimensional arrays, since they are both For instance, arrays with negative lower bounds produce interesting bugs. Two arrays are considered the same type if they have the same rank(number of dimensions) and the same element type. In contrast to C/C++, the upper and lower bounds of each dimension are not considered, not even inner dimensions in a multidimensional array. Methods such as Beyond the base size, both If the data is a value type, then the elements would be of the same size as that value type. Reference types consume
In contrast to strings, the other variable-sized object, none of the internal array fields are exposed in Reflection. Accessing these internal fields requires the use of unsafe code. Since these fields are already exposed through public methods and properties, proving any source code as I have done in "Strings Undocumented" for such operations would be redundant. Programmatically, an array is an SZARRAY if and only if The Dynamic ArrayListThe ArrayList is a very useful class for dealing with dynamic arrays. However, it serves more general purposes as well--encapsulating collection classes and allowing special operations to be performed on these classes.
Altogether an ArrayList consumes 20 bytes (8 byte object overhead + 12 bytes for the instance information), not including the space for the underlying array. If the array needs to grow beyond its capacity, a new array will be constructed with twice the previous capacity or the new desired size, subject to the maxcapacity constraint. This approach takes O(3n) which is linear time. The alternative approach of growing the array by a fixed amount rather than a percentage results in quadratic time performance, O(n^2). For optimal performance, arraylists should be preallocated if the size is known to reduce unnecessary copying. Compacting the size of the ArrayList when done requires a call to Completely releasing the space of an array requires calling
It's rather straight-foward to convert back and forth from Arrays to ArrayLists. Arrays can be transformed into ArrayLists using You can also accessing the underlining array of an ArrayList by calling Array Manipulation Without ArrayListsManual Array ResizingArrays can be resize manually. Here is a useful function that should have been provided by the array class. To mimic the behavior of ArrayList manually, a call to ensure should be placed before any potentially invalid indexing.public static Array Resize(Array array, int newSize) { Type type = array.Type; Array newArray = Array.CreateInstance(type.GetElementType(), newSize); Array.Copy(array, 0, newArray, 0, Math.Min(newArray.Length, newSize))l return newArray; }The Resize method uses Array.CreateInstance to do late-bound construction. Array MovementTo manual move elements within an array, the Array provides a general Copy function to copy data from one array to another. This function also works withthe same array. Range checks are performed just once. Within the same array, the copying behaves like the C standard library's memmove function rather than memcpy. The InsertHelper function shifts the contents of array to make room for count elements at the indexth position. Any elements towards the end are right-shifted out of the array and disappear. Similarly, the RemoveHelper method removes elements from the specified position, shift trailing elements to the left. public static Array InsertHelper(Array array, int index, int count) { Array.Copy(array, index, array, index+count, array.Length-(index+count)); array.Clear(index, count); } public static Array RemoveHelper(Array array, int index, int count) { int copy = ; Array.Copy(array, index+count, array, index, array.Length - (index+count)); array.Clear(array.Length - count, count); } The Buffer class also provides useful functions, GetByte, SetByte, ByteLength and BlockCopy, for manipulating arrays of value-types, that no not contain any internal object references. In fact, the types of the elements are ignored as the Buffer class treats each array as a range of bytes. Arrays of different value types can be copied in one another, so, for example, floating-point values can be overlaid unto integral data and vice-versa. To use this class, it is helpful to use the sizeof keyword or Marshal.SizeOf to obtain the exact size of the valueType. When copying elements inside a multidimensional array, the array behaves like a long one-dimensional array, where the rows (or columns) are conceptually laid end to end. For example, if an array has three rows (or columns) with four elements each, copying six elements from the beginning of the array would copy all four elements of the first row (or column) and the first two elements of the second row (or column). Arrays of BitsOne class that should not be forgotten is the Pascal set-like BitArray, which works surprisingly like an array of booleans in code. Booleans in .NET occupy a single byte, while, more compact than the int-size C++ bool, is still quite wasteful in an array. BitArray is implement as an array of Int32s, each consisting of 32 bits. Using an array of ints instead of bytes allows a single instruction to access and modify 32 bits at a time instead of 8, for a four-fold improvement in performance for many operations. Additional, operators for BitArray include And, Or, Xor, and Not. There's also a cousin BitVector32, that will use an int for a small set.Array Casting & ConversionArray Covariance: Conversions of ArraysArrays of references types support a feature called Array covariance, that mimic the ability of C++ to cast an array of pointers of one type to another type. One array can be converted to another array of the different type at compile time, if there is some built-in conversion, either explicit or implicit, between the two. The two arrays must also have the same rank. The array is reinterpreted and no underlying physical changes made during the conversion. If the conversion is implicit, (that is, the element type of the array before conversion is being converted to an interface that it supports or to a base type) a cast is not required and no runtime check is performed. If the conversion is explicit, (the conversion is from an interface to another type, from a base type to a derived type, or from a base type to an interface not supported directly by that base type) an explicit cast is required and a runtime check is performed. Each array of references, as mentioned earlier, has an underlying element type that remains fixed throughout the conversions. The runtime check is performed to ensure compatibility between the underlying element type and the new element type that it is being reinterpreted as. A few examples should make it clear: public class Animal {} object [] data = new Animal[2]; // Animal [] is converted implicitly to object [] Animal [] animals1 = data; // Error: explicit conversion from object[] to Animal [] // is required Animal [] animals2 = (Animal[]) data; // object[] is converted explicit to Animal [] string [] strings1 = (string[]) animals2; // fails at compile time because // no conversion exists between two string [] strings2 = (string[]) data; // succeeds at compile because explict // conversion exists between object and string // but fails at runtime, because underlying // Animal array is not derived from string // array object [] data2 = new object[1]; data2[0] = new Animal(); Animal [] animals3 = (Animal[]) data2; // succeeds at compile time because explicit // conversion exists between object and Animal // fails at runtime even because data2's // underlying data type is object[] which is // not derived from Animal[] even though all // the elements of data2 are currently // Animals, the cast still because data2[], // being an array of objects, is not // constrained to Animals and // potentially data2[0] could be assigned // later a string or other type Being able to reinterpret an array of one type to an array of interfaces, base type or derived achieves efficiencies in time and memory, that would have disappeared if another array of the other type were required to be constructed. public void Test() { string data [] = new string [] { "a", "b", "c", "d", "e" }; SetRange(array, 1, 3, "x" ); } public void SetRange(object [] array, int start, int count, object value) { for (int i=0; i < count; i++) array[i+start] = value; } In the above example, the new string would appear as { "a", "x", "x", "x", "e" }. Though we did not explicit code to handle strings, array covariance allows SetRange to work for string arrays, nevertheless. Passing an integer as the parameter value would have resulted in a runtime exception because all array assignments for reference arrays include runtime type checking. The following example illustrates issues with using value type arrays and string arrays with param arrays. The call to Write with integer arrays results in Results in "System.Int32 []" being written out because the the array of integers is wrapped up into another newly constructed array of objects. However, the call with string arrays results in "a", "b", "c" being written out; because of covariance, the string array becomes the args parameter. // This illustrates the difference in treatement between array and value types Write( new int [] { 1, 2, 3 } ); // Results in "System.Int32 []" being written Write( new string[] {"a", "b", "c" } ); // Results in "a", "b", "c" being written void Write(params object [] args) { for (int i=0; i<args.Length; i++) Console.WriteLine(args[i]); } Everything has some cost. The downside of array covariance is anytime an element in assigned a new object, that object must be type-checked at runtime. There's always the choice of using Coversion of Array Elements (Array.Copy)When copying elements between arrays of the same types, Besides copying elements between arrays of different types, public Array Convert(Array array, Type type) { Array newArray = Array.CreateInstance(type, array.Length); Array.Copy(array,0, newArray,0, array.Length); return newArray; } ArrayList ViewsSome of the versatility of array lists are its ability to construct various types of views of other Arrays, ArrayLists and ILists. AdapterArrayList.Adapter allows any IList (that includes an array and many other collections) to be viewed as a ArrayList. This is useful in several ways: An IList can use the binary search, sorting, reverse, subrange and array conversion capabilities that the arraylist automatically provides. However, this may not be as useful for an array since it already provides all those capabilities accept the subrange views.
Array SubrangesTo extract a subrange of an array, one can write the following code. public static Array GetRange(Array range, int start, int count) { Type type = array.Type; Array newArray = Array.CreateInstance(type.GetElementType(), count); Array.Copy(array, start, newArray, 0, count); return newArray; } Because this produces an actual (shallow) copy of the subrange of the array, this potentially consume a lot of memory. The ArrayList class contains a GetRange(int start, int count) which is potentially a memory saver for large array. GetRange returns a derived ArrayList that presents a view of array. The view can be manipulated as well. Elements can be modified, added and removed from within the view; however, any modification to the arraylist from outside the view will cause an exception to be thrown next time the view is used. This method does not create copies of the elements. The new ArrayList is only a view window into the source ArrayList. However, all subsequent changes to the source ArrayList must be done through this view window ArrayList. If changes are made directly to the source ArrayList, the view window ArrayList is invalidated and any operations on it will return an InvalidOperationException. In combination with the Adapter method, a view of an array or another collection can be obtained as well. For example, Wrapper SupportArrayList contains three static methods with two overloads each that take either an IList or an ArrayList and returns a fixed-size, synchronized, or readonly IList and ArrayList respectively. The IList methods returns a lightweight wrapper class inherited from IList and consisting of one instance variable, a reference to the base IList. The ArrayList methods returns a derived ArrayList class, which consists of an additional instance variable referring to the base ArrayList class and ignores the inherited arrays. The ILists returned from the methods overloaded for ILists, are not derived from ArrayList and are somewhat misplaced in the ArrayList class. It would have been more appropriate for them to have been static members of the IList interface.For example, these are the actual implementations of FixedList. public static IList FixedSize(IList list) { if (list==null) throw new ArgumentNullException("list"); return new FixedSizeList(list); } public static ArrayList FixedSize(ArrayList list) { if (list==null) throw new ArgumentNullException("list"); return new FixedSizeArrayList(list); }
These operations can be combined for a synchronized fixed-size array or a synchronized read-only array: Readonly is especially useful for prohibiting modification. Arrays, while fixed-size, are still always passed by reference and are mutable. In marshaling, arrays of greater than ten elements are pinned rather than copied, so potentially they could be modified. Array PerformanceRange Check Elimination: Use for(int i=0; i<a.Length; i++)The C# compiler performs a special optimization that improves the performance of iterating through an array. First, compare the following three approaches to iterating through an array. Which is fastest? 1) standard iteration int hash = 0; for (int i=0; i< a.Length; i++) { hash += a[i]; } 2) iterative loop with saved length variable int hash = 0; int length = a.length; for (int i=0; i< length; i++) { hash += a[i]; } 3) foreach iteration foreach (int i in a) { hash += i; } In the current version of the JIT compiler, you may be surprised to learn the first example produces the fastest code, while the third "foreach" example produces the slowest. In later versions of the compiler, Why is example 1 faster than example 2? This is because the compiler recognizes the pattern for (int i=0; i<a.length; i++) for both arrays. Since array have constant length, the compilers simply stores away the length, so that no function call is made on each iteration. (The JIT compiler may actually inline references to the array length, since the current version automatically inlines non-virtual method calls that consist of simple control flow and < 32 bytes of IL instructions.) In addition, the compiler eliminates all range-check tests on any instance of s[i] within the loop, because i is guaranteed in the for condition to be within the range 0 <= i < length. Normally, any indexing of an array results in a range-check being performed; this is why attempting to save time by stashing away the length variable in example 2 actually results in slower code than in example 1. There also the pointer approach. fixed (int *pfixed = a) { for (int *p = pfixed; count-->0; p++) hash += *p++; } Large ArraysLarge arrays can have a significant performance hit. Any large object that consumes 85K is placed in the large object heap. Practically speaking, almost all of these objects are going to be arrays, and possibly some strings, because very few classes will contain enough fields to exceed that amount of memory. Large objects are not compacted, and are only removed in a FULL garbage collection, containing generation 2. If the large object includes any finalizers, then at least two FULL garbage collections will be required. Since FULL garbage collections can be 100 times or more less frequent than the partial collections, it can take a long time for memory to be recovered. Thus, a very poor allocation scheme, probably the worst, for a .NET application would be one that allocates large amounts very frequently for temporary uses. Array InitializationStatic arrays of primitive data types are initialized at compiled time, just as in C++. Unfortunately, static arrays of structs are not and must be initialized at initialized at runtime. The developers defer this features because of the complexity introduced by the ability of structures to contain object references. Presizing ArrayLists and Other CollectionsIn general, collections in the .NET Framework are dynamically resized by doubling their capacity each time they are full. This is an linear-time approach that takes O(3N), which is superior to the quadratic time approach of appending a fixed size to an array. But, if you know the approximate size of final array beforehand, you can essentially bring down the time to create a collection by two-thirds just by preallocating; instead of 3N copies, the collection will make just 1N copy. Each collection has a capacity setting for doing so. Use Jagged Arrays Over Multidimensional ArraysIn future versions of C# post-Everett, many of the performance issues from the multidimensional arrays will be address. For now, jagged array which rely on optimized SZARRAYs and require just slightly more memory, are significantly faster. Use strongly typed arrays whenever possibleStrongly typed arrays are highly optimized, avoiding the costs of boxing, casting, function calls and other indirection. With value-type arrays, a number of functions such as Reverse, IndexOf, LastIndexOf, BinSearch and Sort, In the future, C# in "Visual Studio for Yukon" is expected to support generics and constraints, which will complete eliminate boxing and allow functions to perform as efficiently as possible. More information is available from Microsoft's public announcement at www.csharp.net . public class Stack<ItemType> { private ItemType[] items; public void Push(ItemType data) { ... } public ItemType Pop() { ... } } Generics have several advantages over C# templates such as the elimination of code bloat. Code for specialized classes are generated at run-time on the fly, and all reference types share the same code. ConclusionThis concludes my discourse on arrays for now. I will continue update this article with new source code and actual benchmarks in the future. Be sure to watch for update versions of this page. As a result of the enthusiasm that this article has generated, I will continue to develop more UNDOCUMENTED articles. My next article will be a discussion of the implementations of arrays and collections. I hope to publish a couple dozen articles when I am done with the series. I might eventually make this into a book. My sources include various books, the shared source CLI, MSDN, magazine articles, interviews, inside sources, developer conference presentations, and some decompilers. One reader has suggested that I examined . All of this behind-the-scenes information takes some amount of work to research and obtain, so, if you enjoyed this article, don't forget to vote. Version History
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||