Click here to Skip to main content
15,867,330 members
Articles / High Performance Computing / Parallel Processing
Article

Parallel Extensions for the .NET Framework - Part I (Introduction & PLINQ)

Rate me:
Please Sign up or sign in to vote.
4.96/5 (17 votes)
14 Nov 2008CPOL8 min read 70.3K   953   47   7
This article introduces PFX for the .NET Framework and discusses PLINQ.

Introduction

Microsoft recently released the Visual Studio 2010 CTP. One of the most interesting features in this release, apart from Dynamic C#, was Task Parallel Library. TPL and PFX (Parallel Framework) are available as a CTP for VS2008 and .NET Framework 3.5. Task Parallel Library is designed to target multiple core or processor systems. It is designed to exploit the benefits of advanced hardware available without even worrying about them. The article below discusses the available APIs in PFX, and explores the performance aspect.

Development Environment

Concurrent Programming and Parallel Programming

Concurrent and Parallel Programming are very old buzz words. Both are very closely associated. Concurrent Programming is the concurrent (simultaneous) execution of multiple interacting computational tasks, which are implemented as separate programs, or as a set of processes or threads created by a single program. These tasks can be executed on a single processor, multiple processor, or on a distributed network. Parallel Programming is breaking a large problem into smaller ones and carrying the execution of each chunk simultaneously. There can be different forms of parallel computing like bit-level, instruction-level, data and task-parallelism. Of these two forms, the ones that are important in the context of Task Parallel Library are Data-Parallelism and Task-Parallelism.

Data Parallelism

In data parallelism, the focus is on distributing data across different parallel computing nodes. For a multiprocessor system executing a single set of instructions, data parallelism can be achieved by making each processor perform the same task on different pieces of distributed data. There can be a single thread controlling this, or can be multiple threads. However, the processing of data would be the same for each thread.

Task Parallelism

Task parallelism is the execution of a different thread by each processor on the same or different data. The threads may execute the same or different code. For a simple example for a two processor machine, if we want to do two tasks, it will be CPU ‘a’ doing task ‘A’ and CPU ‘b’ doing task ‘B’.

Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e., threads), as opposed to the data (data parallelism).

Overview of Parallel Extensions

Parallel Extensions are a set of APIs provided by Microsoft, mainly designed to take advantage of systems that have multiple core and multiple processors using data parallelism, task parallelism, and the coordination on the parallel hardware.

It helps developers to take advantage of parallel hardware without worrying about scalability and performance. The best thing about it is that with the number of processors increasing, developers don’t need to re-write the code.

Extensions mainly target three areas of Concurrent and Parallel Programming:

  1. Declarative Data Parallelism
  2. Declarative implies what you want to be done with data rather than how you want that to be done. LINQ introduced us to a new set of APIs based on the declarative programming model. PLINQ introduces us to a new set of APIs that applies to Declarative Data Parallelism.

  3. Imperative Data Parallelism
  4. Parallel Extensions gives an API for interactive data operations such as loops. Parallel extensions introduces method like For and ForEach that have mechanisms to automatically divide the work in a loop to run on parallel hardware.

  5. Imperative Task Parallelism
  6. Parallel extensions introduces the Task Parallel Library that has an API for expressing parallelism in the form of tasks rather than data. The extension schedules the task to run on parallel hardware, and provides us with ways to Cancel or Wait.

    The extension includes Task Parallel Library and PLINQ built on top of a scheduler. This scheduler is a robust, efficient, and scalable engine designed to use co-operative scheduling, and work-stealing algorithms to achieve fast, efficient, and maximum CPU utilization. We can also write our own extensions on top of this scheduler.

Declarative Data Parallelism (PLINQ)

Declarative Data Parallelism is done using Parallel LINQ. PLINQ is the implementation of LINQ to Objects that executes queries in parallel, scaling to utilize the number of processors available. Executing queries with PLINQ is similar to executing queries with LINQ. PLINQ supports all query operators supported by System.Collection.Enumerable.

However, LINQ to SQL and LINQ to Entities are not supported as the queries are supposed to be executed by the database or the query operators.

Converting your existing code in LINQ to PLINQ is very easy. You need to add a reference to the System.Threading assembly. The next thing you need is to wrap your data source around the IParallelEnumerable<T> interface using a call to the AsParallel extension method.

LINQ

C#
var q = from file in fileList  
         where file.Length > 10000 
        select file; 

PLINQ

C#
var q = from file in fileList.AsParallel()
         where file.Length > 10000                      
        select file;

Using the AsParallel method makes sure that the compiler uses ParallelEnumerable<T> rather than Enumerable<T> of query interfaces. However, just like LINQ, PLINQ has deferred execution, i.e., the query does not get evaluated until a foreach loop goes over it, or the GetEnumerator is called, or ToList, ToArray, ToDictionary methods are called. Now, this is a very important thing to take note of from the performance point of view.

So far, wrapping a data source around the IParallelEnumerable interface just makes sure that when execution takes place, it uses the Data Parallelism algorithm. How you process the output is totally another thing. When one wraps the data source with IParallelEnumerable<T>, the resulting collection is also of the IParallelEnumerable<T> type. PLINQ offers the following three ways to process the output:

  1. Pipeline Processing Model
  2. As the name implies, another thread is added for enumerating the query output in addition to the threads processing the data. In simple words, there would be a dedicated thread for enumeration along with several threads executing the query.

    The default method of output processing is pipeline processing. So, there is nothing different done for this from LINQ.

    C#
    foreach(var e in q)
    {
         Console.WriteLine(e);
    }

    One thread will manage foreach whereas the others will execute the query .The foreach loop uses the System.Collection.Generic.IEnumerator<T> for iterating. Hence, when MoveNext() is called, a set of worker threads execute the query, and results are returned from this and all the subsequent calls to MoveNext(). IParallelEnumarable <T> has its own GetEnumerator(bool usePipeLining) method.

    So, an explicit way of processing the output would be:

    C#
    using (var e = fileGroup.GetEnumerator(true))  
    {
       while (e.MoveNext())
       {
          Console.WriteLine(e);
       }
    }

    Passing false would be Stop and Get Processing. Pipeline processing has a benefit of reducing memory requirement for holding the result. However, it has a disadvantage that there are many producer threads and only one consumer thread, that might cause uneven distribution of work, leading to processor inefficiency.

  3. Stop and Get Processing Model
  4. Stop and Get Processing, as the name implies, is that the thread that starts the enumeration stops and joins the other thread that is executing the query to execute the query. Once the complete results are obtained, it resumes iteration over the result. To use this, we have to pass false in the GetEnumerator method of IParallelEnumerable<T>.

    C#
    using (var e = fileGroup.GetEnumerator(false ))  
    {
       while (e.MoveNext())
       {
           Console.WriteLine(e);
       }
    }

    When the MoveNext() is called for first time, the thread that has started the enumeration joins another thread and finishes executing the query, i.e., the query finishes executing when the first call to MoveNext() is made and we just iterate through the result in an in-memory collection.

    This model is slightly more efficient than the Pipeline model because of two reasons. Firstly, there is less synchronization overhead between the producer and consumer threads. Secondly because, PLINQ knows better where the data is going and how it can be accessed. ToArray, ToList sort, by default, are based on the Stop and Go model.

  5. Inverted Enumeration Model
  6. In this model, a lambda function is provided to PLINQ stating what processing needs to be done. This function is passed to all the threads executing the queries along foreach. IParallelEnumerable<T> exposes a ForAll method for this. Unlike the Pipeline and Stop-Go models, it does not collect results and then process them. Instead, the final function called is passed to each thread via the ForAll extension method. This is the fastest method, but it requires that the function passed to ForAll be thread safe and preferably free of locks and side effects.

    C#
    fileGroup.ForAll(q=>Console.WriteLine(q));

Concurrent Exception

Exception in PLINQ is different from exception handling in LINQ. In LINQ, the queries are executed sequentially, so if an exception occurs, execution is aborted. However, in PLINQ, all queries are executed in parallel. Hence, PLINQ handles exceptions differently. For example:

C#
string[] arrStr = new string[]{"1","2","3","v", 
                                "1","2","3","'",
                                "1","2","3","12222222222" };
var query = from a in arrStr.AsParallel()
            select int.Parse(a);
query.ForAll(q=>Console.WriteLine(q));  

Every time we run the query, we will get an exception. However, since the PLINQ query is getting executed on multiple threads, we might get different exceptions. I am using “might” because it all depends on PLINQ has divided and when the exception occurred. The example above has two parsing errors and one value too large error. Since PLINQ executes multiple threads, it may be possible that I might get only two input errors, or one input and one value too large errors, or may be all. Well, actually, behind the scenes, when an exception occurs in any thread, PLINQ tries to stop all the other threads as quickly as possible. This may happen in time or not. Once all the threads are stopped, PLINQ aggregates all exceptions into a new System.Threading.AggregateException and throws it. These exceptions are available as inner exceptions along with their stack trace. We can look at the inner exceptions and do the proper handling.

In my next article, I will discuss Imperative Data Parallelism and Imperative Task Parallelism.

Using the Sample Application

I have attached a sample application for comparing execution times of LINQ and PLINQ using different output processing. You can edit the query by taking a look at the FileInformation class and play along it. It’s not too robust, but can easily compare LINQ and PLINQ.

Further Reading

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer SAP Labs
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
GeneralMy vote of 5 Pin
Amir Mehrabi-Jorshari19-Oct-10 23:04
Amir Mehrabi-Jorshari19-Oct-10 23:04 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.