Dynamically evaluated SQL LINQ queries

Don Kackman

4.95/5 (35 votes)

30 Nov 2013CPOL8 min read

205.2K

2.6K

Extension methods to evaluate plain text SQL queries against IEnumerable collections.

Introduction

Latest code is now maintained on GitHub.

Now with a NuGet Package.

Have you ever wanted to execute simple SQL SELECT statements against in-memory collections and lists? Well, I have, and this article presents a mechanism to do that utilizing LINQ and runtime generated and compiled Lambda Expressions. The result is support for runtime query evaluations like:

var result = source.Query<Person, Tuple<string, double>>(
               "SELECT Address, Avg(Age) FROM this GROUP BY Address"); 
var result2 = source.Query<Person, Family>(
              "SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address"); 
var result3 = source.Query<Person>("SELECT * FROM this ORDER BY age");
var result4 = source.Query<Person, string>("SELECT DISTINCT address FROM this")

I'm a big fan of LINQ, especially LINQ to Objects. Code that might be tens of lines long using a foreach iteration and evaluation can often be shortened to 1 or 2 lines. Anything that reduces code lines is a big plus for maintainability and quality. As I've used LINQ more and more in both production and hobby code, I've found that while the Enumerable extension methods are pretty easy to follow and work with, I continue to trip on the "almost SQL" inline LINQ C# keywords. Perhaps I've been at this too long, but my fingers just won't start a query statement with any other word than SELECT.

My personal short comings aside, I've found that LINQ has a more practical limitation, namely those "almost SQL" statements are still tightly coupled to an application's static structure. I was a big fan of the VARIANT back in the day, and have always thought that IDispatchEx never really got a chance to show its real potential outside of Microsoft apps before COM went out of style. Dynamic typing has a lot of advantages especially in increasingly large, complex, and distributed systems. Perhaps, I should switch to Python, but C# pays the bills.

Luckily, Microsoft has been adding dynamic typing features to .NET and C#. C#/.NET 4.0 is adding some interesting features: the F# Tuple type has been made part of the BCL; C# gets a dynamic keyword and access to the Dynamic Language Runtime, and the BCL gets an ExpandoObject allowing even statically typed languages like VB.NET and C# to take on some features of a Duck Typed language. The combination of dynamic and static typing within C# may be a powerful new addition, or it might end up being a Frankenstein with the worst features of both approaches. It will be interesting to see how all of this plays out over time.

But I digress. The real reason for this article is that I've always wanted to write a runtime evaluation engine. .NET 3.5 (with the addition of System.Linq.Expressions) and the dynamic typing features of C# 4.0 have provided the right set of tools. So I gave it a whirl.

What came out the other end lets you take something like this:

var result = from p in source
group p by p.Address into g 
where g.Average(p => p.Age) > 40
select new { Address = g.Key, AverageAge = g.Average(p => p.Age) };

and replace it with something like this:

var result = source.Query<Person, dynamic>("SELECT Address, Avg(Age) AS AverageAge
         FROM this GROUP BY Address HAVING AverageAge > 40")

Why would you want such a thing after Microsoft went to all that trouble creating Linq? Well most of the time you don't. Linq is awesome for the majority of applications where the data model is static and the view the application needs of it changes only slowly over time. It's for those cases where the data model is dynamic or the apps view of the data cannot be defined at compile time that I started poking around with this idea.

Background

The basis for this article begins with a previous installment: Taking LINQ to SQL in the Other Direction. That article describes the basic parsing and evaluation infrastructure used. Much of the underlying code is the same (or at least started it out the same) especially in the area of SQL parsing using the GOLD parser. A limitation in that previous code was that the data being evaluated had to be in the form of an IDictionary<string, object> and that was also how the data was returned to the caller.

It also owes some inspiration to the Dynamic LINQ example that Microsoft published as part of the VS2008 Samples collection.

Caveat Emptor

The attached code and project was created with Microsoft Visual Studio 2010 Beta 2, and some of the new features of C# and .NET 4.0, such as:

The Tuple type
The dynamic keyword
ExpandObject

The upshot is that you won't be able to play with the attached code using VS.NET 2008 or .NET 3.5.

Using the Code

The entry point in the API is a small set of extension methods that extend IEnumerable<T> and take a SQL statement in the form of a string. There are two sets of overloads:

QueryScalar - returns a single scalar value (like an int)
Query - returns an IEnumerable<T>

The SQL string is immediately turned into a Query<,> derived object (one for scalar queries, one for enumerable queries). The responsibilities of the Query classes are:

Parse the SQL text into a parse tree
Convert the parse tree into expression trees
Convert and cache the expression trees into lambda functions that can be used to evaluate the data sets

The parsing happens in the base class, and is largely the same as it was in the previous article.

Creating the evaluation functions is a matter of determining which parts of a SQL query are present and generating a lambda for each one:

protected override void OnCompile()
{
    if (SyntaxNode.GroupByClause == null)
        Select = SyntaxNode.CreateSelector<TSource, TResult>();
    else
        GroupBySelect = SyntaxNode.CreateGroupBySelector<TSource, TResult>();

    if (SyntaxNode.WhereClause != null)
        Where = SyntaxNode.WhereClause.CreateEvaluator<TSource>();

    if (SyntaxNode.OrderByClause != null)
        OrderBy = SyntaxNode.OrderByClause.CreateFunction<TSource>();

    if (SyntaxNode.HavingClause != null)
        Having = SyntaxNode.HavingClause.CreateEvaluator<TResult>();

with a resulting evaluation method that executes each lambda in turn:

public override IEnumerable<TResult> Evaluate(IEnumerable<TSource> source)
{
    if (Where != null)
        source = source.Where<TSource>(Where);

    if (OrderBy != null)
        source = OrderBy(source);

    IEnumerable<TResult> result = Enumerable.Empty<TResult>();
    if (Select != null)
        result = source.Select<TSource, TResult>(Select);
    else
        result = GroupBySelect(source);

    if (Having != null)
        result = result.Where<TResult>(Having);

    if (SyntaxNode.Columns.Distinct)
        return result.Distinct();

    return result;
}

Some Notes About the SQL Syntax

The SQL syntax is based on the SELECT portions of SQL 89's DML. INSERT, DELETE, and UPDATE abilities may come later.

In order to support both querying a type's properties and its value, there is a value() function built in. This allows queries such as the following to be differentiated:

List<string> l = new List<string>();
l.Add("don");
l.Add("phillip");
l.Add("kackman");

IEnumerable<string> result = l.Query<string>("SELECT * FROM this WHERE value() = 'don'");
IEnumerable<string> result2 = l.Query<string>("SELECT * FROM this WHERE Length > 3");

Also, the this entry in the join chain portion of the FROM clause is merely a placeholder. It refers to the same this that is passed to the extension methods or the IEnumerable<T> passed to the Evaluate methods. In a future update, I'd like to add support for joins across multiple collections, but at the moment, that capability isn't present.

Points of Interest

The most challenging part of this was creating result selectors. SQL return types are polymorphic depending on the query. A SELECT query can return a subset of the input data without a type transformation, with queries such as SELECT * FROM this WHERE Age = 40. It can return a subset of the fields from the input data type: SELECT name, age FROM this. It can return a single value in situations like SELECT Avg(Age) FROM this. Or queries can return completely transformed types that are aggregations of the input data: SELECT name, Avg(age) FROM this GROUP BY name.

The type parameters passed to the Query methods indicate both the type contained in the enumerable and the type to create and return.

public static IEnumerable<TResult> Query<TSource, TResult>
                       (this IEnumerable<TSource> enumerable, string sql)

Subselection

For a query that will return the same type as passed in, both TSource and TResult will be the same. There is an overload for this case that only takes one type parameter.

IEnumerable<int> source = TestData.GetInts();
IEnumerable<int> result = source.Query<int>("SELECT * FROM this WHERE value() > 3");

In this case, the selector is a simple identity lambda:

public static Expression<Func<TSource, TResult>> CreateIdentitySelector<TSource, TResult>()
{
    ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");
    return Expression.Lambda<Func<TSource, TResult>>(arg, arg);
}

Single Properties

For selecting single properties from the source type, a lambda is created that returns the value of that property from each source object. This doesn't require the creation and initialization of new objects:

IEnumerable<Person> source = TestData.GetPeople();
IEnumerable<int> result = source.Query<Person, int>("SELECT Age FROM this WHERE age > 40"); 

...

public static Expression<Func<TSource, TResult>> 
       CreateSinglePropertySelector<TSource, TResult>(string propertyName)
{
    ParameterExpression param = Expression.Parameter(typeof(TSource), "item");
    return Expression.Lambda<Func<TSource, TResult>>(
           Expression.PropertyOrField(param, propertyName), param);
}

Multiple Properties

Returning multiple properties requires creating new instances of TResult and populating them with the result data. There are three approaches to doing this.

In most cases, it is expected that the return type has read/write properties for each of the fields in the SELECT statement. In this case, if the source property name is not the same as the result property, the AS keyword can be used to map the two. In the example below, the Person class has a property named Address, while OtherPerson has Location:

var result = source.Query<Person, OtherPerson>("SELECT name, address AS location FROM this")

...

public static Expression<Func<TSource, TResult>> CreateMemberInitSelector<TSource, TResult>
                (IEnumerable<string> sourceFields, IEnumerable<string> resultFields)
{
      ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");

      // loop through all of the result fields and generate an expression
      // that will assign it from the source fields of the param
      var bindings = new List<MemberAssignment>();
      int i = 0;
      foreach (string field in resultFields)
      {
          MemberInfo member = typeof(TResult).GetPropertyOrField(field);
          MemberExpression memberExpression = 
             Expression.PropertyOrField(arg, sourceFields.ElementAt(i++));
          bindings.Add(Expression.Bind(member, memberExpression));
      }

      var init = Expression.MemberInit(Expression.New(typeof(TResult)), bindings);
      return Expression.Lambda<Func<TSource, TResult>>(init, arg);
}

As a side note: all property names are evaluated in a case insensitive fashion. If TSource or TResult has Property and property, exceptions will be thrown.

A special selector is created for the Tuple type which has a constructor that takes an argument for each constituent property. The order of fields in the Select statement must match the order of arguments in the constructor declaration.

var result = source.Query<Person, Tuple<string, string>>("SELECT name, address FROM this")

...

public static Expression<Func<TSource, TResult>> 
       CreateConstructorCallSelector<TSource, TResult>
      (IEnumerable<string> fields, Type[] constructorTypes)
{
      ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");

      var bindings = fields.Select(field => Expression.PropertyOrField(arg, field));
      // the values that will intialize a TResult

      ConstructorInfo constructor = typeof(TResult).GetConstructor(constructorTypes);
      // the constructor for a new TResult

      NewExpression _new = Expression.New(constructor, bindings);
      return Expression.Lambda<Func<TSource, TResult>>(_new, arg);
}

Dynamic and Expando

Another special selector is created when dynamic is specified as TResult. In this case, you will always get a collection of ExpandoObjects back. ExpandoObject implements IDictionary<string, object> to store the set of dynamically assigned properties, and this interface is used to populate the return objects. It is via this mechanism that this API goes from statically to dynamically typed. Something I noticed about the ExpandoObject is that its property names are case sensitive. I don't know if that's good or bad, but for some reason, I expected them not to be, since it would seem to mesh more with a dynamically typed environment.

var result = source.Query<Person, dynamic>("SELECT age, address FROM this")

...

public static Expression<Func<TSource, TResult>> CreateExpandoSelector<TSource, TResult>
                      (IEnumerable<string> sourceFields, IEnumerable<string> resultFields)
{
      ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg");

      // loop through all of the result fields and generate an expression that will 
      // add a new property to the result expando object using its IDictionary interface
      var bindings = new List<ElementInit>();
      MethodInfo addMethod = typeof(IDictionary<string, object>).GetMethod(
                        "Add", new Type[] { typeof(string), typeof(object) });
      int i = 0;
      foreach (string field in resultFields)
      {
          MemberExpression memberExpression = 
             Expression.PropertyOrField(arg, sourceFields.ElementAt(i++));
          bindings.Add(Expression.ElementInit(addMethod, 
                   Expression.Constant(field), 
                   Expression.Convert(memberExpression, typeof(object))));
      }

      var expando = Expression.New(typeof(ExpandoObject));
      return Expression.Lambda<Func<TSource, TResult>>(
             Expression.ListInit(expando, bindings), arg);
}

Grouping

By far the most challenging selector was GROUP BY. This involves not only a type transformation between TSource and TResult but the calculation of properties on the return type as opposed to only assignment. It took me a while to wrap my head around how to do this without calculating and caching some intermediate state for each aggregate. In the end, I created a type, GroupByCall, to cache the delegate for each aggregate at compile time for later invocation during evaluation.

var result = source.Query<Person, Family>(
   "SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address")

...

public static Expression<Func<IEnumerable<TSource>, IEnumerable<TResult>>> 
                    CreateGroupBySelector<TSource, TResult>(string groupByField,
                    Type keyType, IEnumerable<AggregateNode> aggregates)
{
      // create the key selector
      // the parameter passsed to the keySelector
      var keyLambdaArg = Expression.Parameter(typeof(TSource), "keyLambdaArg");      
      var keyLambda = Expression.Lambda(
                 Expression.PropertyOrField(keyLambdaArg, groupByField), keyLambdaArg);

      // the grouped subset passed to resultSelector
      var group = Expression.Parameter(typeof(IEnumerable<TSource>), "group");
            
      // create an object to cache some state for the result selector
      GroupByCall<TSource, TResult> groupingCall = 
                 new GroupByCall<TSource, TResult>(groupByField);
      // for each aggregate in the query create a lambda expression
      // and add it to the cache
      foreach (AggregateNode aggregate in aggregates)
      {
          var aggregateExpression = aggregate.GetCallExpression(
              typeof(IEnumerable<TSource>), typeof(TSource), group);
          groupingCall.Aggregates.Add(aggregate.Alias, 
              Expression.Lambda(aggregateExpression, group).Compile());
      }

      // create the call to the result selector
      var key = Expression.Parameter(keyType, "key");
      var groupingFunc = Expression.Call(Expression.Constant(groupingCall), 
                           "GroupingFunc", new Type[] { keyType }, key, group);
      var resultSelectorLambda = Expression.Lambda(groupingFunc, key, group);

      // package all of that up in a call to Enumerable.GroupBy
      var data = Expression.Parameter(typeof(IEnumerable<TSource>), "data"); // the input data
      var groupByExpression = Expression.Call(typeof(Enumerable), "GroupBy",
                     new Type[] { typeof(TSource), keyType, typeof(TResult) }, 
                     data, keyLambda, resultSelectorLambda);

      // create the lambda
      return Expression.Lambda<Func<IEnumerable<TSource>, 
                   IEnumerable<TResult>>>(groupByExpression, data);
}

Joining

The updated code also supports simple inner joins between to IEnumerables. Given that the outer enumerable needed to be identified in the SQL test I invented a "that" keyword. Perhaps a little hokey but it satisfied my curiosity that this could be done.

[TestMethod]
public void JoinAndGroup()
{
    IEnumerable<Person> source = TestData.GetPeople();
    IEnumerable<Family> families = TestData.GetFamilies();

    var answer = from p in source
                 join f in families on p.Address equals f.Address
                 group f by f.Name into g
                 select new Tuple<string, int>(g.Key, g.Count());

    var result = source.Query<Person, Family, Tuple<string, int>>("SELECT that.Name, COUNT(*)" + 
          "FROM this INNER JOIN that ON this.Address = that.Address GROUP BY that.Name", families);

    Assert.IsTrue(result.Any());
    Assert.IsTrue(answer.Any());
    Assert.IsTrue(result.SequenceEqual(answer));
}

Conclusion

The attached unit tests contain plenty of examples for different combinations of the above, but most of the basic syntax of SQL SELECT should work. I've had a lot of fun with this code thus far, and am planning on updating it with additional functionality. Hopefully, you find it useful or at the very least interesting.

History

Initial upload - 11/11/2009.
Added joins, bug fixes, misc other stuff - 5/8/2013
Fixed some bugs with nullable type comparisons - 5/25/2013
Added ability to use ExpandoObject as enumerable source and join target - 8/2/2013

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)