Have you ever wanted to execute simple SQL SELECT statements against in-memory collections and lists? Well, I have, and this article presents a mechanism to do that utilizing LINQ and runtime generated and compiled Lambda Expressions. The result is support for runtime query evaluations like:
SELECT
var result = source.Query<Person, Tuple<string, double>>( "SELECT Address, Avg(Age) FROM this GROUP BY Address"); var result2 = source.Query<Person, Family>( "SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address"); var result3 = source.Query<Person>("SELECT * FROM this ORDER BY age"); var result4 = source.Query<Person, string>("SELECT DISTINCT address FROM this")
I'm a big fan of LINQ, especially LINQ to Objects. Code that might be tens of lines long using a foreach iteration and evaluation can often be shortened to 1 or 2 lines. Anything that reduces code lines is a big plus for maintainability and quality. As I've used LINQ more and more in both production and hobby code, I've found that while the Enumerable extension methods are pretty easy to follow and work with, I continue to trip on the "almost SQL" inline LINQ C# keywords. Perhaps I've been at this too long, but my fingers just won't start a query statement with any other word than SELECT.
foreach
Enumerable
My personal short comings aside, I've found that LINQ has a more practical limitation, namely those "almost SQL" statements are still tightly coupled to an application's static structure. I was a big fan of the VARIANT back in the day, and have always thought that IDispatchEx never really got a chance to show its real potential outside of Microsoft apps before COM went out of style. Dynamic typing has a lot of advantages especially in increasingly large, complex, and distributed systems. Perhaps, I should switch to Python, but C# pays the bills.
VARIANT
IDispatchEx
Luckily, Microsoft has been adding dynamic typing features to .NET and C#. C#/.NET 4.0 is adding some interesting features: the F# Tuple type has been made part of the BCL; C# gets a dynamic keyword and access to the Dynamic Language Runtime, and the BCL gets an ExpandoObject allowing even statically typed languages like VB.NET and C# to take on some features of a Duck Typed language. The combination of dynamic and static typing within C# may be a powerful new addition, or it might end up being a Frankenstein with the worst features of both approaches. It will be interesting to see how all of this plays out over time.
Tuple
dynamic
ExpandoObject
But I digress. The real reason for this article is that I've always wanted to write a runtime evaluation engine. .NET 3.5 (with the addition of System.Linq.Expressions) and the dynamic typing features of C# 4.0 have provided the right set of tools. So I gave it a whirl.
System.Linq.Expressions
What came out the other end lets you take something like this:
var result = from p in source group p by p.Address into g where g.Average(p => p.Age) > 40 select new { Address = g.Key, AverageAge = g.Average(p => p.Age) };
and replace it with something like this:
var result = source.Query<Person, dynamic>("SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address HAVING AverageAge > 40")
Why would you want such a thing after Microsoft went to all that trouble creating Linq? Well most of the time you don't. Linq is awesome for the majority of application where the data model is static and the view the application needs of it changes only slowly over time. It's for those cases where the data model is dynamic or the apps view of the data cannot be defined at compile time that I started poking around with this idea.
The basis for this article begins with a previous installment: Taking LINQ to SQL in the Other Direction. That article describes the basic parsing and evaluation infrastructure used. Much of the underlying code is the same (or at least started it out the same) especially in the area of SQL parsing using the GOLD parser. A limitation in that previous code was that the data being evaluated had to be in the form of an IDictionary<string, object> and that was also how the data was returned to the caller.
IDictionary<string, object>
It also owes some inspiration to the Dynamic LINQ example that Microsoft published as part of the VS2008 Samples collection.
The attached code and project was created with Microsoft Visual Studio 2010 Beta 2, and some of the new features of C# and .NET 4.0, such as:
ExpandObject
The upshot is that you won't be able to play with the attached code using VS.NET 2008 or .NET 3.5.
The entry point in the API is a small set of extension methods that extend IEnumerable<T> and take a SQL statement in the form of a string. There are two sets of overloads:
IEnumerable<T>
string
QueryScalar
int
Query
The SQL string is immediately turned into a Query<,> derived object (one for scalar queries, one for enumerable queries). The responsibilities of the Query classes are:
Query<,>
The parsing happens in the base class, and is largely the same as it was in the previous article.
Creating the evaluation functions is a matter of determining which parts of a SQL query are present and generating a lambda for each one:
protected override void OnCompile() { if (SyntaxNode.GroupByClause == null) Select = SyntaxNode.CreateSelector<TSource, TResult>(); else GroupBySelect = SyntaxNode.CreateGroupBySelector<TSource, TResult>(); if (SyntaxNode.WhereClause != null) Where = SyntaxNode.WhereClause.CreateEvaluator<TSource>(); if (SyntaxNode.OrderByClause != null) OrderBy = SyntaxNode.OrderByClause.CreateFunction<TSource>(); if (SyntaxNode.HavingClause != null) Having = SyntaxNode.HavingClause.CreateEvaluator<TResult>();
with a resulting evaluation method that executes each lambda in turn:
public override IEnumerable<TResult> Evaluate(IEnumerable<TSource> source) { if (Where != null) source = source.Where<TSource>(Where); if (OrderBy != null) source = OrderBy(source); IEnumerable<TResult> result = Enumerable.Empty<TResult>(); if (Select != null) result = source.Select<TSource, TResult>(Select); else result = GroupBySelect(source); if (Having != null) result = result.Where<TResult>(Having); if (SyntaxNode.Columns.Distinct) return result.Distinct(); return result; }
The SQL syntax is based on the SELECT portions of SQL 89's DML. INSERT, DELETE, and UPDATE abilities may come later.
INSERT
DELETE
UPDATE
In order to support both querying a type's properties and its value, there is a value() function built in. This allows queries such as the following to be differentiated:
value()
List<string> l = new List<string>(); l.Add("don"); l.Add("phillip"); l.Add("kackman"); IEnumerable<string> result = l.Query<string>("SELECT * FROM this WHERE value() = 'don'"); IEnumerable<string> result2 = l.Query<string>("SELECT * FROM this WHERE Length > 3");
Also, the this entry in the join chain portion of the FROM clause is merely a placeholder. It refers to the same this that is passed to the extension methods or the IEnumerable<T> passed to the Evaluate methods. In a future update, I'd like to add support for joins across multiple collections, but at the moment, that capability isn't present.
this
FROM
Evaluate
The most challenging part of this was creating result selectors. SQL return types are polymorphic depending on the query. A SELECT query can return a subset of the input data without a type transformation, with queries such as SELECT * FROM this WHERE Age = 40. It can return a subset of the fields from the input data type: SELECT name, age FROM this. It can return a single value in situations like SELECT Avg(Age) FROM this. Or queries can return completely transformed types that are aggregations of the input data: SELECT name, Avg(age) FROM this GROUP BY name.
SELECT * FROM this WHERE Age = 40
SELECT name, age FROM this
SELECT Avg(Age) FROM this
SELECT name, Avg(age) FROM this GROUP BY name
The type parameters passed to the Query methods indicate both the type contained in the enumerable and the type to create and return.
public static IEnumerable<TResult> Query<TSource, TResult> (this IEnumerable<TSource> enumerable, string sql)
For a query that will return the same type as passed in, both TSource and TResult will be the same. There is an overload for this case that only takes one type parameter.
TSource
TResult
IEnumerable<int> source = TestData.GetInts(); IEnumerable<int> result = source.Query<int>("SELECT * FROM this WHERE value() > 3");
In this case, the selector is a simple identity lambda:
public static Expression<Func<TSource, TResult>> CreateIdentitySelector<TSource, TResult>() { ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg"); return Expression.Lambda<Func<TSource, TResult>>(arg, arg); }
For selecting single properties from the source type, a lambda is created that returns the value of that property from each source object. This doesn't require the creation and initialization of new objects:
IEnumerable<Person> source = TestData.GetPeople(); IEnumerable<int> result = source.Query<Person, int>("SELECT Age FROM this WHERE age > 40"); ... public static Expression<Func<TSource, TResult>> CreateSinglePropertySelector<TSource, TResult>(string propertyName) { ParameterExpression param = Expression.Parameter(typeof(TSource), "item"); return Expression.Lambda<Func<TSource, TResult>>( Expression.PropertyOrField(param, propertyName), param); }
Returning multiple properties requires creating new instances of TResult and populating them with the result data. There are three approaches to doing this.
In most cases, it is expected that the return type has read/write properties for each of the fields in the SELECT statement. In this case, if the source property name is not the same as the result property, the AS keyword can be used to map the two. In the example below, the Person class has a property named Address, while OtherPerson has Location:
AS
Person
Address
OtherPerson
Location
var result = source.Query<Person, OtherPerson>("SELECT name, address AS location FROM this") ... public static Expression<Func<TSource, TResult>> CreateMemberInitSelector<TSource, TResult> (IEnumerable<string> sourceFields, IEnumerable<string> resultFields) { ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg"); // loop through all of the result fields and generate an expression // that will assign it from the source fields of the param var bindings = new List<MemberAssignment>(); int i = 0; foreach (string field in resultFields) { MemberInfo member = typeof(TResult).GetPropertyOrField(field); MemberExpression memberExpression = Expression.PropertyOrField(arg, sourceFields.ElementAt(i++)); bindings.Add(Expression.Bind(member, memberExpression)); } var init = Expression.MemberInit(Expression.New(typeof(TResult)), bindings); return Expression.Lambda<Func<TSource, TResult>>(init, arg); }
As a side note: all property names are evaluated in a case insensitive fashion. If TSource or TResult has Property and property, exceptions will be thrown.
Property
property
A special selector is created for the Tuple type which has a constructor that takes an argument for each constituent property. The order of fields in the Select statement must match the order of arguments in the constructor declaration.
Select
var result = source.Query<Person, Tuple<string, string>>("SELECT name, address FROM this") ... public static Expression<Func<TSource, TResult>> CreateConstructorCallSelector<TSource, TResult> (IEnumerable<string> fields, Type[] constructorTypes) { ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg"); var bindings = fields.Select(field => Expression.PropertyOrField(arg, field)); // the values that will intialize a TResult ConstructorInfo constructor = typeof(TResult).GetConstructor(constructorTypes); // the constructor for a new TResult NewExpression _new = Expression.New(constructor, bindings); return Expression.Lambda<Func<TSource, TResult>>(_new, arg); }
Another special selector is created when dynamic is specified as TResult. In this case, you will always get a collection of ExpandoObjects back. ExpandoObject implements IDictionary<string, object> to store the set of dynamically assigned properties, and this interface is used to populate the return objects. It is via this mechanism that this API goes from statically to dynamically typed. Something I noticed about the ExpandoObject is that its property names are case sensitive. I don't know if that's good or bad, but for some reason, I expected them not to be, since it would seem to mesh more with a dynamically typed environment.
var result = source.Query<Person, dynamic>("SELECT age, address FROM this") ... public static Expression<Func<TSource, TResult>> CreateExpandoSelector<TSource, TResult> (IEnumerable<string> sourceFields, IEnumerable<string> resultFields) { ParameterExpression arg = Expression.Parameter(typeof(TSource), "arg"); // loop through all of the result fields and generate an expression that will // add a new property to the result expando object using its IDictionary interface var bindings = new List<ElementInit>(); MethodInfo addMethod = typeof(IDictionary<string, object>).GetMethod( "Add", new Type[] { typeof(string), typeof(object) }); int i = 0; foreach (string field in resultFields) { MemberExpression memberExpression = Expression.PropertyOrField(arg, sourceFields.ElementAt(i++)); bindings.Add(Expression.ElementInit(addMethod, Expression.Constant(field), Expression.Convert(memberExpression, typeof(object)))); } var expando = Expression.New(typeof(ExpandoObject)); return Expression.Lambda<Func<TSource, TResult>>( Expression.ListInit(expando, bindings), arg); }
By far the most challenging selector was GROUP BY. This involves not only a type transformation between TSource and TResult but the calculation of properties on the return type as opposed to only assignment. It took me a while to wrap my head around how to do this without calculating and caching some intermediate state for each aggregate. In the end, I created a type, GroupByCall, to cache the delegate for each aggregate at compile time for later invocation during evaluation.
GROUP BY
GroupByCall
var result = source.Query<Person, Family>( "SELECT Address, Avg(Age) AS AverageAge FROM this GROUP BY Address") ... public static Expression<Func<IEnumerable<TSource>, IEnumerable<TResult>>> CreateGroupBySelector<TSource, TResult>(string groupByField, Type keyType, IEnumerable<AggregateNode> aggregates) { // create the key selector // the parameter passsed to the keySelector var keyLambdaArg = Expression.Parameter(typeof(TSource), "keyLambdaArg"); var keyLambda = Expression.Lambda( Expression.PropertyOrField(keyLambdaArg, groupByField), keyLambdaArg); // the grouped subset passed to resultSelector var group = Expression.Parameter(typeof(IEnumerable<TSource>), "group"); // create an object to cache some state for the result selector GroupByCall<TSource, TResult> groupingCall = new GroupByCall<TSource, TResult>(groupByField); // for each aggregate in the query create a lambda expression // and add it to the cache foreach (AggregateNode aggregate in aggregates) { var aggregateExpression = aggregate.GetCallExpression( typeof(IEnumerable<TSource>), typeof(TSource), group); groupingCall.Aggregates.Add(aggregate.Alias, Expression.Lambda(aggregateExpression, group).Compile()); } // create the call to the result selector var key = Expression.Parameter(keyType, "key"); var groupingFunc = Expression.Call(Expression.Constant(groupingCall), "GroupingFunc", new Type[] { keyType }, key, group); var resultSelectorLambda = Expression.Lambda(groupingFunc, key, group); // package all of that up in a call to Enumerable.GroupBy var data = Expression.Parameter(typeof(IEnumerable<TSource>), "data"); // the input data var groupByExpression = Expression.Call(typeof(Enumerable), "GroupBy", new Type[] { typeof(TSource), keyType, typeof(TResult) }, data, keyLambda, resultSelectorLambda); // create the lambda return Expression.Lambda<Func<IEnumerable<TSource>, IEnumerable<TResult>>>(groupByExpression, data); }
The updated code also supports simple inner joins between to IEnumerables. Given that the outer enumerable needed to be identified in the SQL test I invented a "that" keyword. Perhaps a little hokey but it satisfied my curiosity that this could be done.
[TestMethod] public void JoinAndGroup() { IEnumerable<Person> source = TestData.GetPeople(); IEnumerable<Family> families = TestData.GetFamilies(); var answer = from p in source join f in families on p.Address equals f.Address group f by f.Name into g select new Tuple<string, int>(g.Key, g.Count()); var result = source.Query<Person, Family, Tuple<string, int>>("SELECT that.Name, COUNT(*)" + "FROM this INNER JOIN that ON this.Address = that.Address GROUP BY that.Name", families); Assert.IsTrue(result.Any()); Assert.IsTrue(answer.Any()); Assert.IsTrue(result.SequenceEqual(answer)); }
The attached unit tests contain plenty of examples for different combinations of the above, but most of the basic syntax of SQL SELECT should work. I've had a lot of fun with this code thus far, and am planning on updating it with additional functionality. Hopefully, you find it useful or at the very least interesting.