![]() |
Platforms, Frameworks & Libraries »
LINQ »
General
Intermediate
License: The Code Project Open License (CPOL)
SqlLinq: Taking LINQ to SQL in the Other DirectionBy Don KackmanParsing SQL statements to create LINQ Expressions. |
SQL, C# 3.0.NET 3.5, LINQ, Architect, DBA, Dev
|
|
Advanced Search Add to IE Search |
|
|
|
||||||||||||||||

I've wanted to write an application that would allow me to execute queries against my music collection for some time now. Having been burning my CDs to MP3 and WMA for some years now, I have thousands of songs, and I've been looking for something a bit more robust than what most music players I've used (like the Windows Media Player) allow. With the introduction of LINQ in .NET 3.5, this seemed like a good opportunity to create an app I've been thinking about for a while, and at the same time dig into a new technology. This set of code is the result.
My ultimate goal is to create a set of libraries that can be used to allow for executing SQL queries against any data source that can represent itself in a relational, semi-relational, and tabular format.
11/11/2009 - An evolution of the query evaluation code is available in an updated version: Dynamically evaluated SQL LINQ queries. The updated version operates on any IEnumerable<T>, not just IDictionary<string, object>.
Since the initial posting of this article, I've continued to toy around with this set of code, and have made some modifications and additions.
DataGridView printing and print preview (thanks Salan Al-Ani)
SELECT MAX(duration / 1000))DELETE support (disabled in the linked code, but easily enabled by removing a throws statement)System.IO.File properties from media propertiesSyntaxTree namespaceThe first thing I came up against with LINQ and the vast majority of examples for its usage was they all assume implicit knowledge of the structure of the underlying data and compile time knowledge of the structure of the desired results. I'm sure this holds true for the vast majority of "line of business" applications where LINQ will really pay dividends. Creating a LINQ query to aggregate and manipulate a sales report or a product catalog page entails detailed knowledge of the data and the requirements of the resulting aggregates. Looking at LINQ to SQL, it generates SQL statements to match the runtime LINQ expression execution, and provides a very nice Object Relational Mapping tool that, in the past, required extensive infrastructure coding or third party tools.
The compiled code wouldn't know the structure of the underlying data (other than that it consists of name value pairs), nor would it know what fields or aggregations were needed, as the exact data manipulation and result structure is defined at runtime, not compile time. Building up a LINQ statement like:
EnumerableRowCollection<DataRow> query =
from order in orders.AsEnumerable()
where order.Field<bool>("OnlineOrderFlag") == true
orderby order.Field<decimal>("TotalDue")
select order;
requires knowing, at a minimum, the order and number of clauses in the where statement as well as the order by statement. So, after some initial reading and investigation into LINQ, I came to the conclusion that I needed to go the other way: SQL to LINQ rather than LINQ to SQL. That lead to some other interesting areas like needing a SQL parser and dynamic construction of complex expressions, but it did make for a good crash course in learning LINQ.
The structure of the code consists of a number of layers:
FileDbProvider - This layer contains an IDataAdapter and associated types like a derived DbConnection and DbCommand. It is the layer that the example user interface interacts with.File.PropertyProvider - This contains a reader for System.IO.File properties as well as the base types for other file property readers.Media.PropertyProvider - This layer knows how to access the tags from media files, and uses the SqlLinq layer to retrieve and evaluate SQL statements passed in from the data provider layer.SqlLinq - This layer contains the SQL parser, and exposes the resulting parse tree along with an API to allow the evaluation of arbitrary sets of data.The UI interacts with the data provider layer much like it would with any other provider:
DataSet data = new DataSet();
MusDbConnection connection = new MusDbConnection(
"root='C:\\Music';recurse=true;FileExtensions=wma,mp3");
connection.Open();
IDataAdapter adapter = new MusDataAdapter("SELECT * FROM media", connection);
adapter.Fill(data);
The DataSet can then be used just like an application would, with one retrieved from SQL Server, Oracle, Jet, or other data sources.
Most of the unit tests will fail when you run them from the attached code because they are written against a set of media files that, for obvious reasons, I haven't included with the download. They do provide basic examples of the SQL syntax that works, and should be pretty easy to point at some other set of media and made to pass again.
This assembly ties together execution with the file property readers with SQL parsing and execution. It operates much like other SQL data providers.
The connection string is of the format name=value, with a semicolon separator like most other database connection formats.
Supported connection string parameters are:
SELECT statement above, the query will search "C:\Music\soundtrack\".true | false - indicates whether to recurse directory structures during query evaluation. Defaults to true if if not provided.Aside from using the SqlLinq layer to process SQL statements, the other thing that is done in this assembly is providing an implementation of IDataAdapter and DbDataReader so that the results can be returned to an application as a System.DataSet. That is all pretty straightforward, and there are plenty of examples out there, so I won't delve into the details.
This assembly contains the base functionality for reading the properties of a file. The interface IPropertyReader is the base type that any component needs to implement in order to be included in a file property query.
public interface IFilePropertyReader
{
string Namespace { get; }
bool Match(string filePath);
IEnumerable<KeyValuePair<string, object>> Read(string path,
IEnumerable<string> selectProperties);
IEnumerable<KeyValuePair<string, object>> Read(string path);
IEnumerable<KeyValuePair<string, Type>> GetFields();
}
In addition to implementing IFilePropertyReader, there are some attributes that are used to define what file types a reader supports and define the "table name" that the reader type will bind to in a SQL statement.
By way of example, the class that basic file properties is declared in, looks like:
[FilePropertyReader("*")]
[PropertyTableName("File")]
[TypePropertyReader(typeof(FileReflectedPropertyReader))]
class FilePropertyReader : PropertyReader
The FilePropertyReader attributes will cause the query engine to use the MediaFileProperyReader to retrieve properties from any file extension. The PropertyTableName attribute above binds this reader to a logical table by the name of file; e.g., a SQL statement SELECT * FROM file. This class reads any of the properties accessible from the System.IO.File class, such as name, extension, size, etc.
Finally, this assembly dynamically loads other file property readers at runtime. It does this by reflecting over an assembly that ends with .PropertyProvider.dll, looking for classes that implement IFilePropertyReader. It does this using a dynamic TypeLoader library.
This assembly contains an implementation of IFilePropertyReader that can return media properties from meta-data tags using the Windows Media Format SDK.
[FilePropertyReader("mp3")]
[FilePropertyReader("wma")]
[FilePropertyReader("asf")]
[FilePropertyReader("wmv")]
[PropertyTableName("Media")]
class MediaFilePropertyReader : IFilePropertyReader
I started with some simple code that could read ID3 v1.1 tags, but wanted something a bit more robust (ASF container formats, other ID3 tag versions etc.). A bit of searching lead to the Windows Media Format SDK, and now the retrieval of the actual tag values is all done using WMVCore.dll and the interfaces IWMMetadataEditor, IWMMetadataEditor2, and IWMHeaderInfo3. The code that performs this retrieval started its life as part of one of the SDK samples. It's now modified to fit the needs of this application and my particular coding style. As such, I think it may be a bit more generically useful than the original sample as it doesn't do the same tag of formatting as it deserializes the tags (except for byte arrays which are only returned as string representations indicating their length). If you are looking for some media tag access code, check out the FileMediaTags class in the attached code. It should be pretty easy to yank out and bend to your needs.
The meat of retrieving the actual tags has a basic Win32 flavor to it; calling methods twice to get buffer sizes, and then passing them back in to be filled, null terminated strings etc. If you've worked with Windows APIs at all, nothing too surprising with this piece.
private static IWMHeaderInfo3 GetHeaderInfo(string path)
{
IWMMetadataEditor editor;
WMFSDKFunctions.WMCreateEditor(out editor);
IWMMetadataEditor2 editor2 = (IWMMetadataEditor2)editor;
editor2.OpenEx(path, FILE_ACCESS.GENERIC_READ, FILE_SHARE.FILE_SHARE_READ);
return (IWMHeaderInfo3)editor2;
}
private static void GetAllMediaTags(IDictionary<string, object> tags, string path)
{
IWMHeaderInfo3 headerInfo3 = GetHeaderInfo(path);
try
{
ushort wAttributeCount;
headerInfo3.GetAttributeCountEx(0, out wAttributeCount);
for (ushort wIndex = 0; wIndex < wAttributeCount; wIndex++)
{
WMT_ATTR_DATATYPE wAttribType;
string pwszName = null;
ushort wAttribNameLen = 0;
ushort pwLangIndex = 0;
uint pdwDataLength = 0;
// get the length of this attribute name
// and value in order to alloc the buffers
headerInfo3.GetAttributeByIndexEx(0,
wIndex,
pwszName,
ref wAttribNameLen,
out wAttribType,
out pwLangIndex,
null,
ref pdwDataLength);
pwszName = new String('\0', wAttribNameLen);
ReadAndAddAttribue(tags, headerInfo3, wIndex, pwszName, pdwDataLength);
}
}
finally
{
((IWMMetadataEditor)headerInfo3).Close();
Marshal.FinalReleaseComObject(headerInfo3);
headerInfo3 = null;
}
}
The tags from each media file are returned from the loader in an IDictionary<string, object>. The dictionary keys are case insensitive to match the case insensitivity of the SQL parsing. As the code iterates all of the files in the specified directory structure, it builds up an IList<IDictionary<string, object>>. This IList represents the result set from the query, with each IDictionary being analogous to a single row. The list of rows is how data is returned to the provider layer.
This is where all the good LINQ-y stuff happens! (Well, actually, there is more Linq.Expression stuff going on in here than anything else.)
SQL parsing started with a very simplistic approach using what was little more than a rudimentary tokenizer. That got me as far as the simple SELECT field FROM source part, but not much farther. A day or so of searching came up with nothing usable, until I came upon the Grammar Oriented Language Developer or GOLD application. This thing started as Devin Cook's master's thesis, and is really a remarkable tool.
What it does is take in the rules for an arbitrary language in BNF format and spit out a set of parse tables to be used later by a language specific state machine parse engine. There are even a number of engines available in multiple languages including C#. This code uses a C# engine by Vladimir Morozova, which is included in the download. There was even a specification for a simple SQL syntax available (extended slightly to add column aliases, boolean literals, and to allow HAVING clauses to contain aggregate expressions). There is also a nice article here on CodeProject that explains the GOLD application and its use in more depth. Basically, this solved my problem perfectly, and kudos to the authors of the GOLD application.
Parsing proceeds from the leaves to the trunk, building up the parse tree along the way. As the parser traverses tokens and rules, it yields up a Reduction rule which corresponds to the various parts of the language. A custom attribute is used to map C# classes to SQL rules and instantiate them as the parser does its work.
[SyntaxNode(RuleConstants.RULE_WHERECLAUSE_WHERE)]
public class WhereClause : NonTerminalNode
{
public WhereClause()
{
}
public IEnumerable<IDictionary<string, T>> Evaluate<T>(
IEnumerable<IDictionary<string, T>> source)
{
return source.Where(CreateEvaluationFunction<T>());
}
private Func<IDictionary<string, T>, bool> CreateEvaluationFunction<T>()
{
PredicateNode predicate = FindChild<PredicateNode>();
if (predicate != null)
return predicate.CreateEvaluationFunction<T>();
LiteralNode node = FindChild<LiteralNode>();
if (node != null)
return node.CreateEvaluationFunction<T>();
Debug.Assert(false);
return null;
}
}
static class SyntaxRuleFactory
{
private static TypeLoader<NonTerminalNode, int> _nodeImplTypeMap = LoadImplTypes();
public static NonTerminalNode CreateNode(Rule rule)
{
Debug.Assert(rule != null);
NonTerminalNode node = null;
if (_nodeImplTypeMap.Contains(rule.Index))
node = _nodeImplTypeMap.CreateInstance(rule.Index);
else
node = new NonTerminalNode();// if no type is bound to the rule then
// just create a base non-terminal node
node.Rule = rule;
return node;
}
private static IEnumerable<int> GetRuleIds(Type t)
{
return t.GetCustomAttributes(typeof(SyntaxNodeAttribute),
false).Select(attr => (int)((SyntaxNodeAttribute)attr).RuleConstant);
}
private static TypeLoader<NonTerminalNode, int> LoadImplTypes()
{
TypeLoader<NonTerminalNode, int> loader = new TypeLoader<NonTerminalNode, int>();
loader.SearchDirectories = false;
loader.LoadMany(GetRuleIds);
return loader;
}
}
A SQL statement that looks like SELECT * FROM soundtrack WHERE bitrate = 128000 results in a parse tree like:
<Select Stm> [Rule Id=RULE_SELECTSTM_SELECT Class=SelectStatement]
SELECT
<Columns> [Rule Id=RULE_COLUMNS_TIMES Class=Columns]
<Restriction> [Rule Id=RULE_RESTRICTION Class=NonTerminalNode]
*
<Into Clause> [Rule Id=RULE_INTOCLAUSE Class=NonTerminalNode]
<From Clause> [Rule Id=RULE_FROMCLAUSE_FROM Class=FromClause]
FROM
<Id Member> [Rule Id=RULE_IDMEMBER_ID Class=NodeWithId]
soundtrack
<Join Chain> [Rule Id=RULE_JOINCHAIN2 Class=NonTerminalNode]
<Where Clause> [Rule Id=RULE_WHERECLAUSE_WHERE Class=WhereClause]
WHERE
<Pred Exp> [Rule Id=RULE_PREDEXP_EQ Class=EqualityNode]
<Value> [Rule Id=RULE_VALUE_ID Class=NodeWithId]
bitrate
=
<Value> [Rule Id=RULE_VALUE_INTEGERLITERAL Class=IntegerLiteral]
128000
<Group Clause> [Rule Id=RULE_GROUPCLAUSE Class=NonTerminalNode]
<Having Clause> [Rule Id=RULE_HAVINGCLAUSE Class=NonTerminalNode]
<Order Clause> [Rule Id=RULE_ORDERCLAUSE Class=NonTerminalNode]
Once the parse tree is built, then it is just a matter of traversing it in order to generate the appropriate LINQ constructs or expressions appropriate to each node. For instance, in the SelectStatement (the root node for a SELECT statement), the evaluation function uses the various sub clauses with each child node, further modifying the data passed in according to its SQL semantics and its own branch of the parse tree:
public override IEnumerable<IDictionary<string, object>> Evaluate(
IEnumerable<IDictionary<string, object>> data)
{
// constrain results by where clause conditions
if (WhereClause != null)
data = WhereClause.Evaluate(data);
// calculate any aggregated values
data = EvaluateAggregates(data);
// constrain aggrated values by having conditions
if (HavingClause != null)
data = HavingClause.Evaluate(data);
// order the results
if (OrderByClause != null)
data = OrderByClause.Evaluate(data);
// and post process (remove any intermediate columns not specified in the
// select clause)
return PostProcessResults(data);
}
The OrderByClause uses the OrderBy and ThenBy extensions to do ordering. Similarly, the GroupByClause uses LINQ's grouping constructs to roll up the source data:
public IEnumerable<IDictionary<string, object>>
Evaluate(IEnumerable<IDictionary<string, object>> source,
IEnumerable<AggregateNode> aggregates)
{
IList<IDictionary<string, object>> list =
new List<IDictionary<string, object>>();
foreach (NodeWithId item in GroupByItems)
{
var groupBy = from d in source
group d by d.ContainsKey(item.LookupId) ?
d[item.LookupId] : DBNull.Value
into g
select g;
foreach (var g in groupBy)
{
IDictionary<string, object> dict =
new Dictionary<string, object>(StringComparer.InvariantCultureIgnoreCase);
dict.Add(item.LookupId, g.Key.ToString());
foreach (AggregateNode aggregate in aggregates)
dict.Add(aggregate.LookupId,
aggregate.Evaluate<object>(g.AsEnumerable()));
list.Add(dict);
}
}
return list;
}
public IEnumerable<NodeWithId> GroupByItems
{
get
{
return FindDescendants<NodeWithId>();
}
}
In order to implement the WHERE and HAVING clauses, heavy use is made of System.Linq.Expressions. This made implementing that logic pretty straightforward as most SQL operators have a direct analog in the Expression namespace. The one glaring exception is LIKE, which I've implemented using a RegEx. The main task for expression evaluation is to generate a Func<IDictionary<string, object>, bool> that can then be used with the LINQ Where<T> extension method. All of this work is done with a class hierarchy that starts with the PredicateNode. This abstract base class implements basic functionality like building expressions to index the input IDictionary, type coercion, and navigation to sub predicates and operands.
public Func<IDictionary<string, T>, bool> CreateEvaluationFunction<T>()
{
ParameterExpression param = Expression.Parameter(typeof(IDictionary<string, object>),
"arg");
//traverse the tree to generate a lambda expression and then compile into a function
return Expression.Lambda<Func<IDictionary<string, object>, bool>>
(CreateOperatorExpression(param, GetLeftExpression(param)), param).Compile();
}
protected abstract Expression CreateOperatorExpression(ParameterExpression param,
Expression left);
Because the data type of each field in the input data is not known when the expression is built, type coercion is somewhat lax. If an expression is being compared to an integer literal, for example, a Expression.Convert will be used to coerce the input to an integer (ultimately using System.Convert). If the type cannot be determined from a literal, there is a simple system of fall backs. Arithmetic expressions will default to the real domain; for instance, boolean operations defaulting to bool, and most other predicates falling back to strings.
private LiteralNode FindCoercionType(int index)
{
if (index != 0 && index != 2)
return null;
LiteralNode node = FindChild<LiteralNode>(OppositeSide(index));
// look at what the child operand is being compared to
if (node == null && (this.Index == 0 || this.Index == 2))
node = Parent.FindChild<LiteralNode>(OppositeSide(this.Index));
// look at what the whole expression is being compared to
// if we don't find any literals in the area, look for a predicate
// expression that can drive the type coercion
if (node == null)
{
PredicateNode predicate = FindChild<PredicateNode>(OppositeSide(index));
// look at what the child operand is being compared to
if (predicate == null &&
(this.Index == 0 || this.Index == 2))
predicate = Parent.FindChild<PredicateNode>(OppositeSide(this.Index));
// look at what the whole expression is being compared to
if (predicate != null)
node = predicate.GetExpressionType();
}
return node;
}
In this example usage of dynamic expression building (media file tag collections), not every row in the input data will have the same set of fields. This is because not every media file has the same set of tags embedded within it. For this reason, an extra step is taken when indexing each dictionary, first checking ContainsKey, and returning DbNull if that returns false, basically allowing the same set of fields to be evaluated for every row, and allowing every row to be conceptually NULLABLE. This pattern of first checking ContainsKey shows up in a number of places in the code.
private ConditionalExpression CreateDereferenceExpression(ParameterExpression param,
int index)
{
NodeWithId idNode = FindChild<NodeWithId>(index);
if (idNode != null)
{
// in order to avoid KeyNotFoundExceptions this will create an expression
// of the general form:
// if(dictionary.ContainsKey(key))
// return dictionary[key];
// else
// return NULL;
MethodInfo indexerMethod = typeof(IDictionary<string, object>).GetMethod(
"get_Item");
MethodInfo containsKeyMethod = typeof(IDictionary<string, object>).GetMethod(
"ContainsKey");
LiteralNode node = GetTypeCoercionNode(index); // this is used to coerce
// the value in the dictionary to
// the correct type for comparison
Expression key = Expression.Constant(idNode.EvaluationId);
Expression containsKey = Expression.Call(param, containsKeyMethod, key);
Expression returnValue = Expression.Convert(Expression.Call(param,
indexerMethod, key), node.ValueType, node.CoercionDelegate.Method);
Expression returnNull = Expression.Constant(node.NullRepresentation,
node.ValueType);
return Expression.Condition(containsKey, returnValue, returnNull);
}
return null;
}
With all of the heavy lifting done in the PredicateNode base class, each derived class just needs to provide the correct Expression in its override of CreateOperatorExpression:
[SyntaxNode(RuleConstants.RULE_PREDEXP_EQ)]
public class EqualityNode : PredicateNode
{
public EqualityNode()
{
}
protected override Expression CreateOperatorExpression(
ParameterExpression param, Expression left)
{
return Expression.Equal(left, GetRightExpression(param));
}
}
With the current implementation, evaluating a set of data is a two step process. First, the client code has to retrieve the larger set of data to be evaluated. It can use the parse tree to guide the retrieval, but the actual de-serialization out of the source data store exists entirely outside of the SqlLinq assembly. Second, it uses the SqlLinq code to evaluate the data which applies the logic for everything but the FROM clause. Eventually, I'd like to explore IQueryable<T> in order to eliminate the two part sequence. The good thing with the current two step implementation is that the SqlLinq code knows nothing about the nature of the data other than that it can be represented as an IList<IDictionary<string, object>>. The one place where this is compromised slightly is that since this test case searches folder hierarchies, the FromClause does not expose the JOIN chain as one would find with a truly relational data source. The FROM clause contains the comma separated list of "tables" to evaluate: SELECT * FROM media, file, which will return all media and file properties.
I learned a lot in writing this article, from how to get tags out of a media file, to the details parsing and parse trees, and then on to LINQ and LINQ Expressions. For a procedural/OO guy like myself, LINQ took some time to wrap my head around. I hope to keep extending this as time allows and see what other sorts of data stores can be evaluated with textual SQL statements...
Another point of interest is that the pluggable architecture makes it straightforward to add additional types of file queries. For instance, the downloadable code includes an Assembly file property reader that can be used to find assemblies that match custom search criteria.

Without too much more work, this could be turned into a very generically reusable file search and analysis application. Maybe, as I continue to tinker with it, that's what it will become.
IQueryable<T> and provide an implementation that can go directly at the data rather than requiring the client code to retrieve it and then evaluate the results.UPDATE statement.LIKE, NOT LIKE, IN, and NOT IN operators.| You must Sign In to use this message board. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
General
News
Question
Answer
Joke
Rant
Admin
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 12 Nov 2009 Editor: Smitha Vijayan |
Copyright 2008 by Don Kackman Everything else Copyright © CodeProject, 1999-2009 Web22 | Advertise on the Code Project |