Using a DataReader like a List in F#
This article demonstrates a technique how you can write recursive algorithms in F# using a Reader instead of a list, so you don't have to load all your data into memory first
Introduction
A lot of algorithms in F# use the List
datastructure and recursion to solve tasks. This is very well for smaller datasets, but when they get bigger, the overhead of first converting them to a list can become significant if not prohibitive. This article presents a simple technique on how an IDataReader
can be used just like you would a List
, letting you stream large datasets across your algorithm, so it will use far less memory.
Background
When you compare the List
and DataReader
, they're actually pretty similar in how they work. A list
has a Head
that is the current element and a Tail
with the other elements. The DataReader
has a current element and a set of next elements that you can access using MoveNext()
. This made me wonder if you couldn't just use a DataReader
like a list
and do away with first loading the whole thing into memory in a List
. Turns out, it's pretty easy.
Update
I wrote this article when I was relatively new to F# and still rather dogmatic about functional programming, I don't do this anymore. To me, the elegance of F# is that you use functional coding where it fits best, and can mix in imperative coding where that just works better than purely functional code. IEnumerables
in F# are handled by sequences that are well known. So the way to go is to wrap your IDataReader
in a seq
that exposes the elements in a strongly typed way.
I'll leave the rest of the article because tail recursion is still a useful concept to learn and the internet should be written in ink, but here's a quick snippet of how I would do this today:
open System.Data.SqlClient
open System.Data
// a simple record that mirrors the table in the DB
type SomeRecord = {
id : int
val1 : double
val2 : double
} with
// actually these functions could exist on their own outside the record,
// I prefer to keep them together
// create a new record from the current rdr position
static member fromRdr(rdr:IDataReader) = {
id = rdr.GetInt32 0;
val1 = rdr.GetDouble 1;
val2 = rdr.GetDouble 2
}
// project an IDataReader into a seq<SomeRecord>
static member asSeq (rdr:IDataReader) = seq { while rdr.Read()
do yield SomeRecord.fromRdr rdr} // here we mix in a simple while loop to
//free us from recursing down the stack to traverse the IDataReader. MUCH simpler.
// open up connection (I didn't wrap any of this into functions here to just
// show how to use this)
let cn = new SqlConnection "server=.;database=SomeDatabase;integrated security=true"
cn.Open()
// open DataReader
let cmd = new SqlCommand("select id, val1, val2 from table1", cn)
let rdr = cmd.ExecuteReader()
rdr
|> SomeRecord.asSeq // project datareader into seq<SomeRecord>
|> Seq.sumBy (fun r-> r.val1 * r.val2) // now you can use all the Seq operators
// that everybody knows.
// finish up
rdr.Close()
cn.Close()
This is the rest of the original article...
A Textbook F# Recursive Algorithm
Let's begin by looking at a basic F# recursive algorithm using a List
. This one just takes a list of numbers and adds them up.
// create a list of numbers
let l = [1..10]
// create a function to add them together
let rec SumList l =
match l with
head::tail -> head + Sum tail // peel off the head, and add
// its value to the value of the tail
| [] -> 0 // until there are no more elements left,
// return zero for that
// call the function on the list, returns 55 in this case
SumList l
This is a typical example of how recursion is used in F# to elegantly describe an algorithm. The way the numbers are added here is the list is divided into two parts: the first element: the head, and the other elements: the tail. The value of the whole list is the value of the head plus the value of the tail. Duh. The function calls itself again with the tail (which each time is one element shorter than the list passed in) until there is no more tail. That's what the []
stands for, the empty list. When it comes across that, it returns 0
. Now it goes back up the stack adding up all the numbers and tadaa: the result.
Typically when you'd write an algorithm like the one above, the data isn't hardcoded in there, you load that data from somewhere into a list and then apply the logic. The problem with this is however that now that entire set has to be loaded into memory. This can be a challenge when the dataset is big. It would be nicer on memory if we could load only a portion of that data into memory to run the algorithm on, dispose of that and then read the next portion.
Doing That with a DataReader
Here's the same algorithm, but now using a DataReader
instead of a list
. (Note: Do not run off now and use this code in production because it will fail miserably as we will demonstrate).
let rec SumReader (dr:IDataReader) =
match dr.Read() with
true -> dr.GetInt32(1) + SumReader dr
| false -> 0
This works pretty much the same as the example above. It accepts an IDataReader
, but instead of splitting that into a head and a tail, it calls movenext
on it. That returns true
if there is a new current element, and false
if not. If there was a next element, it adds the value of that to the value of the rest of the elements in the reader. Just like the example above, it keeps doing that until there is no next element, then it unwinds the stack to add all the numbers together.
Oops... No Tail Call
Using the algorithm above works fine for smaller datasets, but throws a StackOverflowException
on larger ones. As far as Exceptions go, that's about as bad as they come. You can't catch them, when one is thrown in your code, it's lights out. The whole point of using a reader over a list was to be able to handle larger datasets, so we're clearly not there yet.
The reason we get this exception is because we're not making a tail call. F# uses a technique called Tail Call Optimization (TCO, explained here, here and here) to make code run more efficiently. Basically, what that does is take a recursive function and rewrite that as an iterative one. When the original function meets a couple of criteria, this can be done pretty easily, the major criterion being that the call to itself must be the last thing the function does in that branch. When rewritten, the function now no longer calls itself but does a loop, which is more efficient because there are no more pushes to, and pops from, the stack. Pushes to the stack is what killed us, TCO is what we want.
If you look carefully at the code of our DataReader
example, you see that the recursive call to itself is actually not the last thing it does, even when it's the last thing on that line, because it adds the result of that call to the value of the current element. THAT is the last thing it does and that makes it impossible for F# to rewrite it as a tail call. (Thanks to Tomas Petricek for pointing that out). So what can we do about it?
Enter the Accumulator Pattern
What we need to do is pass an extra variable that holds the result so far. Each time we add the value of the current element to this result so far and then pass that to the next call. This way, we don't have to do anything after the call returns. This technique is called the accumulator pattern. When we do it like that, the call to itself IS the last thing it does and the F# compiler can do the Tail Call Optimization that will let us run larger datasets across. Here's what that looks like:
let rec SumReaderAcc (dr:IDataReader, accum) =
match dr.Read() with
true -> SumReaderAcc(dr, accum + dr.GetInt32(1)) // now this is the last thing
// it does, enabling TCO
|false -> accum // when there are no more elements,
// return the accumulator
It's ok now to run off and fix your production issue, but if you stick around a little longer I'll show you how you can turn this into a more generic approach that you can reuse in your code. In general, typing is bad, reuse is good.
Making Things a Bit More Generic
So far the functionality of this code is pretty fixed, it adds the first column of the query. Since F# is a functional language, why not pass a function to be applied to every element in the reader to get a value (functional dependency injection, so geeky). That way, it will be a bit more generic already. Here's the code for that:
// passing the function to apply to each element as a parameter
// already makes it more generic
let rec SumReaderF (dr:IDataReader, func, accum) =
match dr.Read() with
true -> SumReaderF(dr, func, accum + func(dr)) // pass reader and func
// on to next call, and also apply func to element and add to accumulator
|false -> accum
// create a function that takes an IDataReader and returns an int
let myAdd(element:IDataReader) = element.GetInt32(1)
// apply the function to each element in the reader
SumReaderF(rdr, myAdd, 0)
Now that's better already, but we can still do better. For one, the user now has to enter a starting zero for accum
for this to work, and it only adds up the calls to the elements, no way to for instance multiply or average. The code below fixes those two issues and is as far as we will go today. It uses an inner function to handle the recursion and accumulator, so the user only has the pass the reader and the functions. There are two functions passed to it: one function that is applied to the current element of the IDataReader
to extract a value, and a second aggregate function that is applied to the result of the element function and the accumulator.
let ApplyFuncToReader (dr:IDataReader, func, agg) =
let rec ApplyInternal (accum) = // create inner function to handle
//the recursion and accumulator.
// No need to pass everything for each call,
// just the accumulator will do
match dr.Read() with
true -> ApplyInternal(agg(accum, func(dr))) // apply func to element
// and agg to func
|false ->
dr.Close() // be nice, close the reader
accum // and return the result
ApplyInternal 0 // and execute that inner function passing 0
// as starting value for accumulator
Using the Code
Here's an example of how to call the code above:
// create a function to apply to each element
let myElementFunction(element:IDataReader) = element.GetInt32(1)
// create an aggregate function to apply between each call to elements
let myAggregate(x, y) = x + y
// create a SqlDataReader (NewReader is a helper function
// I wrote that returns an open SqlDataReader)
let rdr = NewReader()
// and pass all three to ApplyFuncToReader
ApplyFuncToReader (rdr, myElementFunction, myAggregate)
First we create 2 functions, one to be applied to every element in the IDataReader
to extract a value and a second that aggregates the calls to the first function. Here we just add them up, but we could also multiply them or do something more complicated. We don't have to change the implementation of ApplyFuncToReader
itself to change its behaviour.
Points of Interest
The code above will work fine in most situations encountered in daily work, especially when using F# interactive. When used in multithreaded code, it's not exactly the same as using a List
because someone else could be fiddling with your Reader
making your results unreliable. Although the resultset
is protected by the SQL transaction of your choice, the Reader
itself could be advanced to a next element outside of your control. These things can be easily avoided obviously but it is important to understand that you are not protected by immutability like you would be with a List
.
I went for an IDataReader
here because databases is typically where my data lives. The same technique can of course be applied to a FileReader
or StreamReader
when your data is stored in a file on disk, or comes to you from the cloud in a stream. You no longer need to buffer it all into memory before you can work on it and that's a big plus for large datavolumes. I hope this helps you as it helps me. If it does, please drop me a line or rate my article.
History
- 21-07-2010: Version 1.0
- 07-10-2011: Updated article