Introduction
I have a lot of applications which use the DataGridView
component. The
DataGridView is a real work horse, though sometimes, the data it holds
can be a bit worrying! In particular, I get a lot of text from FreeDB,
where volunteers type in their CD titles. Each data enterer has their
own unique "style". Some people are fussy and
consistent, where others splatter the walls with happy typing.
After spending way too much time on corrections, I started to
wish that
the grid were more like a word processor, where you can do global
replacements. This component is an answer to that need -
something that plugs into the grid, and allows an end-user to whip data
into shape without having to do the whole thing by hand.
The GridProcessor component is a window, which works like the
Search
and Replace window in a word processor. It allows you to keep it open
while you work on the data underneath (and like the real thing, it sits
on
top of the item that it's found!)
It has four tabs for editorial and data cleaning functions,
plus a fifth tab for macros.
Find | Finds a string, and
moves to the cell
that has that string. |
Search and replace
|
Substitutes text. You
can name, save
and load specific expressions. |
Upper/lower case
| All the normal stuff,
with the
possibility of unusual things as well. Like the search and replace
commands, you can store these items.
|
Split
|
Division of strings,
assigning the
pieces to the column(s) of your choice
|
Macros
|
Cleaning data always
involves multiple
operations. To take away drudgery, macros play sets of operations in
a sequence.
|
For both philosophical and practical reasons, the testing box
is a constant. It knows which tab is visible, and conducts the
appropriate test on the string you provide.
Casing, Replacing and Regex
Most of the work revolves around replacement. Replacements can
substitute one chunk of text for another (an eye for a tooth), or they
can use regular expressions. Regular expressions let you to treat text
abstractly, so they're extremely useful. They're even more useful for
scaring people, and regular expressions can also be hard to work out
again if you forget them. Therefore, the GridProcessor lets you save
and recall expressions, and to give them names. This also makes it
easier to give the component to non-technical end-users.
I was particularly pleased with the idea of using Regex for
casing, though I'm sure I'm not the first person to think of that.
Regular expressions don't have a provision for changing case, but we
only need Regex to identify
situations where we want to change case.
After that, we can use regular string functions to do the work of
uppering and lowering. The Regex Match object tells us both the start
index and the length of the string we want to change, so it's easy to
set up once we get there.
The benefit of the approach is that not only can we do upper,
lower, and proper case, but we can also do some more curvy things. For
instance, in Spanish only the first word of a title is capitalized, and
everything else is lower case. We can make a rule called "Spanish title
case". French people like to
uppercase their surnames — «Like
THIS» —
so we
could devise a "shout-your-SURNAME case" if we wanted to. Rather
than catalogue all the things that people are likely to do, why not
just give them the ability to do it when they think of it?
Grids and Documents
While there are similarities
between grids and documents (hence the "GridProcessor" idea), the
grid is a structured container, where a document is not. In a way, the
grid is a bunch of tiny
documents, each of which has the "subject" of its column header.
What's significant for us is that we can choose which columns get
updated, and we can leave the rest alone. In a word processing
document, we either fire blindly or confirm each possible change.
Here, we can fine-tune the changes, and thanks to the structure of the
grid, even a global replacement can be more precise.
Because each cell is a
small document, our expression logic may be slightly
different. Also,
changes are applied at the cell level, so if there are two instances
of the word "the", for instance, both are changed, unless we
specify an anchoring position. I had thought about allowing the user
to confirm each change within a cell, but it wouldn't do them any
favors when used from a macro. It would either plow through against
their wishes, or leave unfinished business without telling anyone.
It's best to get the expression right, so that it works the same
way everywhere.
By the way, some of the
changes could be
surprising, so the component has a confirmation screen which shows the
"before and after", with differences highlighted in blue.
Currently, it only does the color-coding for literal replacements,
leaving the user to work out what's changed when regular
expressions are
involved. Perhaps someone would like to submit an improved version
with a good, color coded Diff()
function.
The Specifics
On initialization, the
GridProcessor
receives a grid reference as a parameter. Using the reference, it
reads metadata from the grid, and compiles a list of columns that it
can work with. At this point the component ignores everything except
text (no numbers), and skips columns that are invisible. There's an
array of "allowable" column classes, which can be
added to if
you've devised your own type of column.
The component keeps a live
reference to
the grid, and hence to all of the grid's data. Using the grid
reference, the
window can apply changes interactively, and can also control the
location of the current cell.
At the base is a
class called GridProcess
, which holds
instructions and working
variables for transforming grid text. Search and replace
and
casing operations are both replacements, and they're
represented by a Replacement
class, which
inherits GridProcess
and adds a ReplaceExpression
member. The Split
also inherits from GridProcess
,
though its concerns are with columns. It retrieves
the relatively-stable column headings from the database, turning them
into more volatile column indices. Since it's possible for a split
result to go into
multiple columns, the result assignment works like a HashTable
,
with
an array of column indices for each member of the string that's
split. Below is a partial class diagram, made with Sacha Barber's
wonderful AutoDiagrammer tool.
There's a bit of
overkill here, especially for a small component. The main advantage
of the inheritance is that all of our objects can share the same
data, which is strikingly similar. However, the functionality is
quite different. One item evaluates an expression so we can jump to a
cell. Another item transforms text. Another breaks up text and
redistributes it. All these classes rely on external data, which is
either in the grid, or in a text box (as part of a test). There are
different needs, and each class has a very different factory
function.
It would be nice
for each object to have a Process()
function
which we could call in
sequence, but that would involve a certain amount of violence. It's
more natural to use the objects as data structures, and to program in a
way that suits the grid we're working against. Each operation has a
row function, and a simple, inner function which we can independently
from the test harness box. While not exactly at the forefront of OO
programming, at least we're tidy and symmetrical.
The inheritance scheme does pay off again when we want to play
a macro.
It's conceptually simple to have a list of commands, and to play each
one
in sequence. We compile a list of GridProcess items (our base class),
and dispense with each one in a loop.
The macro takes a
generic list of type GridProcess (the base class)
List<GridProcess> processList;
...
processList = Util.RetrieveMacroProcesses(macro.ID, mGridColumns);
for (int i = 0; i < processList.Count; i++)
{
... execute each class of rule
}
Again, we have to cast each item to the proper type and then
call the correct functions, but it's conceptually simple and it's
orderly.
The Database
The database is an
Access file, though it could be any structured database at all. I chose
Access because the driver is readily available, and also because you
can "cheat" by firing up Access itself if you need to. If you think
your users may figure out how to do this as well, you can rename the file and
extension, and supply a custom setup object. (The explanation is in the
Quick Start section, below.)
Data
retrieval is set up through OLEDB, so it would be extremely easy to
port to another DBMS. Wherever possible I use the "official" combination of
DataAdapter
s and DataSet
s,
even though they have a mind of their own at times.
This is how the database looks:
Basically, the
GridProcess table stores any type of process for transforming data.
The Macro table stores a sequence of processes.
The main database
issues were the usual .NET ones:
-
Telling the
application
where the data file is. The wizards like to hard-code
locations. This
application has a "Settings" class, which calls a function in the Utils
class to find the main application directory. The function strips off
the "bin\debug" stuff, so that it will look in the
same place both for development and production.
-
Saving
strings directly
with SQL commands. (When it makes more sense than using a DataAdapter
.)
Common characters such as
apostrophes
can make a
database throw a wobbly, and this varies between DBMSes. Even though
it's a pain, it's best to use parameter objects, and let
the
driver-writers worry about all the things that could go wrong.
-
Resources.
At times, The .NET
wizards store their select
,
update
and delete
statements as resources inside a form.
This is a major headache when changing data structures, and in any
case, seems like an odd
thing to do. Hopefully, I've included absolutely
everything, and no further changes are necessary.
- Getting an ID
number from a sequence. (Not a .NET issue, but a general
one.) The app
re-selects the item it inserted to get the new ID.
Quick start
The component is
meant to be pluggable. It automatically looks for a file called
GridProcessor.mdb in the same folder as your application, and is
clever enough to find it from your bin/debug directory during
development.
The best way to
use it is to declare a form-wide variable, set to null:
GridProcessor.FormGridProcessor mGridProcessor = null;
and then to
initialize it only if that hasn't been done before. That way it
remembers where it was every time you Show()
it.
if (mGridProcessor == null)
mGridProcessor = new GridProcessor.FormGridProcessor(dataGridView_myGrid);
mGridProcessor.Show(this);
If you have another location, another filename,
or another connection string, you can pass a
GridProcessor.Settings
object, using an
alternate constructor.
GridProcessor.Settings mySettings = new GridProcessor.Settings();
mySettings.DataDirectory = @"D:\MyDirectory";
mySettings.FileName = "OtherFile.abc";
mySettings.ConnectionString = "Fripping lyshus wimbgunts, awhilst moongrovenly
kormzibs. Bleem miserable venchit! Bleem forever mestinglish asunder
frapt!";
...
if (mGridProcessor == null)
mGridProcessor = new GridProcessor.FormGridProcessor(dataGridView_myGrid,
mySettings);
Using a custom setting, you can have multiple files
for
different grids
in the same application, a single file for an application, or, if you
like, a central file for everything (though there's no code for
record-locking.)
The grid processor form also has a Grid
property, which is the reference that the component uses.
There's a fair bit of danger involved in
exposing this, but it could be helpful in MDI applications. If you have
a
series of windows with the exact same grid, you can set the
GridProcessor to operate on the topmost window.
Further Work
There are a lot
more that can be done with such a component. Fields could be
concatenated, or used in decision logic. There could be a settings
file for the screen as a whole, with default columns, rows, and the
preferred tab. There could be security settings to prevent tampering
with saved items. There could be a library of functions which you
could embed inside a replacement expression. The component could deal
with numbers. And most important, there are the things I
haven't thought of, that are perfectly obvious to other people who use
it.
There's no reason any of this couldn't be done, excepting
time, of
course, and loads of work. The GridProcessor is a project on
SourceForge, which you're welcome to join. It's
here.
Conclusion
This component was a typical answer to a time-consuming
situation — to step back and let the computer do the work. Of
course, that took a bit of time in itself (call it "administrative
overhead"), but the programming paid for itself during the first day of
use, and saved some horrifically tedious typing. More than that, the
code was immediately useful in several other projects, and took only a
few lines to set up. The big
job, as always, was whip the code into shape for public consumption.
That said, I do hope that other people find this component helpful. (If
not, I'll gladly refund the full price
they paid for it!)
Please let me know what you think. I'd be
interested in knowing where the component ended up, what sort of work
it's doing, and which operations and expressions turned out to be the
most useful. And if you like it, don't forget to vote.
History
03 May 2007 - First submission.