The common requirement in many enterprise applications is logging of data changes in a database - what data has changed, who changed them and when (audit logging). Bloggers published many articles about this old problem but there are only few general approaches on how to do that in relational databases.
Promiscuous Women Cause Problems for Database Administrators
This is a fiction story that never happened:
Juliet Capulet meets Paul Smith; they fell in love and marry on 25th August 2009. When Juliet is back at work, she asks the db admin to change her name in a company database. The admin is not happy to have additional work with this stuff but he does it for her.
Juliet's and Paul's marriage is not going well and Juliet divorces Paul after a short period. Soon Juliet feels alone, she meets Romeo Montague and, of course, falls in desperate love with him. Logically, another wedding is arranged and Juliet asks db admin to change her name again on 10th January 2010.
The tragedy of all Juliet's love relationships finishes when she finds her husband Romeo as a roulette player spending all their money. She divorces again and changes her name to Juliet Singleton on 21st July 2010 with the vision not to marry anybody again. Needless to say, the db admin is close to committing suicide because his database does not allow changing the name more than twice.
What would you suggest the database admin to do better in his next life?
Let us summarize what the db admin was asked to do with Juliet's personal data:
- Juliet Capulet gets the job on 2005-03-01
- Name Juliet Capulet changes to Juliet Smith on 2009-08-25
- Name Juliet Smith changes to Juliet Montague on 2010-01-10
- Name Juliet Montague changes to Julie Singleton on 2010-07-21
Solution No. 1: Roll off Out-dated data
The simplest solution for keeping historical data in the database is to create a duplicated record before update. After that, we can update the requested data. In case of tracing Juliet's name, we get these data in the database table:
In real database schema,
id is primary key and has to be unique. So we need an extra column containing information about row id from which data were rolled off. Additionally, it would be helpful to have a column representing status of the record. Because the situation gets more complex, here I show how data evolves in the database:
Juliet Capulet gets the job on 2005-03-01.
Name Juliet Capulet changes to Juliet Smith on 2009-08-25.
Name Juliet Smith changes to Juliet Montague on 2010-01-10.
Name Juliet Montague changes to Julie Singleton on 2010-07-21.
You can see that each change to the record with
- creating duplicated row with
- updating the column name to a requested value with
created_date set to current date.
When the active record is going to be "deleted", then only the status flag changes the value from active to history.
It is simple to implement audit logging based on rolling out old data. You do not need additional tables. Maintenance is simple, too. If you decide to remove old data, then it is the matter of one SQL.
But what if you need to implement audit logging in more tables? You need to add columns
created_by. And you have to implement previously introduced roll-off logic, either to database directly (usually with table triggers) or in the application.
The problem can be that when data changes, then the whole record is copied, i.e. also data which does not change. It causes data duplication and, potentially, database disk space can increase markedly if changes occur frequently. For example, if table
persons has a column photo with binary data containing the photography then each and every time Juliet's name changes the whole record (including the photo) is copied to rolled-off record.
Another disadvantage is that the complexity of each table supporting audit logging increases. You must have in mind all the time that retrieving the records is not simple. You always have to use the
SELECT clause with condition:
SELECT * FROM persons WHERE status='active'
When you need to roll back historical data for a particular record, you have to copy all columns from the historical record to the active record. It would be easier just to switch status value from "
history" to "
active", but it would break reference integrity if the record was referenced in other tables.
Solution No. 2: Dedicated Data-Tracing Table
Another approach is based on separate audit log table that is dedicated to logging data changes coming from all tables having audit log feature:
|1241||persons||name||124||2005-03-01||admin||Juliet Capulet||Juliet Capulet|
|1654||persons||name||124||2009-08-25||admin||Juliet Capulet||Juliet Smith|
|2547||persons||name||124||2010-01-10||admin||Juliet Smith||Juliet Montague|
|3645||persons||name||124||2010-07-21||admin||Juliet Montague||Juliet Singleton|
This solution separates the concerns much better. There is only one common data store for all historical data. Further, you have to implement audit log feature only once for all tables which should support it.
Compared to the previous solution, when the record changes, only the changed data are logged. Thus it is expected to be a little bit faster but I did not do any tests so I cannot prove it exactly. The real advantage is that you need lower disk space to store data changes. You can improve the table structure when you normalize the log table and you put just integer references to "
table" and "
You can easily retrieve user activity just by the query
SELECT * FROM audit_log WHERE changed_by='admin'
All data changes are logged in a common table so old and new values stored there must be of some generic type. For example, in Microsoft SQL, it can be "
sql_variant". Or you can convert the values to "
varchar" before storing them in the log table.
Rollback is a rather difficult task to be implemented. If you want to roll data back to 3rd March 2010, you have to use query:
SELECT [table], [column], old_value, new_value FROM audit_log WHERE _
[table]='persons' AND row=124 and changed_date > '2010-03-01' ORDER BY changed_date
and then you iterate through the result set and build the dynamic query because table and column information is in the result. It is more complicated compared to the previous solution. Fortunately, the feature has to be implemented only once and then can be reused across the whole application.
- 30th August, 2010: Initial post