Click here to Skip to main content
15,886,071 members
Articles / Programming Languages / SQL

CLOB Comparison Line by Line in PL/SQL

Rate me:
Please Sign up or sign in to vote.
4.60/5 (4 votes)
18 Apr 2011BSD4 min read 36.9K   496   10   13
An implementation of the Longest Common Subsequence (LCS) diff algorithm in PL/SQL.

Introduction

I wrote this package for fun in order to be able to compare with PL/SQL code. The changes between to packages are saved as CLOBs. But it can be used to compare any two CLOBs where the only restriction is the maximum line length (4000).

It is based on the naive Longest Common Subsequence algorithm described here:

From the links above, we can see that for X and Y which are two sequences with length m and n respectively, the Longest Common Subsequence for the two sequences is defined by:

Image 1

The pseudo-code to calculate the LCS Matrix (M) is given by:

function LCS_MATRIX(X[1..m], Y[1..n])
    M = array(0..m, 0..n)
    for i := 0..m
       M[i,0] = 0
    for j := 0..n
       M[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]
                M[i,j] := C[i-1,j-1] + 1
            else:
                M[i,j] := max(M[i,j-1], M[i-1,j])

And based on the LCS Matrix, we can calculate the differences between both sequences with the following pseudo-code:

function printDiff(M[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i > 0 and j > 0 and X[i] = Y[j]
        printDiff(M, X, Y, i-1, j-1)
        print "  " + X[i]
    else
        if j > 0 and (i = 0 or M[i,j-1] ≥ M[i-1,j])
            printDiff(M, X, Y, i, j-1)
            print "+ " + Y[j]
        else if i > 0 and (j = 0 or M[i,j-1] < M[i-1,j])
            printDiff(M, X, Y, i-1, j)
            print "- " + X[i]
        else
            print ""

My goal on this article was to implement the pseudo-code above in the PL/SQL language, being the sequences X and Y defined by the lines of two CLOBs.

How the Algorithm Implementation Works

In the first step of the algorithm, the CLOBs are transformed into Arrays, where each line of the CLOB is mapped into each record of the array. This is done in the function clob_to_array which returns a varchar2 array.

One of the major difficulties of implementing this algorithm is that PL/SQL is not multi-dimension array friendly. While in other languages it's pretty easy to create an m x n array, in PL/SQL, it's a little bit more complicated. Thanks to Solomon Yakobson from Oracle Forums for teaching me how to do this:

SQL
TYPE    t_number_array IS TABLE OF NUMBER;
TYPE    t_bidimensional_number_array IS TABLE OF t_number_array;
matrix  t_bidimensional_number_array := t_bidimensional_number_array();

FOR i IN 1 .. m + 1
LOOP
  matrix.extend;
  matrix(i) := t_number_array();
  FOR j IN 1 .. n + 1
  LOOP
    matrix(i).extend;
  END LOOP;
END LOOP;

Notice that in PL/SQL, the arrays start at value 1, so in order be able to have a Matrix with dimension [0..m,0..n], the Matrix I used is defined by [1..m+1,1..n+1].

After having the old and new file arrays, and having the LCS Matrix initialized, we can start populating it. The following procedure describes the logic used:

SQL
FOR i IN 1 .. m + 1
LOOP
  FOR j IN 1 .. n + 1
  LOOP
    IF old_array(i - 1) = new_array(j - 1) THEN
      matrix(i)(j) := matrix_i(i - 1) (j - 1) + 1;
    ELSE
      matrix(i)(j) := greatest(matrix(i) (j - 1), matrix(i - 1) (j));
    END IF;
  END LOOP;
END LOOP;

I wanted to display the differences line by line, instead of showing just one single output with the added and deleted lines as the one described in Introduction. So my main output is an array t_differences_array which shows the old and new lines side by side.

SQL
CREATE OR REPLACE TYPE t_differences AS OBJECT
(
  old_line_number NUMBER,
  old_file        VARCHAR2(4000),
  status          VARCHAR2(20),
  new_line_number NUMBER,
  new_file        VARCHAR2(4000)
);
CREATE OR REPLACE TYPE t_differences_array as table of t_differences;

For example, given the files OLD_FILE{A,D,E,B} and NEW_FILE{A,C,B}, the standard algorithm would display the differences like this:

example1.jpg

While the goal I wanted to achieve was to not add extra lines when they were not necessary, which means changing the output to:

example2.jpg

This goal was achieved by adding extra logic to the get_differences procedure. Mainly by saving the line where the differences started, and then starting to populate the other side differences from this saved line until the next equal line.

In the case where the number of different lines do not match, we will have one of the sides with a blank line number. This will also help distinguish the difference from a non-existing line from an existing blank line.

The following code optimizations are included in the algorithm:

  • Reducing the problem set: Since in real life cases we normally compare files with small modifications, the LCS Matrix can be greatly reduced if we don't take into consideration the beginning and the end of the files where the lines don't change.
  • Reducing the comparison time: When lines are too big, we can reduce the comparison time by hashing the lines and only comparing the hashed value.

Because of the reduced problem set optimization, the real code has three extra parameters defined by the first line where the differences start and the two end lines where the differences stop. These two parameters make the code harder to understand.

Using the code

It includes two main functions for comparing:

SQL
FUNCTION compare_line_by_line(old_file_i IN CLOB,
                              new_file_i IN CLOB) RETURN t_differences_array

FUNCTION compare_line_by_line_refcursor(old_file_i IN CLOB,
                                        new_file_i IN CLOB) RETURN SYS_REFCURSOR

where the only difference is the output type.

I included a test table compare_file_test, so you can easily test the program for two different test cases.

Below, you can see an example of how to use the test table to compare two files:

SQL
WITH t AS
 (SELECT *
    FROM compare_file_test
   WHERE id = 1)
SELECT a.*
  FROM t,
       TABLE(clob_compare.compare_line_by_line(t.old_file, t.new_file)) a

clob_compare.jpg

What is still missing

I plan to include a few extra options in future versions:

  • Show only different lines
  • Ignore cases
  • Ignore spaces
  • Trim lines

Last notes

I've only tested my implementation for a few test cases, it is possible that some bugs still exist.

I would appreciate suggestions/comments in order to improve and optimize the code.

History

  • Version 1.0: 2011/04/05.

License

This article, along with any associated source code and files, is licensed under The BSD License


Written By
Technical Lead
Portugal Portugal
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
Questiondoes a new version exist? Pin
Member 1171234326-Jul-21 9:12
Member 1171234326-Jul-21 9:12 
Questionwow thank you! Pin
skalka22-Jan-21 11:21
skalka22-Jan-21 11:21 
PraiseVery Good Work Pin
Member 147081377-Jan-20 23:19
Member 147081377-Jan-20 23:19 
PraiseCLOB Comparison Line by Line in PL/SQL Pin
hmasarra25-Jul-18 1:50
hmasarra25-Jul-18 1:50 
QuestionGreat work, thanks! Pin
Member 1319354711-May-17 0:39
Member 1319354711-May-17 0:39 
Questionoptimized version of clob_to_array Pin
Member 380462622-Sep-16 6:46
Member 380462622-Sep-16 6:46 
Questiongreat tool Pin
Member 380462620-Sep-16 7:39
Member 380462620-Sep-16 7:39 
BugOne more bug Pin
Andres Tsengov16-Nov-11 21:46
Andres Tsengov16-Nov-11 21:46 
GeneralRe: One more bug Pin
mptiga5-Dec-12 8:13
mptiga5-Dec-12 8:13 
GeneralBug in the code Pin
Dave_Five13-Apr-11 21:33
Dave_Five13-Apr-11 21:33 
GeneralRe: Bug in the code Pin
Manuel Vidigal14-Apr-11 5:53
Manuel Vidigal14-Apr-11 5:53 
GeneralRe: Bug in the code Pin
Manuel Vidigal18-Apr-11 1:02
Manuel Vidigal18-Apr-11 1:02 
Hi Dave,

I just uploaded version 1.1, which should correct the bug you found. Can you please check if it's ok now?


Thanks in advance.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.