Click here to Skip to main content
Click here to Skip to main content

CLOB Comparison Line by Line in PL/SQL

By , 18 Apr 2011
 

Introduction

I wrote this package for fun in order to be able to compare with PL/SQL code. The changes between to packages are saved as CLOBs. But it can be used to compare any two CLOBs where the only restriction is the maximum line length (4000).

It is based on the naive Longest Common Subsequence algorithm described here:

From the links above, we can see that for X and Y which are two sequences with length m and n respectively, the Longest Common Subsequence for the two sequences is defined by:

The pseudo-code to calculate the LCS Matrix (M) is given by:

function LCS_MATRIX(X[1..m], Y[1..n])
    M = array(0..m, 0..n)
    for i := 0..m
       M[i,0] = 0
    for j := 0..n
       M[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]
                M[i,j] := C[i-1,j-1] + 1
            else:
                M[i,j] := max(M[i,j-1], M[i-1,j])

And based on the LCS Matrix, we can calculate the differences between both sequences with the following pseudo-code:

function printDiff(M[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i > 0 and j > 0 and X[i] = Y[j]
        printDiff(M, X, Y, i-1, j-1)
        print "  " + X[i]
    else
        if j > 0 and (i = 0 or M[i,j-1] ≥ M[i-1,j])
            printDiff(M, X, Y, i, j-1)
            print "+ " + Y[j]
        else if i > 0 and (j = 0 or M[i,j-1] < M[i-1,j])
            printDiff(M, X, Y, i-1, j)
            print "- " + X[i]
        else
            print ""

My goal on this article was to implement the pseudo-code above in the PL/SQL language, being the sequences X and Y defined by the lines of two CLOBs.

How the Algorithm Implementation Works

In the first step of the algorithm, the CLOBs are transformed into Arrays, where each line of the CLOB is mapped into each record of the array. This is done in the function clob_to_array which returns a varchar2 array.

One of the major difficulties of implementing this algorithm is that PL/SQL is not multi-dimension array friendly. While in other languages it's pretty easy to create an m x n array, in PL/SQL, it's a little bit more complicated. Thanks to Solomon Yakobson from Oracle Forums for teaching me how to do this:

TYPE    t_number_array IS TABLE OF NUMBER;
TYPE    t_bidimensional_number_array IS TABLE OF t_number_array;
matrix  t_bidimensional_number_array := t_bidimensional_number_array();

FOR i IN 1 .. m + 1
LOOP
  matrix.extend;
  matrix(i) := t_number_array();
  FOR j IN 1 .. n + 1
  LOOP
    matrix(i).extend;
  END LOOP;
END LOOP;

Notice that in PL/SQL, the arrays start at value 1, so in order be able to have a Matrix with dimension [0..m,0..n], the Matrix I used is defined by [1..m+1,1..n+1].

After having the old and new file arrays, and having the LCS Matrix initialized, we can start populating it. The following procedure describes the logic used:

FOR i IN 1 .. m + 1
LOOP
  FOR j IN 1 .. n + 1
  LOOP
    IF old_array(i - 1) = new_array(j - 1) THEN
      matrix(i)(j) := matrix_i(i - 1) (j - 1) + 1;
    ELSE
      matrix(i)(j) := greatest(matrix(i) (j - 1), matrix(i - 1) (j));
    END IF;
  END LOOP;
END LOOP;

I wanted to display the differences line by line, instead of showing just one single output with the added and deleted lines as the one described in Introduction. So my main output is an array t_differences_array which shows the old and new lines side by side.

CREATE OR REPLACE TYPE t_differences AS OBJECT
(
  old_line_number NUMBER,
  old_file        VARCHAR2(4000),
  status          VARCHAR2(20),
  new_line_number NUMBER,
  new_file        VARCHAR2(4000)
);
CREATE OR REPLACE TYPE t_differences_array as table of t_differences;

For example, given the files OLD_FILE{A,D,E,B} and NEW_FILE{A,C,B}, the standard algorithm would display the differences like this:

example1.jpg

While the goal I wanted to achieve was to not add extra lines when they were not necessary, which means changing the output to:

example2.jpg

This goal was achieved by adding extra logic to the get_differences procedure. Mainly by saving the line where the differences started, and then starting to populate the other side differences from this saved line until the next equal line.

In the case where the number of different lines do not match, we will have one of the sides with a blank line number. This will also help distinguish the difference from a non-existing line from an existing blank line.

The following code optimizations are included in the algorithm:

  • Reducing the problem set: Since in real life cases we normally compare files with small modifications, the LCS Matrix can be greatly reduced if we don't take into consideration the beginning and the end of the files where the lines don't change.
  • Reducing the comparison time: When lines are too big, we can reduce the comparison time by hashing the lines and only comparing the hashed value.

Because of the reduced problem set optimization, the real code has three extra parameters defined by the first line where the differences start and the two end lines where the differences stop. These two parameters make the code harder to understand.

Using the code

It includes two main functions for comparing:

FUNCTION compare_line_by_line(old_file_i IN CLOB,
                              new_file_i IN CLOB) RETURN t_differences_array

FUNCTION compare_line_by_line_refcursor(old_file_i IN CLOB,
                                        new_file_i IN CLOB) RETURN SYS_REFCURSOR

where the only difference is the output type.

I included a test table compare_file_test, so you can easily test the program for two different test cases.

Below, you can see an example of how to use the test table to compare two files:

WITH t AS
 (SELECT *
    FROM compare_file_test
   WHERE id = 1)
SELECT a.*
  FROM t,
       TABLE(clob_compare.compare_line_by_line(t.old_file, t.new_file)) a

clob_compare.jpg

What is still missing

I plan to include a few extra options in future versions:

  • Show only different lines
  • Ignore cases
  • Ignore spaces
  • Trim lines

Last notes

I've only tested my implementation for a few test cases, it is possible that some bugs still exist.

I would appreciate suggestions/comments in order to improve and optimize the code.

History

  • Version 1.0: 2011/04/05.

License

This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Manuel Vidigal
Technical Lead
Portugal Portugal
Member
No Biography provided

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
BugOne more bugmemberAndres Tsengov16 Nov '11 - 21:46 
Fragments form the code:
 
PROCEDURE populate_matrix(
/---/
                            old_array_i    IN ws_varchar2_tbl,
                            new_array_i    IN ws_varchar2_tbl,
/---/
    FOR i IN 2 .. (end_line_old_i - start_line_i + 1) + 1
    LOOP
      FOR j IN 2 .. (end_line_new_i - start_line_i + 1) + 1
      LOOP
        -----------------------------------------------
        -- Populate LCS array based on LCS algorithm --
        -----------------------------------------------
        IF old_array_i(i - 1) = new_array_i(j - 1) THEN 
/---/
 
and the call:
    -------------------------
    -- Populate LCS Matrix --
    -------------------------
    populate_matrix(matrix_i       => l_matrix,
                    old_array_i    => l_old_file_array_hashed,
                    new_array_i    => l_new_file_array_hashed,
/---/
 
See how i and j loop through the matrix - that is - only the part of code between the first and the last difference. But look what gets passed to old_array_i and new_array_i - the whole text - the (equal) beginning, the changed part and the (equal) end part. So the matrix gets populated by comparing old and new lines from the beginning, and not from where the differences start. It happens to work when the first line has also changed but does it all wrong otherwise.
So, the line
        IF old_array_i(i - 1) = new_array_i(j - 1) THEN 
should be
        IF old_array_i(i + start_line_i - 1 - 1) = new_array_i(j + start_line_i - 1 - 1) THEN 
and we have a much better program.
 
Thanks a lot for Your effort to implement the algorithm, it saved me a lot of time! The way I use it - I also added the optional parameter to trim the rows and actually use the code twice - in the first go to find the changed rows and in the second go finding the difference in the changed row (by adding a chr(10) after every character in both old and new line).
GeneralRe: One more bugmembermptiga5 Dec '12 - 8:13 
Great, with this fix it works perfekt!
 
Thanks Manuel for the nice package,
thanks Andres for the fix Smile | :)
GeneralBug in the codememberDave_Five13 Apr '11 - 21:33 
If your old clob is shorter than your new clob it errors with the below sometimes:
 
Error at line 1
ORA-06533: Subscript beyond count
ORA-06512: at "USER.CLOB_COMPARE", line 286
ORA-06512: at "USER.CLOB_COMPARE", line 447
ORA-06512: at "USER.CLOB_COMPARE", line 483
ORA-06512: at line 43
 
After quite a bit of playing I found this is the easiest way to replicate, if you pass in the below you get the error:
 
    l_resp1_c := 'blahblah
six';
    l_resp2_c := 'four
five
six';
 
But if you reverse it like below you do not get the error:
 
    l_resp1_c := 'four
five
six';
    l_resp2_c := 'blahblah
six';
 
This also works ( change six to five in l_resp1_c ):
 
    l_resp1_c := 'blahblah
five';
    l_resp2_c := 'four
five
six';
 
Also you are missing the creation of your t_varchar2_array.
GeneralRe: Bug in the codememberManuel Vidigal14 Apr '11 - 5:53 
Thanks for the info Dave.
 
I can confirm the error, I will try to fix it when I get home.
 
And yes, the creation of t_varchar2_array is missing.
GeneralRe: Bug in the codememberManuel Vidigal18 Apr '11 - 1:02 
Hi Dave,
 
I just uploaded version 1.1, which should correct the bug you found. Can you please check if it's ok now?
 

Thanks in advance.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130516.1 | Last Updated 18 Apr 2011
Article Copyright 2011 by Manuel Vidigal
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid