Introduction
Tools to detect duplicate code can add up to code quality effectively. In the context of C#; various tools and options are available to achieve this. Visual Studio’s (Ultimate or Premium) Code Clone Detection and Atomiq’s Code Similarity Finder are few notable options. This tip discusses how to build your own code similarity finder using the principle of Levenshtein distance. Between two strings, it is the measure of single character edits required to change one string to the other. Lesser the value, more similar are the two strings and vice versa. The tip also employs Roslyn, the .NET complier platform to parse the target code. The tool presented here builds a “similarity matrix” for a target C# class, which is essentially a square matrix, who’s each row is the Levenshtein distances between the codes of a method and all methods in the class. The diagonal of the matrix will have 0 entries as it would be distance between the same codes of the same method. The following table depicts the similarity matrix:
Code Similarity Matrix
| |
Method 1
|
Method 2
|
…
|
Method n
|
|
Method 1
|
0
|
10
|
|
2
|
|
Method 2
|
10
|
0
|
|
12
|
|
…
|
|
|
|
|
|
Method n
|
2
|
12
|
|
0
|
Here, entries like [1,2]; [2,1], … [1,n];[n,1] essentially have the same values as they are representing the distance between the same pair of strings. This matrix gives a snapshot of the code similarity in a C# class with the entries of lesser values indicating code similarity or duplicate code. The tool discussed here writes this matrix into an Excel file for convenience.
Background
The aim of this tip is not to discuss the implementation of the Levenshtein distance algorithm, but to show how it can effectively be applied for finding code similarities. A sample implementation from the dotnetperls.com is picked and the code is re-used. It also not the intent of the tip to discuss the performance implications of a certain implementation of the algorithm.
Using the Code
The first requirement of the tool is to provide the ability to parse the code of the target class to find all methods. The MethodsAnalyzer class provides this infrastructure. The method Parse of this class takes the string of the code of the target class and uses the Roslyn’s Syntax tree to get all the methods. It builds two dictionaries: indexMap and methodsCache which indexes the method name and method code respectively. These dictionaries are used to build the Code Similarity matrix later.
public void Parse(string code)
{
var tree = CSharpSyntaxTree.ParseText(code);
var syntaxRoot = tree.GetRoot();
var methods = syntaxRoot.DescendantNodes().OfType<MethodDeclarationSyntax>();
var index = 0;
foreach (var method in methods)
{
var methodName = method.Identifier.ToString();
var body = method.Body.ToString();
indexMap[index] = methodName;
methodsCache[index++] = body;
}
}
The Consolidate method loops through the methodsCache to build Code Similarity matrix similarityMatrix. It uses the LevenshteinDistance class’s Compute method which takes the strings of the method bodies of two methods and computes the distance. The Consolidate method does this for each method versus all methods in the class to build the Code Similarity matrix.
public void Consolidate()
{
similarityMatrix = new int[methodsCache.Count, methodsCache.Count];
foreach (var index in methodsCache.Keys)
{
var referenceCode = methodsCache[index];
var allCodes = methodsCache.Values.ToList();
for (var j = 0; j < allCodes.Count; j++)
{
similarityMatrix[index, j] = LevenshteinDistance.Compute(referenceCode, allCodes[j]);
}
}
}
The Driver project uses the CodeAnalyzer project to build the Code Similarity matrix for a target class and then writes it to an Excel file. The sample code is written in the Program class of the Driver project. The Main method first calls the GenerateSimilarityMatrix method which in turn uses the MethodsAnalyzer’s Parse and Consolidate methods to build the Code Similarity matrix. Then the Main method calls the GenerateReport method to write the matrix into an Excel file. The details of the GenerateSimilarityMatrix and GenerateReport are omitted in the discussion and left in the source code with detailed documentation for the reader.
static void Main(string[] args)
{
var program = new Program();
var methodsAnalyzer = new MethodsAnalyzer();
program.GenerateSimilarityMatrix(args[0], methodsAnalyzer);
program.GenerateReport(args[1], methodsAnalyzer);
Console.WriteLine("... Analysis completed. Press any key to exit");
Console.ReadKey();
}
The source code also includes an example target class ContactsProcessor for which the analysis is performed. The target class is hosted in a separated project called Target. Following is a sample out of the tool run:
| |
Add |
Remove |
GetAddresses |
SendMails |
SendMail |
Add |
0 |
35 |
177 |
160 |
215 |
Remove |
35 |
0 |
176 |
154 |
211 |
GetAddresses |
177 |
176 |
0 |
69 |
122 |
SendMails |
160 |
154 |
69 |
0 |
144 |
SendMail |
215 |
211 |
122 |
144 |
0 |
The analysis shows that methods Add; Remove and GetAddresses; SendMails
have relatively low values of the computed Levenshtein distance and thus are indicative of having similar code.
Points of Interest
The tool just handles a single class. The future work can be to extend it for a project and then for the whole solution. Also, the computed distances can be scaled relative to the maximum in the matrix, which would provide better readability and quick identification of similarity between methods if the matrix is sufficiently large.
History
- 25th January, 2016: Initial version