Blazing Fast Source Code Search in the Cloud





This blog post shows how you can leverage dtSearch to perform fast searches of data safely stored in the Microsoft Azure cloud.
Introduction
Using dtSearch and the techniques in this article will make your data searches lightning fast, making it possible to search terabytes of data with sub-second response time.
But first, two preliminary notes about this blog post. (1) The blog post describes source code data, but the same approach would apply to other data stored in the Microsoft Azure cloud: HTML, XML, MS Office documents -- even email data. (2) While the data in this blog post resides in the Microsoft Azure cloud, the indexes are on a local PC. A subsequent article will address data and indexes in the cloud.
Here is the work plan for the overall project:
In part one of this article we are going to go to the Azure portal and provision the storage account. Naturally, the assumption is that you have signed up for an Azure account. If you have not, it's relatively easy to sign up for a free trial, so you can see if it meets your needs before you commit your money.
Once you provision your storage account, access keys will be automatically generated. These access keys will be copied into our Visual Studio project, because they are the secret keys that give privileged access to your storage account, the place where we’re going to copy the source code to be indexed and later searched.
Part two of this article will show you where we can get the Visual Studio solution with the starter code. This solution will dramatically reduce the amount of work we actually have to do to implement this useful source code searching application. If you install the full edition of the dtSearch Engine, the starter project actually gets installed in your program files folder.
We will be using Visual Studio 2013 with the latest updates. We will also install the latest Azure Storage SDK binaries.
It's in part three where the real work starts. What we want to do here is build the capability to upload your source code into your storage account. There are various utilities that you can download to perform the task of uploading source code to your storage account, but it will be far more convenient if we can build this into our main searching application. Once we finish this retrofit and upgrade, we can then run the application to upload the source code, index it, and then move to part four of our work plan.
Part four will be fast and easy because we will be pretty much done with the difficult work. Part four is about testing and packaging our application. The index files that get generated could be copied to other client computers. That means we can copy the application along with the generated index files to any computer to perform lightning fast source code searches.
Part 1 - Provisioning at the Azure portal
Provisioning the storage account is actually quite simple. At the time of this writing the traditional Azure portal is the place to go. But after the first week of May 2015, Microsoft will release the new portal.
Portal to 5/5/2015 | http://manage.windowsazure.com/ |
New Portal after 5/5/2015 | http://portal.azure.com/ |
Once you log into the Azure portal, it's a simple matter of navigating to the STORAGE menu item and clicking NEW.
A QUICK CREATE menu item will become visible. Click on that to continue.
At this point you are ready to provide the URL, location, and the replication mode. The URL you come up with needs to be globally unique. As you can see "mysourcecode" was not taken. I chose "East US" for my location, but you can choose from among the world’s data centers. A closer data center means lower latency. You can read about replication options here: http://blogs.msdn.com/b/windowsazurestorage/archive/2013/12/11/introducing-read-access-geo-replicated-storage-ra-grs-for-windows-azure-storage.aspx.
When you are done, click CREATE STORAGE ACCOUNT in the lower right corner. It should take less than five minutes to provision your storage account. It took less than a minute for me when I did it.
When the portal indicates that your storage account is ONLINE, you are ready to move forward. Click on the small arrow that's pointing right to drill into the details of this newly provisioned storage account.
You are now ready to copy access keys to the clipboard. Click on MANAGE ACCESS KEYS.
Click on the icon of the red box to copy the PRIMARY ACCESS KEY into your clipboard and store it in a safe place along with the STORAGE ACCOUNT NAME. Both your STORAGE ACCOUNT NAME and your PRIMARY ACCESS KEY will be different from what you see here.
Storage Account Name | mysourcecode |
Primary Access Key | CnQ6dUXdOQ81qSCFJhscuB3PCNM92o4bIuDoKG7mO7tJ1imxa5sMkzKtnghsG11EwKgxRaTW5g6fFKRcXZ8z6g== |
Part 2 - Locating the starter project
The starter project that ships with the dtSearch Engine can be found under the program files folder here:
- C:\Program Files (x86)\dtSearch Developer\examples\cs4\AzureBlobDemo\AzureBlobDemo.sln
The starter project provides an excellent starting point for us to begin our work. Be sure you are using Visual Studio 2013 with all the latest updates installed.
The project should open up seamlessly, but we want to be sure we have the latest Azure Storage binaries installed. We will right-click in Visual Studio's Solution Explorer and select Manage NuGet Packages.
In the upper right search box, type in "Azure Storage." As you would expect, this brings up the Windows Azure Storage client library, which we are going to use to read from and write to the Windows Azure Storage account that we provisioned in Part 1.
In Visual Studio Solution Explorer you can expand the references node to validate that we have the storage client libraries installed.
Part 3 - Adding the storage account connection information to app.config
Now is a good time to copy the storage account information into your app.config file. The app.config file provides a convenient location that is globally accessible to your application. It will be accessed at run time. It is not appropriate to ask users to continually provide the connection information every time they use the application.
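At run time, the application combines the two settings defined in app.config (shown next) into a single storage connection string. The following is a minimal sketch of that step with some validation added, so that a missing setting fails early with a clear message. The class name StorageConnection and the method name Build are hypothetical and not part of the starter project; the connection string format is the one used by the article's own code.

```csharp
using System;

// Hypothetical helper (not part of the dtSearch starter project).
public static class StorageConnection
{
    // Combines an account name and access key into a connection string,
    // rejecting missing values instead of producing an unusable string.
    public static string Build(string accountName, string accessKey)
    {
        if (string.IsNullOrWhiteSpace(accountName))
            throw new ArgumentException("StorageAccountName is missing.", "accountName");
        if (string.IsNullOrWhiteSpace(accessKey))
            throw new ArgumentException("AccessKey is missing.", "accessKey");

        return string.Format(
            "DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
            accountName.Trim(), accessKey.Trim());
    }
}
```

In the application, the two arguments would come from ConfigurationManager.AppSettings["StorageAccountName"] and ConfigurationManager.AppSettings["AccessKey"], which return null when a key is absent from app.config.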
Modifying App.Config
<?xml version="1.0"?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.0"/>
  </startup>
  <appSettings>
    <add key="StorageAccountName" value="mysourcecode"/>
    <add key="AccessKey" value="CnQ6dUXdOQ81qSCFJhscuB3PCNM92o4bIuDoKG7mO7tJ1imxa5sMkzKtnghsG11EwKgxRaTW5g6fFKRcXZ8z6g=="/>
  </appSettings>
</configuration>
Options for encryption
If you would like to encrypt this information, there are several options here:
- https://social.msdn.microsoft.com/Forums/vstudio/en-US/40fd141d-ddbd-4228-8020-df2e3275c8f6/how-to-encrypt-password-in-connection-string?forum=vbgeneral
- https://social.msdn.microsoft.com/Forums/vstudio/en-US/9fc80a8a-13f2-470f-b295-ec55a5bf4931/how-to-encrypt-application-settings-in-appconfig?forum=vbgeneral
Adding support to upload source code to your Azure Storage Account
Our next task is to enhance the starter project to enable source code uploads. Adding this capability directly into the application will dramatically improve usability. In this section, we will add a command button and then write some code.
Here's what the application looks like before our changes. This is MainForm.cs.
We will now add a third button, as seen below. Name the button cmdAddCode and set its Text property (the caption) to "Add source code to Azure Storage". You will need to move the Index and Search buttons down a little to make room for the new button.
From the designer, double-click the Add source code to Azure Storage button to open its click handler in the code editor.
We will now add some code that will provide the ability to upload source code.
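Next, add a reference to System.Configuration using the Add Reference dialog. Be sure the check box inside the red box is checked before clicking OK.
Be sure that the top of MainForm.cs has using statements for the namespaces the new code relies on. Add any of the following that are not already present:
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;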
Modifying MainForm.cs
private void cmdAddCode_Click(object sender, EventArgs e)
{
    string windowTitle = this.Text;
    try
    {
        // Ask the user for the folder to upload; stop if the dialog is cancelled
        string selectedFolder;
        using (FolderBrowserDialog fDialog = new FolderBrowserDialog())
        {
            if (fDialog.ShowDialog() != DialogResult.OK)
            {
                return;
            }
            selectedFolder = fDialog.SelectedPath;
        }
        // Build the connection string from the values stored in app.config
        string storageAccountName = ConfigurationManager.AppSettings["StorageAccountName"];
        string accessKey = ConfigurationManager.AppSettings["AccessKey"];
        string connString = string.Format("DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
            storageAccountName, accessKey);
        // Parse the connection string and create a client
        var storageAccount = CloudStorageAccount.Parse(connString);
        CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
        // Collect every file under the selected folder and its sub-folders
        List<FileInfo> filesToUpload = new List<FileInfo>();
        RecursiveFileUpload(selectedFolder, filesToUpload, "*.*");
        // Upload at most four files at a time
        var fileUploadParallelism = new ParallelOptions() { MaxDegreeOfParallelism = 4 };
        // Create the target container the first time through
        string blobContainerName = "code";
        CloudBlobContainer container = blobClient.GetContainerReference(blobContainerName);
        container.CreateIfNotExists();
        Parallel.ForEach(filesToUpload, fileUploadParallelism, currentFileInfo =>
        {
            // Flatten the local path into a single blob name by replacing backslashes
            string cloudFileNamePath = currentFileInfo.FullName.Replace(@"\", @"_");
            // Drop a leading slash, if any
            if (cloudFileNamePath.StartsWith("/"))
            {
                cloudFileNamePath = cloudFileNamePath.Substring(1);
            }
            try
            {
                var blobFileToUpload = container.GetBlockBlobReference(cloudFileNamePath);
                ShowTitle("Uploading..." + currentFileInfo.Name);
                // Skip files that are already in the container
                if (!blobFileToUpload.Exists())
                {
                    blobFileToUpload.UploadFromFile(currentFileInfo.FullName, FileMode.Open, null, null, null);
                }
            }
            catch (Exception exception)
            {
                MessageBox.Show("Issue with blob upload = " + exception.Message);
            }
        });
    }
    finally
    {
        // Restore the original window title
        this.Text = windowTitle;
    }
}
delegate void StringParameterDelegate(string value);
public void ShowTitle(string value)
{
    if (InvokeRequired)
    {
        // We're not in the UI thread, so we need to call BeginInvoke
        BeginInvoke(new StringParameterDelegate(ShowTitle), new object[] { value });
        return;
    }
    // Must be on the UI thread if we've got this far
    this.Text = value;
}
// Despite its name, this method only collects the files; the upload happens in cmdAddCode_Click.
// The search_type parameter is currently unused.
private List<FileInfo> RecursiveFileUpload(string sourceDir, List<FileInfo> filesToCopy, string search_type)
{
    if (!sourceDir.EndsWith(Path.DirectorySeparatorChar.ToString()))
    {
        sourceDir += Path.DirectorySeparatorChar;
    }
    // Walk the sub-folders first
    try
    {
        foreach (string sDir in Directory.GetDirectories(sourceDir))
        {
            RecursiveFileUpload(sDir, filesToCopy, search_type);
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show("Issue with RecursiveFileUpload " + ex.Message);
    }
    // Then collect the files in this folder
    try
    {
        foreach (string sFile in Directory.GetFiles(sourceDir))
        {
            // Blob names are limited to 1,024 characters, so skip longer paths
            if (sFile.Length >= 1024)
                continue;
            try
            {
                filesToCopy.Add(new FileInfo(sFile));
            }
            catch (System.IO.IOException ex)
            {
                MessageBox.Show("Skipping " + sFile + " because of " + ex.Message);
            }
        }
    }
    catch (System.Exception ex)
    {
        // Includes UnauthorizedAccessException for folders we cannot read
        MessageBox.Show("Skipping " + sourceDir + " because of " + ex.Message);
    }
    return filesToCopy;
}
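The upload handler turns each local path into a blob name by replacing backslashes with underscores, and the folder walk skips paths that would exceed the 1,024-character blob name limit. As a minimal sketch, the same two rules can be isolated in a small helper that is easy to test without touching the network. The names BlobNaming, ToBlobName, and FitsBlobNameLimit are hypothetical and do not exist in the starter project.

```csharp
using System;

// Hypothetical helper (not part of the dtSearch starter project).
public static class BlobNaming
{
    // Azure blob names may be between 1 and 1,024 characters long.
    public const int MaxBlobNameLength = 1024;

    // Flattens a Windows path into a single blob name, as the article's
    // upload code does: backslashes become underscores and a leading
    // forward slash, if present, is dropped.
    public static string ToBlobName(string fullPath)
    {
        if (string.IsNullOrEmpty(fullPath))
            throw new ArgumentException("A file path is required.", "fullPath");

        string name = fullPath.Replace('\\', '_');
        if (name.StartsWith("/", StringComparison.Ordinal))
            name = name.Substring(1);
        return name;
    }

    // Returns true when the name length is within the blob name limit.
    public static bool FitsBlobNameLimit(string blobName)
    {
        return blobName.Length >= 1 && blobName.Length <= MaxBlobNameLength;
    }
}
```

One trade-off of this flattening is that two different paths can map to the same blob name. For example, a_b\c and a\b_c both become a_b_c, and because the upload skips blobs that already exist, the second file would not be uploaded.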
Some of the code in the Rewind() method of the BlobDataSource.cs file needs updating.
// Fixes for BlobDataSource.cs
//
public override bool Rewind()
{
    // Check connection interaction success flag. If an earlier attempt to
    // connect to the storage failed, then method will not be successful.
    if (_isStorageFailed)
        return false; // failure code - no documents to read
    // Setup the connection to Windows Azure Storage
    try
    {
        // Parse the connection string and create a client
        var storageAccount = CloudStorageAccount.Parse(_connectionString);
        _blobClient = storageAccount.CreateCloudBlobClient();
        // Create (or re-create) the blob table
        _blobTable = new Dictionary<string, List<string>>();
        // Add all files into the blob table using the container name as the key
        foreach (CloudBlobContainer container in _blobClient.ListContainers())
        {
            // Get the BlobTable key: the container name
            string containerName = container.Name;
            // Get the BlobTable value: a list of blob URIs
            List<string> blobURIs = new List<string>();
            // List blobs and directories in this container
            var blobs = container.ListBlobs();
            // FIX: Used to be foreach (CloudBlob blob in container.ListBlobs())
            foreach (var blobItem in blobs)
            {
                blobURIs.Add(blobItem.Uri.ToString());
                //System.Diagnostics.Debug.WriteLine(blobItem.Uri.ToString());
            }
            // Add the entry to the BlobTable
            _blobTable.Add(containerName, blobURIs);
        }
        // Initialize iterators; fail if not successful
        if (!ResetIterators())
        {
            _isStorageFailed = true;
            return false;
        }
        // Set success
        _isStorageFailed = false;
        return true;
    }
    catch (Exception)
    {
        // Add diagnostic code here if desired
        // Set failure
        _isStorageFailed = true;
        return false;
    }
}
We also modify the constructor in AskConnectForm.cs (this file needs using System.Configuration; at the top as well).
With this change, the form always loads the connection string from app.config, so the user doesn't have to type it in each time. Ideally, we could write some code to bypass the AskConnectForm form completely, but I'm trying to avoid too many modifications to keep this post straightforward.
public AskConnectForm()
{
    //
    // Required for Windows Form Designer support
    //
    InitializeComponent();
    // Add the code below
    string storageAccountName = ConfigurationManager.AppSettings["StorageAccountName"];
    string accessKey = ConfigurationManager.AppSettings["AccessKey"];
    string connString = string.Format("DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
        storageAccountName, accessKey);
    this.ConnectString.Text = connString;
}
Part 4 - Testing
We are now ready to start testing the application that we just updated. One thing that might be of interest is to verify that we correctly updated our storage account with the source code. I ran the application once and uploaded source code to Azure Storage, as seen in the picture below.
You can download the Azure Storage Explorer for free at the following URL:
http://azurestorageexplorer.codeplex.com/
Once you've installed and configured Azure Storage Explorer, you can go and browse the containers for whatever source code you may have previously uploaded. It also allows you to delete the content should you want to do so.
Although we are adding source code, you can pretty much add any file, whether those are Word documents or PowerPoint. dtSearch will automatically index many different types of documents.
By the way, the upload code shown earlier processes files in parallel rather than asynchronously: Parallel.ForEach blocks until every file has been handled. The developer can control the level of concurrency depending on network and system resources.
See the code snippet:
var fileUploadParallelism = new ParallelOptions() {MaxDegreeOfParallelism = 4};
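To see what this setting does in isolation, here is a minimal sketch that runs a batch of simulated work items and records how many were ever in flight at once. The class ParallelismDemo and the method MeasurePeakConcurrency are hypothetical names used only for illustration, and Thread.Sleep stands in for an upload.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo (not part of the dtSearch starter project).
public static class ParallelismDemo
{
    // Runs itemCount simulated work items and returns the highest number
    // that were executing at the same moment.
    public static int MeasurePeakConcurrency(int itemCount, int maxDegree)
    {
        int current = 0;
        int peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(Enumerable.Range(0, itemCount), options, item =>
        {
            int now = Interlocked.Increment(ref current);

            // Raise the recorded peak if this thread observed a higher count
            int seen;
            do
            {
                seen = Volatile.Read(ref peak);
                if (now <= seen) break;
            }
            while (Interlocked.CompareExchange(ref peak, now, seen) != seen);

            Thread.Sleep(10); // stand-in for uploading one file
            Interlocked.Decrement(ref current);
        });

        return peak;
    }
}
```

Calling ParallelismDemo.MeasurePeakConcurrency(20, 4) should never report more than 4, whatever the core count of the machine. A higher limit may help on a fast connection, while a lower one reduces pressure on a slow link.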
Click the highlighted button to upload source code to your Azure Storage account.
You can repeat this process, selecting a different folder each time. All the files in the folder you pick (and its sub-folders) will be uploaded to Windows Azure Storage.
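The folder walk behind this step can also be expressed with the framework's built-in recursive enumeration. The following is a minimal sketch of that alternative, not the starter project's code; SourceFileCollector and Collect are hypothetical names. Unlike the hand-written recursion, it applies the search pattern, but it stops with an exception at the first folder it cannot read instead of skipping that folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical alternative to the article's recursive helper.
public static class SourceFileCollector
{
    // Returns every file under rootDir (including sub-folders) that matches
    // searchPattern, skipping paths too long to be used as blob names.
    public static List<FileInfo> Collect(string rootDir, string searchPattern)
    {
        var files = new List<FileInfo>();
        foreach (string path in Directory.EnumerateFiles(
            rootDir, searchPattern, SearchOption.AllDirectories))
        {
            // Blob names are limited to 1,024 characters
            if (path.Length >= 1024)
                continue;
            files.Add(new FileInfo(path));
        }
        return files;
    }
}
```

For example, SourceFileCollector.Collect(selectedFolder, "*.cs") would restrict an upload to C# source files, where selectedFolder is the path chosen in the folder dialog.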
When the index is created it will need a location to store the index files.
Enter a valid location and then hit the Index button.
We already entered the necessary code above to populate this dialog box with the appropriate connection string. You can just hit OK on this dialog box.
You will click on two buttons in this dialog box. The first button is Index an Azure storage account. The second button is Search.
Our work is complete. You can now get lightning-quick results when searching your Azure Storage account for keywords.
If you decide to add more source code to the Azure Storage account, you will need to regenerate the indexes.
Conclusion
You can now search literally terabytes of source code and get instant search results. One of the core advantages here is that you don't have to store all the source code locally on your own laptop or desktop computer. All the source code can be securely stored up in your Azure Storage account, available only to those that have the access keys.
Other Resources
- Faceted Search with dtSearch (using SQL and .NET)
http://www.codeproject.com/Articles/756185/Faceted-Search-with-dtSearch-Not-Your-Average-Sear
- Turbo Charge your Search Experience with dtSearch and Telerik UI for ASP.NET
http://www.codeproject.com/Articles/769086/Turbo-Charge-your-Search-Experience-with-dtSearch
- A Search Engine in Your Pocket: Introducing dtSearch on Android
http://www.codeproject.com/Articles/824413/A-Search-Engine-in-Your-Pocket-Introducing-dtSearc