Building A Recommendation Engine - Machine Learning Using Azure, Hadoop And Mahout

Anoop Pillai

Jul 14, 2013

CPOL

Doing some 'Big Data' and building a Recommendation Engine with Azure, Hadoop and Mahout

Feel like helping someone today?

Let us help the Stack Exchange guys suggest questions that a user can answer, based on his answering history, much like the way Amazon suggests products to you based on your previous purchase history. If you don’t know what Stack Exchange does: they run a number of Q&A sites, including the massively popular Stack Overflow.

Our objective here is to see how we can analyze the past answers of a user to predict questions that he may answer in future. Stack Exchange’s current recommendation logic may well work better than ours, but that won’t prevent us from helping them for our own learning purposes ;)

We’ll be doing the following tasks.

Extracting the required information from Stack Exchange data set
Using the required information to build a Recommender

But let us start with the basics. If you are totally new to Apache Hadoop and Hadoop on Azure, I recommend reading these introductory articles before you begin; in them I explain HDInsight and the Map Reduce model in a bit more detail.

Behind the Scenes

Here we go, let us get into some “data science” voodoo first. Cool!! Distributed machine learning is mainly used for

Recommendations - Remember the Amazon recommendations? Normally used to predict preferences based on history.
Clustering - For tasks like grouping together related documents from a set of documents, or finding like-minded people in a community.
Classification - For identifying which category a new item belongs to. This normally involves training the system first, and then asking the system to classify an item.

The term “Big Data” is often used when you need to perform operations on a very large data set. In this article, we’ll be extracting some data from a large data set, and building a Recommender using our extracted data.

What is a Recommender?

Broadly speaking, we can build a recommender either by

Finding questions that a user may be interested in answering, based on the questions answered by other users like him
Finding other questions that are similar to the questions he answered already.

The first technique is known as user based recommendation, and the second technique is known as item based recommendations.

In the first case, taste can be determined by how many questions you answered in common with another user (the questions both of you answered). For example, think about User1, User2, User3 and User4 answering a few questions Q1, Q2, Q3 and Q4. This diagram shows the questions answered by the users.

Based on the above diagram, User1 and User2 answered Q1, Q2 and Q3, while User3 answered Q2 and Q3 but not Q1. Now, to some extent, we can safely assume that User3 will be interested in answering Q1, because two users who answered Q2 and Q3 along with him have already answered Q1. There is some taste matching here, isn’t there? So, if you have an array of {UserId, QuestionId} pairs, that data seems to be enough for us to build a recommender.
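To make the "taste matching" idea concrete, here is a minimal Python sketch that counts how many questions each user has in common with User3. The answers of User1, User2 and User3 follow the description above; what User4 answered is purely an assumption for illustration, and the function names are made up for this sketch.

```python
from collections import defaultdict

# (user, question) pairs. User1-User3 follow the text above;
# User4's answers are an ASSUMPTION made only for this illustration.
ANSWERS = [
    ("User1", "Q1"), ("User1", "Q2"), ("User1", "Q3"),
    ("User2", "Q1"), ("User2", "Q2"), ("User2", "Q3"),
    ("User3", "Q2"), ("User3", "Q3"),
    ("User4", "Q3"), ("User4", "Q4"),
]


def questions_by_user(pairs):
    """Group the flat pair list into {user: set of answered questions}."""
    grouped = defaultdict(set)
    for user, question in pairs:
        grouped[user].add(question)
    return dict(grouped)


def shared_answers(pairs, target):
    """Count the questions every other user has in common with `target`."""
    grouped = questions_by_user(pairs)
    mine = grouped[target]
    return {
        other: len(mine & theirs)
        for other, theirs in grouped.items()
        if other != target
    }


overlap = shared_answers(ANSWERS, "User3")
print(overlap)
```

Under these assumed answers, User1 and User2 each share two questions with User3, which is why the question they answered and he did not (Q1) looks like a good suggestion.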

The Logic Side

Now, how exactly are we going to build a question recommender? In fact, it is quite simple.

First, we need to find the number of times a pair of questions co-occurs across the available users. Note that this matrix has no relation to any particular user. For example, if Q1 and Q2 appear together 2 times (as in the above diagram), the co-occurrence value at {Q1,Q2} will be 2. Here is the co-occurrence matrix (hope I got this right).

Q1 and Q2 co-occur 2 times (User1 and User2 answered both Q1 and Q2)
Q1 and Q3 co-occur 2 times (User1 and User2 answered both Q1 and Q3)
Q2 and Q3 co-occur 3 times (User1, User2 and User3 answered both Q2 and Q3)
Likewise for the remaining pairs.
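The counting described above can be sketched in a few lines of Python. This is only an illustration on a toy data set, not how Mahout implements it; User4's answers are again an assumption, since they are not spelled out in the text.

```python
from collections import Counter, defaultdict
from itertools import combinations

# User1-User3 follow the text; User4's answers are an ASSUMPTION.
ANSWERS = [
    ("User1", "Q1"), ("User1", "Q2"), ("User1", "Q3"),
    ("User2", "Q1"), ("User2", "Q2"), ("User2", "Q3"),
    ("User3", "Q2"), ("User3", "Q3"),
    ("User4", "Q3"), ("User4", "Q4"),
]


def cooccurrence(pairs):
    """Count, for every pair of questions, how many users answered both."""
    by_user = defaultdict(set)
    for user, question in pairs:
        by_user[user].add(question)

    counts = Counter()
    for questions in by_user.values():
        # Sort so that (Q1, Q2) and (Q2, Q1) land on the same key.
        for a, b in combinations(sorted(questions), 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence(ANSWERS)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Note that the user IDs are only used for grouping; the resulting counts say nothing about any individual user.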

The above matrix just captures how many times a pair of questions co-occurred (was answered by the same user), as discussed above. There is no mapping to users yet. Now, how do we relate this to a user’s preference? To find out how closely a question ‘matches’ a user, we just need to

Find out how often that question co-occurs with other questions answered by that user
Eliminate questions already answered by the user.

For the first step, we need to multiply the above matrix by the user’s preference vector.

For example, let us take User3. For User3, the preference mapping against the questions [Q1,Q2,Q3,Q4] is [0,1,1,0], because he already answered Q2 and Q3, but not Q1 and Q4. So, let us multiply the above co-occurrence matrix by this vector. Remember that this is a matrix-vector multiplication: each entry of the result is the dot product of one row of the co-occurrence matrix with the preference vector. The result indicates how often a question co-occurs with the other questions answered by the user (its weightage).

We can omit Q2 and Q3 from the results, as we know that User3 already answered them. Of the remaining questions, Q1 and Q4, Q1 has the higher value (4) and hence the better taste match with User3. Intuitively, this indicates that Q1 co-occurred with the questions already answered by User3 (Q2 and Q3) more often than Q4 did, so User3 will be more interested in answering Q1 than Q4. In an actual implementation, note that the user’s preference vector will be sparse (mostly zeros), as a user will have answered only a very small subset of the questions. The advantage of the above logic is that we can use a distributed map-reduce model for the computation, with multiple map-reduce tasks: constructing the co-occurrence matrix, finding the dot product for each user, and so on.

It may help if you check out my introduction to map-reduce and an example here
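The multiplication and elimination steps can be sketched as follows in Python. The Q1/Q2/Q3 entries of the matrix come from the counts listed earlier; the Q4 row and column and the zero diagonal are assumptions made for illustration, and `score` and `recommend` are hypothetical helper names.

```python
QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]

# Symmetric co-occurrence matrix, rows/columns in QUESTIONS order.
# Q1/Q2/Q3 entries come from the text; the Q4 row/column is an ASSUMPTION,
# and the diagonal is set to zero as a simplification.
CO = [
    [0, 2, 2, 0],
    [2, 0, 3, 0],
    [2, 3, 0, 1],
    [0, 0, 1, 0],
]


def score(co, preferences):
    """Matrix-vector product: one dot product per row of the matrix."""
    return [sum(c * p for c, p in zip(row, preferences)) for row in co]


def recommend(co, preferences, questions):
    """Score all questions, drop those already answered, sort best first."""
    scores = score(co, preferences)
    candidates = [
        (q, s)
        for q, s, p in zip(questions, scores, preferences)
        if p == 0  # eliminate questions the user already answered
    ]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


user3 = [0, 1, 1, 0]  # answered Q2 and Q3
scores = score(CO, user3)
ranked = recommend(CO, user3, QUESTIONS)
print(scores)
print(ranked)
```

With these numbers Q1 scores 4, matching the value quoted above, and ranks ahead of Q4.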

Implementation

From the implementation point of view,

We need to provision a Hadoop Cluster
We need to download and extract the data to analyze (Stack Overflow data)
Job 1 – Extract the Data - From each line, extract {UserId, QuestionId} for all questions answered by the user.
Job 2 – Build the Recommender - Use the output of the Map Reduce job above to build the recommendation model, where possible items are listed against each user.

Let us roll!!

Step 1 - Provisioning Your Cluster

Now remember, the Stack Exchange data is huge. So, we need to have a distributed environment to process the same. Let us head over to Windows Azure. If you don’t have an account, sign up for the free trial. Now, head over to the preview page, and request the HDInsight (Hadoop on Azure) preview.

Once you have HDInsight available, you can create a Hadoop cluster easily. I’m creating a cluster named stackanalyzer.

Once you have the cluster ready, you’ll see the Connect and Manage buttons in your dashboard (Not shown here). Connect to the head node of your cluster by clicking the ‘Connect’ button, which should open a Remote Desktop Connection to the head node. You may also click the ‘Manage’ button to open your web based management dashboard. (If you want, you can read more about HD Insight here)

Step 2 - Getting Your Data To Analyze

Once you are connected to your cluster’s head node using RDP, you may download the Stack Exchange data. You can download the Stack Exchange sites data from Clear Bits, under Creative Commons. I installed the µTorrent client on the head node, and then downloaded and extracted the data for http://cooking.stackexchange.com/. The extracted files look like this: a bunch of XML files.

What we are interested in is the Posts XML file. Each line represents either a question or an answer. If it is a question, PostTypeId=1, and if it is an answer, PostTypeId=2. For an answer, ParentId is the Id of the question being answered, and OwnerUserId is the user who wrote the answer.

<row Id="16" PostTypeId="2" ParentId="2" CreationDate="2010-07-09T19:13:37.540" Score="3"
Body="&lt;p&gt;...shortenedforbrevity... &lt;/p&gt;&#xA;"
OwnerUserId="34" LastActivityDate="2010-07-09T19:13:37.540" />

So, we need to extract {OwnerUserId, ParentId} for all posts where PostTypeId=2 (answers), which is a representation of {User, Question}. The Mahout Recommender Job we’ll be using later will take this data and build a recommendation result.

Now, extracting this data is itself a big task when you consider how large the Posts file can be. For the Cooking site it is not so large, but if you are analyzing the entire Stack Overflow, the Posts file may run to gigabytes. For the extraction itself, let us leverage Hadoop and write a custom Map Reduce job.
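Before moving to Hadoop, the per-line extraction rule can be tried locally with a small Python sketch. It uses only the standard library; `extract_pair` is a hypothetical helper name, and the sample row is the one shown above with the body shortened further.

```python
import xml.etree.ElementTree as ET

# The sample row from above, on a single line as it appears in the dump.
SAMPLE_ROW = (
    '<row Id="16" PostTypeId="2" ParentId="2" '
    'CreationDate="2010-07-09T19:13:37.540" Score="3" '
    'Body="&lt;p&gt;...&lt;/p&gt;&#xA;" '
    'OwnerUserId="34" LastActivityDate="2010-07-09T19:13:37.540" />'
)


def extract_pair(line):
    """Return "OwnerUserId,ParentId" for an answer row, otherwise None."""
    try:
        row = ET.fromstring(line.strip())
    except ET.ParseError:
        return None  # header/footer lines of the dump are not valid rows

    if row.get("PostTypeId") != "2":
        return None  # not an answer

    owner = row.get("OwnerUserId")
    parent = row.get("ParentId")
    if owner is None or parent is None:
        return None  # skip records with missing values
    return "{0},{1}".format(owner, parent)


print(extract_pair(SAMPLE_ROW))
```

Lines that do not parse as a complete XML element (such as the opening and closing tags that wrap the rows) are simply skipped.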

Step 3 - Extracting The Data We Need From the Dump (User, Question)

To extract the data, we’ll leverage Hadoop to distribute the work. Let us write a simple Mapper. As mentioned earlier, we need to figure out {OwnerUserId, ParentId} for all posts with PostTypeId=2. This is because the input for the Recommender Job we’ll run later is {user, item}. For this, first load Posts.xml into HDFS. You may use the hadoop fs command to copy the local file to the specified input path.

Now, time to write a custom mapper to extract the data for us. We’ll be using the Hadoop on Azure .NET SDK to write our Map Reduce job. Note that we are specifying the input folder and output folder in the configuration section. Fire up Visual Studio, and create a C# Console application. If you remember from my previous articles, hadoop fs <yourcommand> is used to access the HDFS file system, and it’ll help if you know some basic *nix commands like ls, cat etc.

Note: See my earlier posts regarding the first bits of HDInsight to understand more about the Map Reduce model and Hadoop on Azure.

You need to install the Hadoop Map Reduce package from the Hadoop SDK for .NET via the NuGet package manager.

install-package Microsoft.Hadoop.MapReduce

Now, here is some code where we

Create A Mapper
Create a Job
Submit the Job to the cluster

Here we go.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using Microsoft.Hadoop.MapReduce;

namespace StackExtractor
{

    //Our Mapper that takes a line of XML input and spits out {OwnerUserId,ParentId}
    //i.e., {User,Question}
    public class UserQuestionsMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            try
            {
                var obj = XElement.Parse(inputLine);
                var postType = obj.Attribute("PostTypeId");
                if (postType != null && postType.Value == "2")
                {
                    var owner = obj.Attribute("OwnerUserId");
                    var parent = obj.Attribute("ParentId");
		   
                    // Write output data. Ignore records with null values, if any
                    if (owner != null && parent != null )
                    {
                        context.EmitLine(string.Format("{0},{1}", owner.Value, parent.Value));
                    }
                }
            }
            catch
            {
                //Ignore this line if we can't parse
            }
        }
    }


    //Our Extraction Job using our Mapper
    public class UserQuestionsExtractionJob : HadoopJob<UserQuestionsMapper>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.DeleteOutputFolder = true;
            config.InputPath = "/input/Cooking";
            config.OutputFolder = "/output/Cooking";
            return config;
        }

       
    }

    //Driver that submits this to the cluster in the cloud
    //And will wait for the result. This will push your executables to the Azure storage
    //and will execute the command line in the head node (HDFS for Hadoop on Azure uses Azure storage)
    public class Driver
    {
        public static void Main()
        {
            try
            {
                var azureCluster = new Uri("https://{yoururl}.azurehdinsight.net:563");
                const string clusterUserName = "admin";
                const string clusterPassword = "{yourpassword}";

                // This is the name of the account under which Hadoop will execute jobs.
                // Normally this is just "Hadoop".
                const string hadoopUserName = "Hadoop";

                // Azure Storage Information.
                const string azureStorageAccount = "{yourstorage}.blob.core.windows.net";
                const string azureStorageKey =
                    "{yourstoragekey}";
                const string azureStorageContainer = "{yourcontainer}";
                const bool createContinerIfNotExist = true;
                Console.WriteLine("Connecting : {0} ", DateTime.Now);

                var hadoop = Hadoop.Connect(azureCluster,
                                            clusterUserName,
                                            hadoopUserName,
                                            clusterPassword,
                                            azureStorageAccount,
                                            azureStorageKey,
                                            azureStorageContainer,
                                            createContinerIfNotExist);

                Console.WriteLine("Starting: {0} ", DateTime.Now);
                var result = hadoop.MapReduceJob.ExecuteJob<UserQuestionsExtractionJob>();
                var info = result.Info;

                Console.WriteLine("Done: {0} ", DateTime.Now);
                Console.WriteLine("\nInfo From Server\n----------------------");
                Console.WriteLine("StandardError: " + info.StandardError);
                Console.WriteLine("\n----------------------");
                Console.WriteLine("StandardOut: " + info.StandardOut);
                Console.WriteLine("\n----------------------");
                Console.WriteLine("ExitCode: " + info.ExitCode);
            }
            catch(Exception ex)
            {
                Console.WriteLine("Error: {0} ", ex);
            }
            Console.WriteLine("Press Any Key To Exit..");
            Console.ReadLine();
        }
    }


}

Now, compile and run the above program. ExecuteJob will upload the required binaries to your cluster, and will initiate a Hadoop Streaming job that runs our mapper on the cluster, with input from the Posts file we stored earlier in the input folder. Our console application will submit the job to the cloud, and will wait for the result. The Hadoop SDK will upload the map reduce binaries to the blob storage, and will build the required command line to execute the job (see my previous posts to understand how to do this manually). You can inspect the job by opening the Hadoop Map Reduce status tracker from the desktop shortcut on the head node.

If everything goes well, you’ll see the results like this.

As you see above, you can find the output in /output/Cooking folder. If you RDP to your cluster’s head node, and check the output folder now, you should see the files created by our Map Reduce Job.

And as expected, the files contain the extracted data, one UserId,QuestionId pair for every question answered by a user. If you want, you can load the data from HDFS to Hive, and then view it with Microsoft Excel using the ODBC driver for Hive. See my previous articles.

Step 4 – Build the recommender And generate recommendations

As a next step, we need to build the co-occurrence matrix and run a recommender job, to convert our {UserId,QuestionId} data to recommendations. Fortunately, we don’t need to write a Map Reduce job for this. We can leverage the Mahout library along with Hadoop. Read about Mahout here.

RDP to the head node of our cluster, as we need to install Mahout. Download the latest version of Mahout (0.7 as of this writing), and copy it to the c:\Apps\dist folder on the head node of your cluster.

Mahout’s Recommender Job supports multiple algorithms for building recommendations. In this case, we’ll be using SIMILARITY_COOCCURRENCE. The Algorithms page of the Mahout website has a lot more information about recommendation, clustering and classification algorithms. We’ll be using the files we have in the /output/Cooking folder to build our recommendations.

Time to run the Recommender job. Create a users.txt file and place the IDs of the users for whom you need recommendations in that file, and copy the same to HDFS.

Now, the following command should start the recommendation job. Remember, we’ll use the output files from our Map Reduce job above as input to the recommender. This will generate output in the /recommend/ folder, for all users specified in the users.txt file. You can use the --numRecommendations switch to specify the number of recommendations you need for each user. If there is a preference relation between a user and an item (like the number of times a user played a song), you could keep the input dataset for the recommender as {user,item,preferencevalue}. In this case, we are omitting the preference weightage.

Note: If the command below fails on a re-run, complaining that the output directory already exists, try removing the temp folder and the output folder using hadoop fs -rmr temp and hadoop fs -rmr /recommend/

hadoop jar c:\Apps\dist\mahout-0.7\mahout-core-0.7-job.jar 
	org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE 
	--input=/output/Cooking 
	--output=/recommend/ 
	--usersFile=/data/users.txt
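For data sets that do carry a preference weight, the {user,item,preferencevalue} input mentioned above could be prepared along these lines. This is a hedged Python sketch: the play events are invented sample data, `to_preference_lines` is a hypothetical helper, and using the play count as the preference value is just one possible choice.

```python
from collections import Counter

# Hypothetical (userId, itemId) events, e.g. one entry per song play.
EVENTS = [
    ("7", "101"),
    ("7", "101"),
    ("7", "102"),
    ("9", "101"),
]


def to_preference_lines(events):
    """Collapse repeated events into "user,item,count" text lines."""
    counts = Counter(events)
    return [
        "{0},{1},{2}".format(user, item, n)
        for (user, item), n in sorted(counts.items())
    ]


lines = to_preference_lines(EVENTS)
print("\n".join(lines))
```

In our Stack Exchange case each {UserId,QuestionId} pair normally appears only once, which is why the two-column form is enough.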

After the job is finished, examine the /recommend/ folder, and try printing the content in the generated file. You may see the top recommendations, against the user Ids you had in the users.txt.

So, the recommendation engine thinks User 1393 may answer the questions 6419, 16897 etc. if we suggest them to him. You could experiment with other similarity classes like SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION etc. to find the best results. Iterate and optimize till you are happy.

As a thought experiment, here is another exercise: examine the Stack Exchange data set, and find out how you might build a recommender that shows ‘You may also like’ questions based on the questions a user has favorited.
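If you want to consume the generated file from code, a small parser helps. The Python sketch below assumes each output line has the form userId, a tab, then a bracketed list of itemId:score entries; check that against your own output before relying on it. The user and question IDs are the ones mentioned above, but the scores in the sample line are made up.

```python
def parse_recommendations(line):
    """Split one output line into (user_id, [(item_id, score), ...])."""
    user, _, items = line.strip().partition("\t")
    recommendations = []
    for entry in items.strip("[]").split(","):
        if not entry:
            continue  # user with an empty recommendation list
        item, _, score = entry.partition(":")
        recommendations.append((int(item), float(score)))
    return int(user), recommendations


# Sample line in the ASSUMED format; the scores are invented.
SAMPLE = "1393\t[6419:2.0,16897:1.5]"

user_id, recs = parse_recommendations(SAMPLE)
print(user_id, recs)
```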

Conclusion

In this example, we did a lot of manual work to upload the required input files to HDFS and to trigger the Recommender Job. In fact, you could automate this entire workflow using the Hadoop on Azure SDK. But that is for another post, stay tuned. Real-life analysis involves much more, including writing mappers and reducers for extracting and dumping data to HDFS, automating the creation of Hive tables, performing operations using HiveQL or Pig, etc. However, we just examined the steps involved in doing something meaningful with Azure, Hadoop and Mahout.

You may also access this data in your Mobile App or ASP.NET Web application, either by using Sqoop to export this to SQL Server, or by loading it to a Hive table as I explained earlier. Happy Coding and Machine Learning!! Also, if you are interested in scenarios where you could tie your existing applications with HD Insight to build end to end workflows, get in touch with me.

Suggested further reading:

My Previous Articles on HDInsight
Mahout In Action – An awesome start if you want to get into the real details.
Mahout Tutorials – A set of good tutorials about mahout