Thursday, September 23, 2010

VinWiki Part 4: Making Recommendations with Mahout

Introduction

This is the final post in a four part series about a wine rating and recommendation Web application built using open source Java technology. The purpose of this series is to document key design and implementation decisions that can be applied to other Web applications. Please read the first, second, and third posts to get up-to-speed. You can download the project (with source) from here.

In this posting, I lay the foundation for making recommendations using Apache Mahout v. 0.3. For a thorough introduction to Mahout, I recommend Mahout in Action.

Collaborative Filtering in Mahout
Mahout's main goal is to provide scalable machine-learning libraries for classification, clustering, frequent itemset mining, and recommendations. Classification assigns a category (or class) from a fixed set of known categories to an un-categorized document. For example, some feed readers assign articles to broad categories like Sports or Politics using classification techniques. Clustering assigns documents to groups of similar documents using some notion of similarity between the documents in the group. For example, Google News uses clustering to group articles from different publishers that cover the same basic story. Frequent itemset mining determines which items, such as products in a shopping cart, typically occur together.

In this posting, I leverage the collaborative filtering features of Mahout to make wine recommendations based on ratings by VinWiki users. Collaborative filtering produces recommendations based on user preferences for items and does not require knowledge of the specific properties of the items. In contrast, content-based recommendation produces recommendations based off of intimate knowledge of the properties of items. This implies, of course, that content-based recommendation engines are domain-specific, whereas Mahout's collaborative filtering approach can work in any domain provided it has sufficient user-item preference data to work with.

For VinWiki, I experimented with three basic types of Mahout Recommenders:

  1. User Similarity
  2. Item Similarity
  3. SlopeOne
Check out the Mahout Web site for information about other more experimental recommenders, such as one based on Singular value decomposition (SVD).


To decide which one of these recommenders is best for your application, you need to consider four key questions:
  1. How to represent a user's preference for an item?
  2. What is the ratio of items to users?
  3. How do you determine the similarity between users or between items?
  4. If using UserSimilarity, what is the size of a user neighboorhood?
As I'll demonstrate below, Mahout provides a framework to allow you to answer these questions by analyzing your data.

User Preferences
What constitutes a user preference for an item in your application? Is it a boolean "like" or "dislike" or does the preference have a strength, such as "I like Chardonnay but like Sauvignon Blanc better"? The structure of user-item preference data used in Mahout is surprisingly simple: userID, itemID, score, where score represents the strength of the user's preference for the item, see org.apache.mahout.cf.taste.model.Preference. There are two concrete implementations of the Preference interface in Mahout: BooleanPreference and GenericPreference. For VinWiki, I use GenericPreference because I chose to allow users to give a score for a wine.

Basic Structure of a UserSimilarity Recommender
Let's take a look at the basic approach Mahout takes to make a UserSimilarity based recommendation using VinWiki nomenclature:
1: For all wines W that user A has NOT expressed a preference for
2:   For every other user B (in A's neighborhood) that has expressed a preference for W
3:     Compute the similarity S between user A and B
4:       Add the User B's preference X for W weighted by S to a running average preference
5: Sort Wines by weighted average preference
6: return top R wines from sorted collection as recommendations

Intuitively, this approach makes sense. From the pseudo-code above, it should be clear that we need a way to calculate the similarity S between two Users A and B, which is represented in Mahout as a org.apache.mahout.cf.taste.similarity.UserSimilarity. Also, notice that the algorithm weights recommendations by user similarity, which means that the more similar a user is to you, the more heavily their preferences count in making recommendations. Consequently, the selection of the similarity calculation is very important to making good recommendations. Mahout provides a number of concrete implementations if the UserSimilarity interface, see the org.apache.mahout.cf.taste.impl package.

In practice, most systems that need to produce recommendations have many users and calculating a similarity between all users is too computationally expensive. Thus, Mahout uses the concept of a user neighborhood to limit the number of similarity calculations to a smaller subset of similar users. This introduces another question that needs to be answered when building your recommender: What is the optimal size of the user-neighborhood for my data?

Mahout also allows you to make recommendations based on similarity between Items. Don't confuse Mahout's Item-based recommender with content-based recommenders since it is still based on user-item interactions and not the content of items.

Using Mahout in VinWiki
The main service for creating recommendations at runtime is the MahoutWineRecommender, which is an application-scoped Seam component. The MahoutWineRecommender has two dependencies injected during initialization:
  • DataModelProvider
  • RecommenderConfig

DataModel
The org.vinwiki.recommender.DataModelProvider Seam component (configured in components.xml) provides a Mahout DataModel to the recommender. For now, I'm using Mahout's FileDataModel, which as you might have guessed, reads preference data from a flat file. During startup, if this file doesn't exist, then the DataModelProvider reads wine ratings from the database and writes them to a new file.

Sample Wine Ratings Data
As this is just an example Web application, I don't have real wine ratings data. Consequently, I generated some fake data that recommends wines to sample users based on the first letter of the user name. For example, the data will cause Mahout to recommend wines that start with the letter "A" to the "A_test0" user. Here is some log output to demonstrate how the sample ratings data works:
NearestNUserNeighborhood[3,0.6,0.8,EuclideanDistanceSimilarity] 
  recommended [1887, 286, 1120, 1350, 520, 1905] wines to A_test0
     Neighbor(43) A_test30
         rated Wine 1120 91.0 pts
         rated Wine 1350 87.0 pts
     Neighbor(33) A_test20
         rated Wine 1887 88.0 pts
         rated Wine 1350 88.0 pts
     Neighbor(63) A_test50
         rated Wine 1350 90.0 pts
Notice that A_test0's neighbor's user names also start with "A_". When I created the sample ratings data, I had users rate wines that begin with the same letter a little higher than they rated other wines. You can try this yourself after deploying the application to JBoss 4.2.3 by logging in with username "A_test0" and password "P@$$w0rD" (without the quotes of course).

Refreshing the Recommender
When I first started working with Mahout, it wasn't clear how to handle data model changes at runtime because most of the built-in Mahout examples work with static, pre-existing datasets. In VinWiki, rating wines is a primary activity, so preference data will be changing frequently. Moreover, if a user provides several new ratings in a session, then they'll expect to have some recommendations based on those new ratings or they will think the site is broken and probably not return. Consequently, it's very important for this application to incorporate recent user activity into recommendations in near real-time.

Whenever a user rates a wine, the ratingHome component will raise the App.WINE_RATED_BY_USER event. The MahoutWineRecommender component observes this event and passes it to the DataModelProvider.

@Observer(App.WINE_RATED_BY_USER)
@Asynchronous
public void onWineRatedByUser(Rating r) {
    // Let the model provider know that data has changed ...
    if (dataModelProvider.updateDataModel(r.getUser().getId(), r.getWine().getId(), r.getScore())) {
        // provider indicates that we should refresh the recommender
        recommender.refresh(null);
    }
}
In response to this event, the DataModelProvider component can choose to update its internal state to reflect the change. In my current implementation, the DataModelProvider uses a nice feature provided by Mahout's FileDataModel by writing updates to a smaller "delta" file. The FileDataModel will load these additional "delta" files when it is refreshed. So that covers updating the DataModel, but what about the Recommender and its other dependencies, such as UserSimilarity and UserNeighborhood? In my implementation, the DataModelProvider makes the decision of whether the Recommender should be refreshed. This allows a more sophisticated DataModelProvider implementation to batch up changes so that the recommender is not refreshed too often as refreshing a recommender and its dependencies can be an expensive operation for large data sets.

Accounting for User Preferences
Users can de-select wines they are not interested in using the Preferences dialog. Changes to a user's preferences should be reflected in recommendations. For example, if a user indicates that they are not interested in white wine, then we should not recommend any white wines to them. Mahout allows you to inject this type of filtering on recommendations using an org.apache.mahout.cf.taste.recommender.IDRescorer.

In VinWiki, filtering recommendations by preferences is provided by the org.vinwiki.recommender.PreferencesIDRescorer class. If you revisit the pseudo-code above, then it should be obvious that the IDRescorer may need to evaluate the filter on a large number of wines. Thus, the IDRescorer should be implemented in an efficient manner; I used the Lucene native API to iterate over all wines to build and cache a Mahout FastIDSet of wine Ids that can be recommended to the current user.

// Using Lucene to initialize a Mahout FastIDSet for rescoring
int maxDoc = reader.maxDoc();
for (int docId = 0; docId < maxDoc; docId++) {
    if (reader.isDeleted(docId))
        continue;

    try {
        doc = reader.document(docId, getFieldSelector());
    } catch (Exception zzz) {
        ...
    }
    if (doc == null)
        continue;

    Long wineId = new Long(doc.get(ID));
    String type = doc.get(TYPE);
    String style = doc.get(STYLE);
    Long regionId = new Long(doc.get(REGION));

    // ask the User's Preferences object if this wine is enabled
    if (prefs.checkWineFilter(wineId, type, style, regionId)) {
        idSet.add(wineId);
    }
}

There is one subtle aspect to the current implementation in that it does not refresh during the user's session as new wines are added to the search index. In other words, you are not going to see any code that tries to update the rescorer after new wines are added to the system. Remember that our recommendations are based on user-item interactions and new wine objects are not going to have enough (if any) ratings to impact the current user's session. However, the rescorer is refreshed if the user changes their preferences.

Tip: using an IDRescorer is a simple form of content-based recommendations in that we're using specific properties of the wine objects to influence our recommendations.

RecommenderConfig
org.vinwiki.recommender.RecommenderConfig is an application-scoped Seam component (configured in components.xml) that supports common options for configuring the behavior of a Recommender.

At startup, the MahoutWineRecommender uses the DataModel and RecommenderConfig to initialize a Recommender. The Recommender is held in application-scope because it is expensive to build and should be re-used for all recommendation requests from FetchRecommended objects (see Server-side Pagination from the first posting in this series). The following code snippet gives you an idea of how to construct a User-based recommender with Mahout:

// see RecommenderConfig.java
    UserSimilarity userSimilarity = createUserSimilarity(dataModel);
    UserNeighborhood neighborhood = createUserNeighborhood(userSimilarity, dataModel);
    return new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);

Here is an example configuration from components.xml. NOTE: You must set the fileDataModelFileName to a valid path on your server before running the sample!

<component name="dataModelProvider" auto-create="true" scope="application" 
        class="org.vinwiki.recommender.DataModelProvider">
    <property name="fileDataModelFileName">/home/thelabdude/thelabdude-blog-dev/jboss-4.2.3/bin/recommender/ratings.txt</property>
    <property name="updateFileSizeThresholdKb">10</property>
  </component>

  <component name="recommenderConfig" auto-create="true" scope="application" 
        class="org.vinwiki.recommender.RecommenderConfig">
    <property name="recommenderType">USER_SIMILARITY</property>
    <property name="similarityClassName">org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity</property>
    <property name="neighborhoodSize">2</property>
    <property name="minSimilarity">0.7</property>
    <property name="samplingRate">0.2</property>    
  </component>
Looks easy enough, but what about all those parameters to the recommenderConfig? Thankfully, Mahout provides a powerful tool to help you determine the correct values to use for each of these settings for your data - RecommenderEvaluator.

Evaluating a Mahout Recommender
It should be clear that the optimal recommender for your data requires experimentation with how you represent preferences, calculate user or item similarity, and the size of user neighborhood. Mahout provides an easy way to compare the results for different configuration options using a RecommenderEvaluator. Currently, there are two concrete RecommenderEvaluator implementations:
  • AverageAbsoluteDifferenceRecommenderEvaluator - computes the average absolute difference between predicted and actual ratings for users.
  • RMSRecommenderEvaluator - computes the "root mean squared" difference between predicted and actual ratings for users
I chose to use the RMSRecommenderEvaluator because it penalizes bad recommendations more heavily than the AverageAbsoluteDifference evaluator. When doing evaluations, the lowest score is best. Notice in the code snippet below how the RecommenderConfig (part of VinWiki) helps you run evaluations:

RecommenderConfig config = new RecommenderConfig();
config.setRecommenderType(RecommenderType.USER_SIMILARITY);
config.setSimilarityClassName(simClass.getName());
config.setNeighborhoodSize(c);
config.setMinSimilarity(minSimilarity);
config.setSamplingRate(samplingRate);

RecommenderBuilder builder = config.getBuilder();
double score = evaluator.evaluate(builder, 
                 null, // no DataModelBuilder
                 recommenderDataModel, 
                 0.8, // training data pct
                 1); // use all users

For VinWiki, I developed a Seam ComponentTest to run evaluations. At this point, the output is not as important as the process, since the results are based on simulated ratings data (VinWiki is not yet a live application with real users). This is a problem facing any new application that uses machine-learning algorithms that require real user input. One idea to get real user input is to use Amazon's Mechanical Turk service to hire users to create real user-item interactions for your application. Regardless of how you seed your application with real user data, the approach in src/test/org/vinwiki/RecommenderTest.java should still be useful to you.

Conclusion
So this concludes the four-part series on VinWiki. As you can see, integrating Mahout is easy, but it does require experimentation and tuning. You also have to be cognizant of scalability issues, such as how often to refresh your recommender. The framework I added to VinWiki should be useful for your application too. Please leave comments on my blog if you have questions or would like to suggest improvements to any of the features I discussed in any of the four posts.