This is the final post in a four-part series about a wine rating and recommendation Web application built using open source Java technology. The purpose of this series is to document key design and implementation decisions that can be applied to other Web applications. Please read the first, second, and third posts to get up to speed. You can download the project (with source) from here.
In this posting, I lay the foundation for making recommendations using Apache Mahout v. 0.3. For a thorough introduction to Mahout, I recommend Mahout in Action.
Specifically, I use Mahout's collaborative filtering features to make wine recommendations based on ratings by VinWiki users. Collaborative filtering produces recommendations based on user preferences for items and does not require knowledge of the specific properties of the items. In contrast, content-based recommendation relies on detailed knowledge of the properties of items. This means that content-based recommendation engines are domain-specific, whereas Mahout's collaborative filtering approach can work in any domain, provided it has sufficient user-item preference data to work with.
For VinWiki, I experimented with three basic types of Mahout Recommenders:
- User Similarity
- Item Similarity
- SlopeOne
To decide which one of these recommenders is best for your application, you need to consider four key questions:
- How to represent a user's preference for an item?
- What is the ratio of items to users?
- How do you determine the similarity between users or between items?
- If using UserSimilarity, what is the size of a user neighborhood?
For all wines W that user A has NOT expressed a preference for
    For every other user B (in A's neighborhood) that has expressed a preference for W
        Compute the similarity S between user A and B
        Add user B's preference X for W, weighted by S, to a running average preference
Sort wines by weighted average preference
Return top R wines from sorted collection as recommendations
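To make the loop structure concrete, here is a minimal plain-Java sketch of the pseudo-code. It is not Mahout's implementation: the class name `UserBasedSketch`, the toy `similarity` function, and the choice to treat every other user as part of A's neighborhood are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of user-based recommendation; names are hypothetical. */
class UserBasedSketch {

    /** Toy similarity (an assumption): 1 / (1 + mean absolute difference on co-rated wines). */
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sum = 0.0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += Math.abs(e.getValue() - other);
                n++;
            }
        }
        return n == 0 ? 0.0 : 1.0 / (1.0 + sum / n);
    }

    /** ratings maps userId -> (wineId -> score); returns up to topR wine ids for userA. */
    static List<Long> recommend(long userA, Map<Long, Map<Long, Double>> ratings, int topR) {
        Map<Long, Double> prefsA = ratings.get(userA);
        if (prefsA == null) {
            prefsA = new HashMap<>();
        }
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (Map.Entry<Long, Map<Long, Double>> entry : ratings.entrySet()) {
            if (entry.getKey().longValue() == userA) {
                continue; // skip user A
            }
            double s = similarity(prefsA, entry.getValue());
            if (s <= 0.0) {
                continue; // nothing in common with A
            }
            for (Map.Entry<Long, Double> pref : entry.getValue().entrySet()) {
                if (prefsA.containsKey(pref.getKey())) {
                    continue; // A already rated this wine
                }
                // accumulate B's preference X weighted by similarity S
                weightedSum.merge(pref.getKey(), s * pref.getValue(), Double::sum);
                weightTotal.merge(pref.getKey(), s, Double::sum);
            }
        }
        // sort wines by weighted average preference, highest first
        List<Long> wines = new ArrayList<>(weightedSum.keySet());
        wines.sort((w1, w2) -> Double.compare(
                weightedSum.get(w2) / weightTotal.get(w2),
                weightedSum.get(w1) / weightTotal.get(w1)));
        return new ArrayList<>(wines.subList(0, Math.min(topR, wines.size())));
    }
}
```

In this sketch, a user who agrees closely with A on shared wines pulls the weighted average toward their own scores, while a dissimilar user contributes very little.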
Intuitively, this approach makes sense. From the pseudo-code above, it should be clear that we need a way to calculate the similarity S between two users A and B, which Mahout represents as an org.apache.mahout.cf.taste.similarity.UserSimilarity. Also, notice that the algorithm weights recommendations by user similarity, which means that the more similar a user is to you, the more heavily their preferences count in making recommendations. Consequently, the choice of similarity calculation is very important to making good recommendations. Mahout provides a number of concrete implementations of the UserSimilarity interface; see the org.apache.mahout.cf.taste.impl.similarity package.
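As one example of what a similarity calculation involves, the sketch below computes a Pearson correlation over the scores two users gave to the same wines. This is plain Java written for illustration, not the code behind Mahout's PearsonCorrelationSimilarity; the class name `PearsonSketch` and the decision to return NaN for undefined cases are my assumptions.

```java
/** Illustrative Pearson correlation over paired ratings; not Mahout code. */
class PearsonSketch {

    /**
     * x[i] and y[i] are the two users' scores for the same wine.
     * Returns a value in [-1, 1], or NaN when the correlation is undefined
     * (fewer than two co-rated wines, or one user gave identical scores).
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            return Double.NaN;
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        double denom = Math.sqrt(varX) * Math.sqrt(varY);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }
}
```

Two users whose scores rise and fall together score close to 1 even if one of them rates consistently higher than the other, which is one reason a correlation-based measure can suit point-scale ratings.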
In practice, most systems that need to produce recommendations have many users and calculating a similarity between all users is too computationally expensive. Thus, Mahout uses the concept of a user neighborhood to limit the number of similarity calculations to a smaller subset of similar users. This introduces another question that needs to be answered when building your recommender: What is the optimal size of the user-neighborhood for my data?
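The selection step behind a neighborhood can be sketched as follows: keep at most N users, most similar first, and drop anyone below a minimum similarity. This is a simplified plain-Java illustration with hypothetical names (`NeighborhoodSketch`, `nearestN`); it assumes the candidate similarities have already been computed, so it shows only how the neighborhood is chosen, not how the cost of computing similarities is reduced.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative nearest-N neighborhood selection; not Mahout code. */
class NeighborhoodSketch {

    /**
     * similarities[i] is the similarity between the target user and candidate i.
     * Returns the indexes of up to n candidates with similarity >= minSimilarity,
     * ordered from most to least similar. NaN similarities are ignored.
     */
    static List<Integer> nearestN(double[] similarities, int n, double minSimilarity) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < similarities.length; i++) {
            if (!Double.isNaN(similarities[i]) && similarities[i] >= minSimilarity) {
                candidates.add(i);
            }
        }
        // most similar candidates first
        candidates.sort((a, b) -> Double.compare(similarities[b], similarities[a]));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

A small N with a high threshold yields few but trustworthy neighbors; with sparse ratings data it may yield none at all, in which case no recommendations can be produced for that user.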
- DataModelProvider
- RecommenderConfig
NearestNUserNeighborhood[3,0.6,0.8,EuclideanDistanceSimilarity] recommended [1887, 286, 1120, 1350, 520, 1905] wines to A_test0
Neighbor(43) A_test30
    rated Wine 1120 91.0 pts
    rated Wine 1350 87.0 pts
Neighbor(33) A_test20
    rated Wine 1887 88.0 pts
    rated Wine 1350 88.0 pts
Neighbor(63) A_test50
    rated Wine 1350 90.0 pts
Whenever a user rates a wine, the ratingHome component will raise the App.WINE_RATED_BY_USER event. The MahoutWineRecommender component observes this event and passes it to the DataModelProvider.
@Observer(App.WINE_RATED_BY_USER)
@Asynchronous
public void onWineRatedByUser(Rating r) {
    // Let the model provider know that data has changed ...
    if (dataModelProvider.updateDataModel(r.getUser().getId(), r.getWine().getId(), r.getScore())) {
        // provider indicates that we should refresh the recommender
        recommender.refresh(null);
    }
}
In VinWiki, filtering recommendations by preferences is provided by the org.vinwiki.recommender.PreferencesIDRescorer class. If you revisit the pseudo-code above, then it should be obvious that the IDRescorer may need to evaluate the filter on a large number of wines. Thus, the IDRescorer should be implemented in an efficient manner; I used the Lucene native API to iterate over all wines to build and cache a Mahout FastIDSet of wine Ids that can be recommended to the current user.
// Using Lucene to initialize a Mahout FastIDSet for rescoring
int maxDoc = reader.maxDoc();
for (int docId = 0; docId < maxDoc; docId++) {
    // skip documents that have been deleted from the index
    if (reader.isDeleted(docId))
        continue;
    try {
        // load only the fields needed to evaluate the filter
        doc = reader.document(docId, getFieldSelector());
    } catch (Exception zzz) {
        ...
    }
    if (doc == null)
        continue;
    Long wineId = Long.valueOf(doc.get(ID));
    String type = doc.get(TYPE);
    String style = doc.get(STYLE);
    Long regionId = Long.valueOf(doc.get(REGION));
    // ask the User's Preferences object if this wine is enabled
    if (prefs.checkWineFilter(wineId, type, style, regionId)) {
        idSet.add(wineId);
    }
}
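Once the set of allowed wine IDs is cached, the filter check that runs for every candidate wine can be a constant-time lookup. The sketch below illustrates that idea and is not the actual PreferencesIDRescorer: `RescorerSketch` is a local stand-in that mirrors the two methods of Mahout's IDRescorer interface as I understand them, `AllowedIdsRescorer` is a hypothetical name, and a `java.util.HashSet` stands in for FastIDSet so the example runs without Mahout on the classpath.

```java
import java.util.HashSet;
import java.util.Set;

/** Local stand-in for the shape of Mahout's IDRescorer (assumption for illustration). */
interface RescorerSketch {
    double rescore(long id, double originalScore);
    boolean isFiltered(long id);
}

/** Filters out any wine whose id is not in a precomputed set of allowed ids. */
class AllowedIdsRescorer implements RescorerSketch {

    private final Set<Long> allowedIds;

    AllowedIdsRescorer(Set<Long> allowedIds) {
        // defensive copy so later changes by the caller do not affect this rescorer
        this.allowedIds = new HashSet<>(allowedIds);
    }

    /** Leaves the score untouched; this sketch only filters. */
    public double rescore(long id, double originalScore) {
        return originalScore;
    }

    /** A wine is filtered out when the user's preferences did not enable it. */
    public boolean isFiltered(long id) {
        return !allowedIds.contains(id);
    }
}
```

Doing the expensive index scan once and caching the result keeps the per-wine check cheap, which matters because the recommender may ask about a large number of wines for each request.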
There is one subtle aspect to the current implementation in that it does not refresh during the user's session as new wines are added to the search index. In other words, you are not going to see any code that tries to update the rescorer after new wines are added to the system. Remember that our recommendations are based on user-item interactions and new wine objects are not going to have enough (if any) ratings to impact the current user's session. However, the rescorer is refreshed if the user changes their preferences.
At startup, the MahoutWineRecommender uses the DataModel and RecommenderConfig to initialize a Recommender. The Recommender is held in application-scope because it is expensive to build and should be re-used for all recommendation requests from FetchRecommended objects (see Server-side Pagination from the first posting in this series). The following code snippet gives you an idea of how to construct a User-based recommender with Mahout:
// see RecommenderConfig.java
UserSimilarity userSimilarity = createUserSimilarity(dataModel);
UserNeighborhood neighborhood = createUserNeighborhood(userSimilarity, dataModel);
return new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
Here is an example configuration from components.xml. NOTE: You must set the fileDataModelFileName to a valid path on your server before running the sample!
<component name="dataModelProvider" auto-create="true" scope="application" class="org.vinwiki.recommender.DataModelProvider">
    <property name="fileDataModelFileName">/home/thelabdude/thelabdude-blog-dev/jboss-4.2.3/bin/recommender/ratings.txt</property>
    <property name="updateFileSizeThresholdKb">10</property>
</component>
<component name="recommenderConfig" auto-create="true" scope="application" class="org.vinwiki.recommender.RecommenderConfig">
    <property name="recommenderType">USER_SIMILARITY</property>
    <property name="similarityClassName">org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity</property>
    <property name="neighborhoodSize">2</property>
    <property name="minSimilarity">0.7</property>
    <property name="samplingRate">0.2</property>
</component>
- AverageAbsoluteDifferenceRecommenderEvaluator - computes the average absolute difference between predicted and actual ratings for users.
- RMSRecommenderEvaluator - computes the "root mean squared" difference between predicted and actual ratings for users
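To show what the two scores measure, here is a small plain-Java sketch that computes both metrics from paired predicted and actual ratings. It is not the Mahout evaluator code; the class and method names (`EvaluationSketch`, `averageAbsoluteDifference`, `rootMeanSquared`) are hypothetical, and the sketch assumes the two arrays are non-empty and have the same length.

```java
/** Illustrative error metrics over paired predicted/actual ratings; not Mahout code. */
class EvaluationSketch {

    /** Mean of |predicted - actual| across all pairs. */
    static double averageAbsoluteDifference(double[] predicted, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    /** Square root of the mean squared difference; penalizes large misses more heavily. */
    static double rootMeanSquared(double[] predicted, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            total += diff * diff;
        }
        return Math.sqrt(total / predicted.length);
    }
}
```

For predictions of 90, 85, and 88 against actual ratings of 88, 85, and 92, the average absolute difference is 2.0 and the root mean squared difference is about 2.58. Both are error measures, so lower scores indicate a better recommender.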
RecommenderConfig config = new RecommenderConfig();
config.setRecommenderType(RecommenderType.USER_SIMILARITY);
config.setSimilarityClassName(simClass.getName());
config.setNeighborhoodSize(c);
config.setMinSimilarity(minSimilarity);
config.setSamplingRate(samplingRate);
RecommenderBuilder builder = config.getBuilder();
double score = evaluator.evaluate(builder,
        null, // no DataModelBuilder
        recommenderDataModel,
        0.8, // training data pct
        1); // use all users
For VinWiki, I developed a Seam ComponentTest to run evaluations. At this point, the output is not as important as the process, since the results are based on simulated ratings data (VinWiki is not yet a live application with real users). This is a problem facing any new application that uses machine-learning algorithms that require real user input. One idea to get real user input is to use Amazon's Mechanical Turk service to hire users to create real user-item interactions for your application. Regardless of how you seed your application with real user data, the approach in src/test/org/vinwiki/RecommenderTest.java should still be useful to you.
Great Article !! I downloaded the source and it builds without errors. On deploying the app I get the following error:
20:44:27,565 INFO [IndexHelper] Observed event org.vinwiki.event.INIT_SUCCESS from Thread QuartzScheduler1_Worker-3
20:44:27,585 ERROR [AsynchronousExceptionHandler] Exeception thrown whilst executing asynchronous call
java.lang.NullPointerException
at org.vinwiki.search.IndexHelper.checkIndexOnStartup(IndexHelper.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:48)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:600)
at org.jboss.seam.util.Reflections.invoke(Reflections.java:22)
at org.jboss.seam.intercept.RootInvocationContext.proceed(RootInvocationContext.java:32)
at org.jboss.seam.intercept.SeamInvocationContext.proceed(SeamInvocationContext.java:56)
at org.jboss.seam.transaction.RollbackInterceptor.aroundInvoke(RollbackInterceptor.java:28)
at org.jboss.seam.intercept.SeamInvocationContext.proceed(SeamInvocationContext.java:68)
at org.jboss.seam.core.BijectionInterceptor.aroundInvoke(BijectionInterceptor.java:77)
at org.jboss.seam.intercept.SeamInvocationContext.proceed(SeamInvocationContext.java:68)
Are you using JBoss 4.2.x? By the looks of this stack trace, something went horribly wrong during app startup. Also, did you set the fileDataModelFileName parameter to a valid path on your server before running the sample?
Deployed this on Jboss 4.2.3 and it works!! Was having trouble with JBoss 5.1. Fails to create the lucene index. Any ideas on how to get this to work on Jboss 5.1 ? I am going to try by replacing the jars for lucene and hibernate search with the latest stable versions.
Thanks !!
I think it has to do with the EntityManagerFactory not being deployed correctly on 5.x.