thelabdude: 2010

Thursday, September 23, 2010

VinWiki Part 4: Making Recommendations with Mahout

Introduction

This is the final post in a four-part series about a wine rating and recommendation Web application built using open source Java technology. The purpose of this series is to document key design and implementation decisions that can be applied to other Web applications. Please read the first, second, and third posts to get up to speed. You can download the project (with source) from here.

In this posting, I lay the foundation for making recommendations using Apache Mahout v. 0.3. For a thorough introduction to Mahout, I recommend Mahout in Action.

Collaborative Filtering in Mahout

Mahout's main goal is to provide scalable machine-learning libraries for classification, clustering, frequent itemset mining, and recommendations. Classification assigns a category (or class) from a fixed set of known categories to an un-categorized document. For example, some feed readers assign articles to broad categories like Sports or Politics using classification techniques. Clustering assigns documents to groups of similar documents using some notion of similarity between the documents in the group. For example, Google News uses clustering to group articles from different publishers that cover the same basic story. Frequent itemset mining determines which items, such as products in a shopping cart, typically occur together.

In this posting, I leverage the collaborative filtering features of Mahout to make wine recommendations based on ratings by VinWiki users. Collaborative filtering produces recommendations based on user preferences for items and does not require knowledge of the specific properties of the items. In contrast, content-based recommendation produces recommendations based on detailed knowledge of the properties of items. This implies, of course, that content-based recommendation engines are domain-specific, whereas Mahout's collaborative filtering approach can work in any domain provided it has sufficient user-item preference data to work with.

For VinWiki, I experimented with three basic types of Mahout Recommenders:

User Similarity
Item Similarity
SlopeOne

Check out the Mahout Web site for information about other more experimental recommenders, such as one based on Singular value decomposition (SVD).

To decide which one of these recommenders is best for your application, you need to consider four key questions:

How to represent a user's preference for an item?
What is the ratio of items to users?
How do you determine the similarity between users or between items?
If using UserSimilarity, what is the size of a user neighborhood?

As I'll demonstrate below, Mahout provides a framework to allow you to answer these questions by analyzing your data.

User Preferences

What constitutes a user preference for an item in your application? Is it a boolean "like" or "dislike" or does the preference have a strength, such as "I like Chardonnay but like Sauvignon Blanc better"? The structure of user-item preference data used in Mahout is surprisingly simple: userID, itemID, score, where score represents the strength of the user's preference for the item, see org.apache.mahout.cf.taste.model.Preference. There are two concrete implementations of the Preference interface in Mahout: BooleanPreference and GenericPreference. For VinWiki, I use GenericPreference because I chose to allow users to give a score for a wine.

Basic Structure of a UserSimilarity Recommender

Let's take a look at the basic approach Mahout takes to make a UserSimilarity based recommendation using VinWiki nomenclature:

For all wines W that user A has NOT expressed a preference for
  For every other user B (in A's neighborhood) that has expressed a preference for W
    Compute the similarity S between user A and B
    Add user B's preference X for W, weighted by S, to a running average preference for W
Sort wines by weighted average preference
Return the top R wines from the sorted collection as recommendations

Intuitively, this approach makes sense. From the pseudo-code above, it should be clear that we need a way to calculate the similarity S between two users A and B, which is represented in Mahout as an org.apache.mahout.cf.taste.similarity.UserSimilarity. Also, notice that the algorithm weights recommendations by user similarity, which means that the more similar a user is to you, the more heavily their preferences count in making recommendations. Consequently, the selection of the similarity calculation is very important to making good recommendations. Mahout provides a number of concrete implementations of the UserSimilarity interface; see the org.apache.mahout.cf.taste.impl.similarity package.
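To make the pseudo-code above concrete, here is a minimal sketch of the similarity-weighted average in plain Java. It does not use the Mahout API; the class name UserBasedSketch, the map-based data layout, and the sample ratings and similarity values in demo() are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain-Java version of the pseudo-code, not Mahout's implementation.
class UserBasedSketch {

    // ratings: userId -> (wineId -> score); similarityToA: neighbor userId -> similarity S
    static Map<Long, Double> estimate(long userA,
                                      Map<Long, Map<Long, Double>> ratings,
                                      Map<Long, Double> similarityToA) {
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> totalWeight = new HashMap<>();
        Map<Long, Double> ratedByA =
                ratings.getOrDefault(userA, Collections.<Long, Double>emptyMap());

        for (Map.Entry<Long, Double> neighbor : similarityToA.entrySet()) {
            long userB = neighbor.getKey();
            double s = neighbor.getValue();
            Map<Long, Double> ratedByB = ratings.get(userB);
            if (userB == userA || s <= 0.0 || ratedByB == null) {
                continue;
            }
            for (Map.Entry<Long, Double> pref : ratedByB.entrySet()) {
                long wine = pref.getKey();
                if (ratedByA.containsKey(wine)) {
                    continue; // A already expressed a preference for this wine
                }
                weightedSum.merge(wine, s * pref.getValue(), Double::sum);
                totalWeight.merge(wine, s, Double::sum);
            }
        }

        Map<Long, Double> estimates = new HashMap<>();
        for (Map.Entry<Long, Double> e : weightedSum.entrySet()) {
            estimates.put(e.getKey(), e.getValue() / totalWeight.get(e.getKey()));
        }
        return estimates;
    }

    // Sort wines by estimated preference (highest first) and keep the top r.
    static List<Long> topR(Map<Long, Double> estimates, int r) {
        List<Long> wines = new ArrayList<>(estimates.keySet());
        wines.sort((a, b) -> Double.compare(estimates.get(b), estimates.get(a)));
        return wines.subList(0, Math.min(r, wines.size()));
    }

    // Made-up data: user 1 is "A"; users 2 and 3 are neighbors.
    static List<Long> demo() {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        ratings.put(1L, new HashMap<>());
        ratings.put(2L, new HashMap<>());
        ratings.put(3L, new HashMap<>());
        ratings.get(1L).put(10L, 90.0);
        ratings.get(2L).put(20L, 92.0);
        ratings.get(2L).put(30L, 85.0);
        ratings.get(3L).put(10L, 88.0);
        ratings.get(3L).put(20L, 80.0);
        ratings.get(3L).put(30L, 95.0);

        Map<Long, Double> similarityToA = new HashMap<>();
        similarityToA.put(2L, 0.9);
        similarityToA.put(3L, 0.5);

        return topR(estimate(1L, ratings, similarityToA), 2);
    }
}
```

With the sample data, wine 30 gets an estimate of about 88.6 and wine 20 about 87.7, so wine 30 is ranked first even though the most similar neighbor rated wine 20 higher.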

In practice, most systems that need to produce recommendations have many users and calculating a similarity between all users is too computationally expensive. Thus, Mahout uses the concept of a user neighborhood to limit the number of similarity calculations to a smaller subset of similar users. This introduces another question that needs to be answered when building your recommender: What is the optimal size of the user-neighborhood for my data?
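As a rough illustration of the neighborhood idea, the sketch below keeps only the N most similar users whose similarity meets a minimum threshold. It is plain Java rather than Mahout's NearestNUserNeighborhood; the class name and the similarity values in demo() are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: selects a user neighborhood from precomputed similarities.
class NeighborhoodSketch {

    static List<Long> nearestN(Map<Long, Double> similarityToA, int n, double minSimilarity) {
        // Drop candidates that are not similar enough.
        List<Map.Entry<Long, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Long, Double> e : similarityToA.entrySet()) {
            if (e.getValue() >= minSimilarity) {
                candidates.add(e);
            }
        }
        // Most similar first.
        candidates.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));

        List<Long> neighbors = new ArrayList<>();
        for (int i = 0; i < Math.min(n, candidates.size()); i++) {
            neighbors.add(candidates.get(i).getKey());
        }
        return neighbors;
    }

    // Made-up similarities for five candidate users.
    static List<Long> demo() {
        Map<Long, Double> sims = new HashMap<>();
        sims.put(2L, 0.91);
        sims.put(3L, 0.84);
        sims.put(4L, 0.77);
        sims.put(5L, 0.70);
        sims.put(6L, 0.35);
        return nearestN(sims, 3, 0.6);
    }
}
```

A larger N or a lower threshold pulls in more (and less similar) users, which is why the neighborhood size is worth tuning against your own data.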

Mahout also allows you to make recommendations based on similarity between Items. Don't confuse Mahout's Item-based recommender with content-based recommenders since it is still based on user-item interactions and not the content of items.

Using Mahout in VinWiki

The main service for creating recommendations at runtime is the MahoutWineRecommender, which is an application-scoped Seam component. The MahoutWineRecommender has two dependencies injected during initialization:

DataModelProvider
RecommenderConfig

DataModel

The org.vinwiki.recommender.DataModelProvider Seam component (configured in components.xml) provides a Mahout DataModel to the recommender. For now, I'm using Mahout's FileDataModel, which as you might have guessed, reads preference data from a flat file. During startup, if this file doesn't exist, then the DataModelProvider reads wine ratings from the database and writes them to a new file.

Sample Wine Ratings Data

As this is just an example Web application, I don't have real wine ratings data. Consequently, I generated some fake data that recommends wines to sample users based on the first letter of the user name. For example, the data will cause Mahout to recommend wines that start with the letter "A" to the "A_test0" user. Here is some log output to demonstrate how the sample ratings data works:

NearestNUserNeighborhood[3,0.6,0.8,EuclideanDistanceSimilarity] 
  recommended [1887, 286, 1120, 1350, 520, 1905] wines to A_test0
     Neighbor(43) A_test30
         rated Wine 1120 91.0 pts
         rated Wine 1350 87.0 pts
     Neighbor(33) A_test20
         rated Wine 1887 88.0 pts
         rated Wine 1350 88.0 pts
     Neighbor(63) A_test50
         rated Wine 1350 90.0 pts

Notice that the user names of A_test0's neighbors also start with "A_". When I created the sample ratings data, I had users rate wines that begin with the same letter a little higher than they rated other wines. You can try this yourself after deploying the application to JBoss 4.2.3 by logging in with username "A_test0" and password "P@$$w0rD" (without the quotes of course).
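One way to produce this kind of biased sample data is sketched below. This is not the generator shipped with VinWiki; the class name, the 80 to 87 base range, the 5-point bonus, and the wine names are all assumptions chosen for illustration.

```java
import java.util.Random;

// Illustrative only: fake ratings that favor wines sharing the user's first letter.
class SampleRatingsSketch {

    // Hypothetical bonus; the post only says such wines are rated "a little higher".
    static final double SAME_LETTER_BONUS = 5.0;

    static double score(String userName, String wineName, Random rnd) {
        double base = 80 + rnd.nextInt(8); // random base score between 80 and 87
        boolean sameLetter = Character.toUpperCase(userName.charAt(0))
                == Character.toUpperCase(wineName.charAt(0));
        return sameLetter ? base + SAME_LETTER_BONUS : base;
    }
}
```

Calling score("A_test0", "Alpha Red", rnd) yields a value of at least 85, while a wine starting with any other letter never scores above 87, so users with the same first letter end up with correlated preferences.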

Refreshing the Recommender

When I first started working with Mahout, it wasn't clear how to handle data model changes at runtime because most of the built-in Mahout examples work with static, pre-existing datasets. In VinWiki, rating wines is a primary activity, so preference data will be changing frequently. Moreover, if a user provides several new ratings in a session, then they'll expect to have some recommendations based on those new ratings or they will think the site is broken and probably not return. Consequently, it's very important for this application to incorporate recent user activity into recommendations in near real-time.

Whenever a user rates a wine, the ratingHome component will raise the App.WINE_RATED_BY_USER event. The MahoutWineRecommender component observes this event and passes it to the DataModelProvider.

@Observer(App.WINE_RATED_BY_USER)
@Asynchronous
public void onWineRatedByUser(Rating r) {
    // Let the model provider know that data has changed ...
    if (dataModelProvider.updateDataModel(r.getUser().getId(), r.getWine().getId(), r.getScore())) {
        // provider indicates that we should refresh the recommender
        recommender.refresh(null);
    }
}

In response to this event, the DataModelProvider component can choose to update its internal state to reflect the change. In my current implementation, the DataModelProvider uses a nice feature provided by Mahout's FileDataModel by writing updates to a smaller "delta" file. The FileDataModel will load these additional "delta" files when it is refreshed. So that covers updating the DataModel, but what about the Recommender and its other dependencies, such as UserSimilarity and UserNeighborhood? In my implementation, the DataModelProvider makes the decision of whether the Recommender should be refreshed. This allows a more sophisticated DataModelProvider implementation to batch up changes so that the recommender is not refreshed too often as refreshing a recommender and its dependencies can be an expensive operation for large data sets.
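The batching idea can be sketched as a small policy object that counts the bytes appended to the delta file and only signals a refresh once a size threshold is crossed. This is one plausible policy, not the actual DataModelProvider code; the class name, the comma-separated line layout, and the way the threshold is interpreted are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: decides when enough updates have accumulated to justify a refresh.
class RefreshPolicySketch {

    private final long thresholdBytes;
    private long pendingBytes;

    RefreshPolicySketch(int thresholdKb) {
        this.thresholdBytes = thresholdKb * 1024L;
    }

    // Records one rating written to the delta file; returns true when a refresh is due.
    synchronized boolean recordUpdate(long userId, long wineId, double score) {
        String line = userId + "," + wineId + "," + score + "\n";
        pendingBytes += line.getBytes(StandardCharsets.UTF_8).length;
        if (pendingBytes >= thresholdBytes) {
            pendingBytes = 0; // start a new batch
            return true;
        }
        return false;
    }

    // Counts how many identical 9-byte updates it takes to cross a 1 KB threshold.
    static int demo() {
        RefreshPolicySketch policy = new RefreshPolicySketch(1);
        int updates = 0;
        boolean refresh = false;
        while (!refresh) {
            refresh = policy.recordUpdate(1L, 2L, 90.0);
            updates++;
        }
        return updates;
    }
}
```

A very small threshold behaves like refreshing on every rating, which keeps recommendations fresh at the cost of more frequent (and potentially expensive) refreshes.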

Accounting for User Preferences

Users can de-select wines they are not interested in using the Preferences dialog. Changes to a user's preferences should be reflected in recommendations. For example, if a user indicates that they are not interested in white wine, then we should not recommend any white wines to them. Mahout allows you to inject this type of filtering on recommendations using an org.apache.mahout.cf.taste.recommender.IDRescorer.

In VinWiki, filtering recommendations by preferences is provided by the org.vinwiki.recommender.PreferencesIDRescorer class. If you revisit the pseudo-code above, then it should be obvious that the IDRescorer may need to evaluate the filter on a large number of wines. Thus, the IDRescorer should be implemented in an efficient manner; I used the Lucene native API to iterate over all wines to build and cache a Mahout FastIDSet of wine IDs that can be recommended to the current user.

// Using Lucene to initialize a Mahout FastIDSet for rescoring
int maxDoc = reader.maxDoc();
for (int docId = 0; docId < maxDoc; docId++) {
    // skip documents that have been deleted from the index
    if (reader.isDeleted(docId))
        continue;

    // load only the stored fields needed for filtering
    Document doc = null;
    try {
        doc = reader.document(docId, getFieldSelector());
    } catch (Exception e) {
        ...
    }
    if (doc == null)
        continue;

    Long wineId = Long.valueOf(doc.get(ID));
    String type = doc.get(TYPE);
    String style = doc.get(STYLE);
    Long regionId = Long.valueOf(doc.get(REGION));

    // ask the User's Preferences object if this wine is enabled
    if (prefs.checkWineFilter(wineId, type, style, regionId)) {
        idSet.add(wineId);
    }
}

There is one subtle aspect to the current implementation in that it does not refresh during the user's session as new wines are added to the search index. In other words, you are not going to see any code that tries to update the rescorer after new wines are added to the system. Remember that our recommendations are based on user-item interactions and new wine objects are not going to have enough (if any) ratings to impact the current user's session. However, the rescorer is refreshed if the user changes their preferences.

Tip: using an IDRescorer is a simple form of content-based recommendations in that we're using specific properties of the wine objects to influence our recommendations.
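Once the set of allowed wine IDs is cached, the rescorer itself can be very small. The sketch below uses a local stand-in interface with the same two methods as Mahout's IDRescorer (rescore and isFiltered) and a java.util.HashSet in place of FastIDSet so that it runs without Mahout on the classpath. It is not the PreferencesIDRescorer from VinWiki; the class names and sample IDs are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: filters recommendations down to a precomputed set of allowed wines.
class RescorerSketch {

    // Local stand-in mirroring the two methods of Mahout's IDRescorer.
    interface IdRescorer {
        double rescore(long id, double originalScore);
        boolean isFiltered(long id);
    }

    static final class AllowedWinesRescorer implements IdRescorer {
        private final Set<Long> allowed;

        AllowedWinesRescorer(Set<Long> allowed) {
            this.allowed = allowed;
        }

        // Leave scores untouched; this rescorer only filters.
        public double rescore(long id, double originalScore) {
            return originalScore;
        }

        // A wine is filtered out unless the user's preferences allow it.
        public boolean isFiltered(long id) {
            return !allowed.contains(id);
        }
    }

    // Hypothetical allowed set for one user.
    static boolean demoFiltered(long wineId) {
        Set<Long> allowed = new HashSet<>(Arrays.asList(1887L, 286L, 1120L));
        return new AllowedWinesRescorer(allowed).isFiltered(wineId);
    }
}
```

Because isFiltered is a single set lookup, the cost of evaluating it for every candidate wine stays low; the expensive work is building the set, which happens only when preferences change.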

RecommenderConfig

org.vinwiki.recommender.RecommenderConfig is an application-scoped Seam component (configured in components.xml) that supports common options for configuring the behavior of a Recommender.

At startup, the MahoutWineRecommender uses the DataModel and RecommenderConfig to initialize a Recommender. The Recommender is held in application-scope because it is expensive to build and should be re-used for all recommendation requests from FetchRecommended objects (see Server-side Pagination from the first posting in this series). The following code snippet gives you an idea of how to construct a User-based recommender with Mahout:

// see RecommenderConfig.java
    UserSimilarity userSimilarity = createUserSimilarity(dataModel);
    UserNeighborhood neighborhood = createUserNeighborhood(userSimilarity, dataModel);
    return new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);

Here is an example configuration from components.xml. NOTE: You must set the fileDataModelFileName to a valid path on your server before running the sample!

<component name="dataModelProvider" auto-create="true" scope="application" 
        class="org.vinwiki.recommender.DataModelProvider">
    <property name="fileDataModelFileName">/home/thelabdude/thelabdude-blog-dev/jboss-4.2.3/bin/recommender/ratings.txt</property>
    <property name="updateFileSizeThresholdKb">10</property>
  </component>

  <component name="recommenderConfig" auto-create="true" scope="application" 
        class="org.vinwiki.recommender.RecommenderConfig">
    <property name="recommenderType">USER_SIMILARITY</property>
    <property name="similarityClassName">org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity</property>
    <property name="neighborhoodSize">2</property>
    <property name="minSimilarity">0.7</property>
    <property name="samplingRate">0.2</property>    
  </component>

Looks easy enough, but what about all those parameters to the recommenderConfig? Thankfully, Mahout provides a powerful tool to help you determine the correct values to use for each of these settings for your data - RecommenderEvaluator.

Evaluating a Mahout Recommender

It should be clear that the optimal recommender for your data requires experimentation with how you represent preferences, calculate user or item similarity, and the size of user neighborhood. Mahout provides an easy way to compare the results for different configuration options using a RecommenderEvaluator. Currently, there are two concrete RecommenderEvaluator implementations:

AverageAbsoluteDifferenceRecommenderEvaluator - computes the average absolute difference between predicted and actual ratings for users.
RMSRecommenderEvaluator - computes the "root mean squared" difference between predicted and actual ratings for users

I chose to use the RMSRecommenderEvaluator because it penalizes bad recommendations more heavily than the AverageAbsoluteDifference evaluator. When doing evaluations, the lowest score is best. Notice in the code snippet below how the RecommenderConfig (part of VinWiki) helps you run evaluations:

RecommenderConfig config = new RecommenderConfig();
config.setRecommenderType(RecommenderType.USER_SIMILARITY);
config.setSimilarityClassName(simClass.getName());
config.setNeighborhoodSize(c);
config.setMinSimilarity(minSimilarity);
config.setSamplingRate(samplingRate);

RecommenderBuilder builder = config.getBuilder();
double score = evaluator.evaluate(builder, 
                 null, // no DataModelBuilder
                 recommenderDataModel, 
                 0.8, // training data pct
                 1); // use all users
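To see why a root-mean-squared score penalizes bad predictions more heavily than an average absolute difference, consider the small plain-Java sketch below. It computes both metrics directly and does not use Mahout's evaluator classes; the class name and the ratings in the usage note are made up.

```java
// Illustrative only: the two error metrics computed by hand.
class ErrorMetricsSketch {

    static double averageAbsoluteDifference(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    static double rootMeanSquared(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff; // squaring makes large misses count disproportionately
        }
        return Math.sqrt(sum / predicted.length);
    }
}
```

Suppose the actual ratings are {90, 88, 85, 92}. Predictions of {91, 89, 86, 93} (every rating off by 1) and {90, 88, 85, 96} (one rating off by 4) both have an average absolute difference of 1.0, but their RMS scores are 1.0 and 2.0 respectively, so the single large miss is punished twice as hard.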

For VinWiki, I developed a Seam ComponentTest to run evaluations. At this point, the output is not as important as the process, since the results are based on simulated ratings data (VinWiki is not yet a live application with real users). This is a problem facing any new application that uses machine-learning algorithms that require real user input. One idea to get real user input is to use Amazon's Mechanical Turk service to hire users to create real user-item interactions for your application. Regardless of how you seed your application with real user data, the approach in src/test/org/vinwiki/RecommenderTest.java should still be useful to you.

Conclusion

So this concludes the four-part series on VinWiki. As you can see, integrating Mahout is easy, but it does require experimentation and tuning. You also have to be cognizant of scalability issues, such as how often to refresh your recommender. The framework I added to VinWiki should be useful for your application too. Please leave comments on my blog if you have questions or would like to suggest improvements to any of the features I discussed in any of the four posts.

Thursday, July 8, 2010

VinWiki Part 3: Authentication with Facebook Connect and Sharing Content with Friends

Introduction

This is the third post in a four-part series about a wine rating and recommendation Web application built using open source Java technology. The purpose of this series is to document key design and implementation decisions that can be applied to other Web applications. Please read the first and second posts to get up to speed. You can download the project (with source) from here.

In this posting, I implement several common features that should help users find and contribute to your application. Specifically, I integrate with Facebook Connect (based on OAuth 2) to allow Facebook users to instantly register and authenticate using their Facebook profile. From there, I integrate the Facebook Like social plug-in which allows users to share content in your application with their friends on Facebook.

Authentication using Facebook Connect

The Facebook team has made integrating Facebook Connect into your application very easy. There are open source Java libraries available for integrating with Facebook; however, I found it easier to just use the JavaScript SDK. Let's go through the process in five simple steps:

I. Register Your Application with Facebook

First, you need to register an application in Facebook to get a unique Application ID (please use something other than "VinWiki" for your application name since I'll be using that one in the near future). Update resources/WEB-INF/components.xml to set the facebookAppId property on the app component:

<component name="app" auto-create="true" scope="application" class="org.vinwiki.App">
    ...
    <property name="facebookAppId">ENTER_YOUR_FB_APPLICATION_ID_HERE</property>
  </component>

II. Initialize the Facebook JavaScript Library

Second, you need to initialize the Facebook JavaScript library when your page loads. For this, I created a new Facelets include file view/WEB-INF/facelets/facebook.xhtml and loaded it into the footer in my layout template view/layout/template.xhtml:

<h:panelGroup rendered="#{app.isFacebookEnabled()}">
    <ui:include src="/WEB-INF/facelets/facebook.xhtml"/>
  </h:panelGroup>

In facebook.xhtml, I have the following JavaScript:

(function() {
    var e = document.createElement('script'); e.async = true;
    e.src = document.location.protocol + '//connect.facebook.net/en_US/all.js';
    document.getElementById('fb-root').appendChild(e);
  }());

This function, borrowed from the Facebook developer documentation, asynchronously loads the Facebook JavaScript file into your page.

III. Register a JavaScript callback handler for Facebook session events

Once the JavaScript library is loaded, the window.fbAsyncInit function is called automatically.

window.fbAsyncInit = function() {    
    FB.init({appId:'#{app.facebookAppId}', status:true, cookie:true, xfbml:true});
    FB.Event.subscribe('auth.sessionChange', function(response) {
      if (response.session) {
        // Login successful
        var uid = response.session.uid;
        FB.api('/me', function(resp) {
          onFbLogin(uid, resp.email, resp.name, resp.first_name, resp.last_name, resp.gender);
        });
      } else {
        // The user has logged out, and the cookie has been cleared
        onFbLogout();
      }
    });
  };

After initializing the library (see FB.init), the application registers a listener for auth.sessionChange events (login or logout). On login, I use Facebook's Graph API to get some basic information about the current user (see FB.api). In the response callback handler for FB.api('/me'), I invoke a JavaScript function onFbLogin that executes the #{guest.onFbLogin()} action:

<a4j:form prependId="false">
  <s:token/>
  <a4j:jsFunction immediate="true" name="onFbLogin" ajaxSingle="true" action="#{guest.onFbLogin()}">
    ... params here ...
  </a4j:jsFunction>
</a4j:form>

Notice that I'm using Seam's <s:token/> tag to prevent cross-site request forgery since the #{guest.onFbLogin()} action simply trusts that the user was authenticated by Facebook on the client-side. For an overview of the thinking behind the <s:token/> tag, I refer you to Dan Allen's post at http://seamframework.org/Community/NewComponentTagStokenAimedToGuardAgainstCSRF. However, please realize that you must set javax.faces.STATE_SAVING_METHOD to "server" in web.xml for this method to secure your forms:

<context-param>
    <param-name>javax.faces.STATE_SAVING_METHOD</param-name>
    <param-value>server</param-value>
  </context-param>

The action handler on the server side is straightforward because Seam supports alternative authentication mechanisms out-of-the-box. Specifically, all you need to do is invoke the acceptExternallyAuthenticatedPrincipal method of the org.jboss.seam.security.Identity object. I utilized the existing guest component because it handles other guest-related actions such as register and login (see src/hot/org/vinwiki/user/GuestSupport.java).

Identity.instance().acceptExternallyAuthenticatedPrincipal(new FacebookPrincipal(user.getUserName()));
    Contexts.getSessionContext().set("currentUser", user);
    Events.instance().raiseEvent(Identity.EVENT_POST_AUTHENTICATE, Identity.instance());

I also raise the Identity.EVENT_POST_AUTHENTICATE event manually so that my nav component can re-configure the default view for the authenticated user instead of showing the guest view.

IV. Show Facebook Connect Button on Login Panel

Fourth, we need to let users know that they can log in (or register) using their Facebook credentials. This is accomplished with the <fb:login-button> tag; see view/WEB-INF/facelets/guestSupport.xhtml.

<fb:login-button perms="email,publish_stream">
    <fb:intl>Login with Facebook</fb:intl>
  </fb:login-button>

The new RichFaces Login dialog looks like this:
[Screenshot: Login Dialog with Facebook Connect]

Notice that VinWiki requests access to the email and publish_stream extended permissions. The publish_stream permission allows users to share wines of interest found on VinWiki with their friends on Facebook. When accessing VinWiki for the first time, users will see a dialog that allows them to grant permissions to the application:
[Screenshot: Facebook Extended Permissions]

V. Logout

It's doubtful whether many users will ever explicitly log out of your site unless they are accessing it from a public computer. Consequently, you'll want to keep your session timeout value as low as possible. That said, you still need to offer the ability to log out. In view/WEB-INF/facelets/headerControls.xhtml, the logout action is implemented using a simple JSF commandLink:

<h:commandLink rendered="#{nav.isFbSession()}" onclick="FB.logout();" styleClass="hdrLink">
  <h:outputText value="Logout"/>
</h:commandLink>

The magic is in our sessionChange event listener discussed above. The Facebook JavaScript function FB.logout() triggers an auth.sessionChange event, which in turn calls onFbLogout() to execute the #{guest.onFbLogout()} action on the server.

So that covers authentication using Facebook Connect. There are many other ways to integrate your application with Facebook. In the next section, I'll implement a way for users to share content in your application with their friends on Facebook. For this feature, it is helpful to have bookmarkable URLs.

Sharing Content with Friends

In VinWiki, users may want to share specific wines of interest with their friends on Facebook. For this, I used Facebook's Like social plug-in. There are other options, including just posting a shared link to the user's activity stream. There's not much to integrating the Like social plug-in into your page once you've accounted for bookmarkable URLs (see posting 1). The <fb:like> tag will use the URL of the current page if you don't specify an href attribute. However, I want to make sure the URL that is shared with Facebook is as clean as possible. Thus, I introduced a new setting for the org.vinwiki.App component named baseUrl. You should change this to match your server in resources/WEB-INF/components.xml:

<component name="app" auto-create="true" scope="application" class="org.vinwiki.App">
    ...
    <property name="baseUrl">http://192.168.1.2:8080/vinwiki/</property>
  </component>

I also decided to add the Open Graph meta tags to the header of my wine details page, view/wine.xhtml:

<meta property="og:title" content="#{currentWine.fullName}"/>
  <meta property="og:type" content="drink"/>
  <meta property="og:url" content="#{viewWine.openGraphUrl}"/>
  <meta property="og:site_name" content="VinWiki"/>
  <meta property="og:description" content="#{jsf.truncate(currentWine.description,100,'...')}"/>

Presumably, when you share a page that supports the Open Graph protocol, Facebook is able to present this additional metadata to your friends. Of course, the URL needs to be public before this will actually work ;-)
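The og:description tag above relies on a truncate helper to keep the description short. The source of that helper is not shown here, so the following is only a guess at what such a function might look like; the class name and the choice to append the suffix after the cut (rather than counting it toward the limit) are assumptions.

```java
// Illustrative only: a possible shape for a description-truncating helper.
class TruncateSketch {

    static String truncate(String text, int maxLength, String suffix) {
        if (text == null) {
            return "";
        }
        if (text.length() <= maxLength) {
            return text; // short enough, return unchanged
        }
        return text.substring(0, maxLength) + suffix;
    }
}
```

For example, truncate("abcdef", 3, "...") returns "abc...", while a string that already fits within the limit is returned as-is.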

What's Next?

In the next and final post in this series, I'll integrate Mahout for making wine recommendations and discuss some considerations for scaling the application.

Monday, June 14, 2010

VinWiki Part 2: Full-text Search with Hibernate Search and Lucene

Introduction

This is the second post in a four-part series about a wine recommendation Web application built using open source Java technology. The purpose of this series is to document key design and implementation decisions that can be applied to other Web applications. Please read the first post for an overview of the project. You can download the project (with source) from here.

In this posting, I leverage Hibernate Search and Lucene to provide full-text search for wines of interest. This post is merely an attempt to complement the already existing documentation on Lucene and Hibernate Search. If you are familiar with Hibernate and are new to Lucene, then I recommend starting out with the online manual for Hibernate Search 3.1.1 GA. In addition, I highly recommend reading Hibernate Search in Action and Lucene in Action (Second Edition); both are extremely well-written books by experts in each field.

Basic Requirements

Full-text search allows users to find wines using simple keywords, such as "zinfandel" or "dry creek valley". Users have come to expect certain features from a full-text search engine. The following table summarizes the key features our wine search engine will support:

Feature	Description
Paginated search results (see first post for solution)	Return the first 5-10 results, ordered by relevance, on the first page of the search results. Allow users to retrieve more results by advancing the page navigator. Allow users to set a bounded preference for how many items to show on each page.
Query term highlighting (view solution)	Highlight query terms in search results to allow users to quickly scan the results for the most relevant items.
Spell correction (view solution)	Ability to detect and correct for misspelled terms in the query.
Type-ahead suggestions (view solution)	Show a drop-down selection box containing suggested search terms after the user has typed at least 3 characters.
Advanced search form (view solution)	Allow users to fine-tune the search with an advanced search form.

Setup

From the previous post, you'll recall that I'm using Hibernate to manage my persistent objects. As such, I've chosen to leverage Hibernate Search (referred to as HS hereafter) to integrate my Hibernate-based object model with Lucene. For now, I'm using Hibernate Search v. 3.1.1 GA with Hibernate Core 3.3.1 GA. On the Lucene side, I'm using version 2.9.2 and a few classes from Solr 1.4. In the last posting in this series, we'll upgrade to the latest Hibernate Search and Core, which rely on JPA 2 and thus will require a move up to JBoss 6.x. As far as I know, the latest HS and Core classes have trouble on JBoss 4.2 and 5.1 because of their dependency on JPA 2 (let me know if you get it working).

Why not use the database to search?

Some may be wondering why I'm using Lucene instead of doing full-text search in my database. While possible, I feel the database is not the best tool for full-text searching. For an excellent treatment of the mismatch between relational databases and full-text search, I recommend reading Hibernate Search in Action. The first chapter provides a strong case for using Lucene instead of your database for full-text search. Just in terms of scalability, the database is the most expensive and complex layer to scale in most applications. Offloading searches to a local Lucene index on each node in your cluster reduces the amount of work your database is performing and ensures searches return quickly. Moreover, distributing a Lucene index across a cluster of app servers is almost trivial since Hibernate Search has built-in clustering support.

Project Configuration Changes

The following JAR files need to be included in your Web application's LIB directory (/WEB-INF/lib):

hibernate-search.jar
lucene-core-2.9.2.jar
lucene-highlighter-2.9.2.jar
lucene-misc-2.9.2.jar
lucene-snowball-2.9.2.jar
lucene-spatial-2.9.2.jar
lucene-spellchecker-2.9.2.jar
lucene-memory-2.9.2.jar
solr-core-1.4.0.jar
solr-solrj-1.4.0.jar

How does Seam know to use Hibernate Search's FullTextEntityManager?

Seam's org.jboss.seam.persistence.HibernatePersistenceProvider automatically detects if Hibernate Search is available on the classpath. If HS is available, then Seam uses the org.jboss.seam.persistence.FullTextEntityManagerProxy instead of the default EntityManagerProxy, meaning that you will have access to a FullTextEntityManager wherever you have a Seam-injected EntityManager. You also need to add a few more properties to your persistence deployment descriptor (resources/META-INF/persistence-*.xml):

<property name="hibernate.search.default.directory_provider" value="org.hibernate.search.store.FSDirectoryProvider"/>
  <property name="hibernate.search.default.indexBase" value="lucene_index"/>
  <property name="hibernate.search.reader.strategy" value="shared"/>
  <property name="hibernate.search.worker.execution" value="sync"/>

We'll make some adjustments to these settings as the project progresses, but these will suffice for now.

Indexing

The first step in providing full-text search capabilities is to index the content you want to search. In most cases, our users will want to find Wine objects and to a lesser extent Winery objects. Let's start by indexing Wine objects.

Indexing Wine Objects

I'll tackle indexing Wine objects in a few passes, progressively adding features, so let's start with the basics. First, we tell HS to index Wine objects using the @Indexed annotation on the class. As for what to search, I tend to favor using a single field to hold all searchable text for each object because it simplifies working with other Lucene extensions, such as the More Like This and term highlighting features.

@Field(name=DEFAULT_SEARCH_FIELD, index=Index.TOKENIZED)
    @Analyzer(definition="wine_stem_en")
    public String getContent() {
        // return one string containing all searchable text for this object
    }

Notice that the content field is Index.TOKENIZED during indexing. This means that the String value returned by the getContent method will be broken into a stream of tokens using a Lucene Analyzer; each token returned by the Analyzer will be searchable. Your choice of Analyzer depends on the type of text you are processing, which in our case is English text provided by our users when creating Wine objects. Hibernate Search leverages the text analysis framework provided by Solr. You can specify your Analyzer using the @Analyzer annotation. Behind the scenes, Hibernate Search creates the Analyzer using the Factory defined by an @AnalyzerDef class annotation. In our case, the definition is:

@AnalyzerDef(name = "wine_stem_en",
        tokenizer = @TokenizerDef(factory = org.vinwiki.search.Lucene29StandardTokenizerFactory.class),
        filters = {
          @TokenFilterDef(factory = StandardFilterFactory.class),
          @TokenFilterDef(factory = LowerCaseFilterFactory.class),
          @TokenFilterDef(factory = StopFilterFactory.class),
          @TokenFilterDef(factory = EnglishPorterFilterFactory.class)
    })

This looks more complicated than it is ... Let's take a closer look at the @AnalyzerDef annotation:

  name = "wine_stem_en"

This gives our analyzer definition a name so we can refer to it in the @Analyzer annotation. This will also come in handy when we start parsing user queries using Lucene's QueryParser, which also requires an Analyzer.

  tokenizer = 
  @TokenizerDef(factory = org.vinwiki.search.Lucene29StandardTokenizerFactory.class)

Specify a factory for Tokenizer instances. In this case, I'm supplying a custom factory class that creates a Lucene 2.9.x StandardTokenizer as the default factory provided by Solr creates a Lucene 2.4 StandardTokenizer (most likely for backwards compatibility).

  filters = {
    @TokenFilterDef(factory = StandardFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = StopFilterFactory.class),
    @TokenFilterDef(factory = EnglishPorterFilterFactory.class)
  }

A tokenizer can pass tokens through a filter-chain to perform various transformations on each token. In this case, we're passing the tokens through four filters, in the order listed above.

StandardFilter: Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e., those produced by StandardTokenizer or equivalent.
LowerCaseFilter: Lowercases letters in tokens.
StopFilter: Discards common English words like "an", "the", and "of".
EnglishPorterFilter: Extracts the stem for each token using Porter Stemmer implemented by the Lucene Snowball extension.
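
To make the effect of the chain concrete, here is a minimal plain-Java sketch that mimics the four stages on a raw string. It does not use the Lucene or Solr classes; the class name AnalyzerChainSketch, the tiny stop list, and the toyStem method are assumptions made purely for illustration, and toyStem is a crude stand-in rather than the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the wine_stem_en chain; no Lucene or Solr classes involved.
final class AnalyzerChainSketch {

    // Tiny illustrative stop list (an assumption; the real StopFilter uses a longer one).
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "with"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Tokenize: split on anything that is not a letter, digit, or apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{Nd}']+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw;
            // StandardFilter-like step: drop a trailing possessive 's.
            if (token.endsWith("'s") || token.endsWith("'S")) {
                token = token.substring(0, token.length() - 2);
            }
            // LowerCaseFilter-like step.
            token = token.toLowerCase(Locale.ENGLISH);
            // StopFilter-like step: discard common words.
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            // Stemming step, using the toy stemmer below.
            tokens.add(toyStem(token));
        }
        return tokens;
    }

    // Crude suffix rules that only imitate a few Porter-style outcomes.
    static String toyStem(String token) {
        if (token.length() > 4 && token.endsWith("ies")) {
            return token.substring(0, token.length() - 3) + "i";
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        if (token.length() > 3 && token.endsWith("y")) {
            return token.substring(0, token.length() - 1) + "i";
        }
        return token;
    }
}
```

Calling AnalyzerChainSketch.analyze("The Winery's velvety Tannins and strawberries") drops "The" and "and" and yields the tokens wineri, velveti, tannin, and strawberri. The order of the steps matters: stop words are matched after lowercasing, and stemming runs last so the stop list can contain ordinary, unstemmed words.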

You can index other fields, such as the wine type and style using the @Field annotation.

@Field(name="type",index=Index.UN_TOKENIZED,store=Store.YES)    
    public WineType getType() {
        return this.type;
    }

Building the Index

If you worked through the first post, then you should already have a database containing 2,025 wines. Hibernate Search will automatically index new Wine objects for us when the insert transaction commits, but we need to manually index the existing wines. During startup, the org.vinwiki.search.IndexHelper component counts the number of documents in the index and re-builds the index from the objects in the database if needed. You should see log output similar to:

[IndexHelper] Observed event org.vinwiki.event.REBUILD_INDEX from Thread QuartzScheduler1_Worker-8
[IndexHelper] Re-building Wine index for 2025 objects.
[IndexHelper] Flushed index update 300 from Thread Quartz ...
[IndexHelper] Flushed index update 600 from Thread Quartz ...
[IndexHelper] Flushed index update 900 from Thread Quartz ...
[IndexHelper] Flushed index update 1200 from Thread Quartz ...
[IndexHelper] Flushed index update 1500 from Thread Quartz ...
[IndexHelper] Flushed index update 1800 from Thread Quartz ...
[IndexHelper] Took 4094 (ms) to re-build the index containing 2025 documents.

Alright, with a few lines of code and some clever annotations, we now have a full-text search index for Wine objects. Let's do some searching!

Basic Search Box

From the first post, we already have a server-side pagination framework in place for displaying Wine objects. To integrate full-text search capabilities, we simply need to implement the org.vinwiki.action.PagedDataFetcher interface in terms of Hibernate Search. Here is a simple implementation (we'll add more features later):
Search.java source listing

This is sufficient to get us searching quickly.

Next, we need to pass the user's query from the search box to our nav component. In view/WEB-INF/facelets/searchControls.xhtml, update the JSF inputText tag to bind the user's query to an Event-scoped component named searchCriteria:

<h:inputText id="basicSearchQuery" styleClass="searchField" value="#{searchCriteria.basicSearchQuery}"/>

The searchCriteria component is injected into nav, which passes it on to an instance of org.vinwiki.search.Search which implements the PagedDataFetcher interface. You may be thinking that having a separate component for a single input field is overkill, but the searchCriteria component will also come in very handy once we add an advanced search form. Here is what happens in nav:

@In(required=false) private SearchCriteria searchCriteria;
    ...
    public void doSearch() {
        cleanup();
        dataFetcher = new FeatureRichSearch(log, searchCriteria);
        initFirstPageOfItems(); // fetches the first page of Items
    }

Execute Search on Enter

There is the obligatory Search button next to my search input field, but most users will just hit Enter to execute the search. RichFaces makes this trivial to support using the <rich:hotKey> tag:

<rich:hotKey key="return" selector="#basicSearchQuery" 
          handler="#{rich:element('searchBtn')}.onclick();return false;"/>

The <rich:hotKey> tag binds a JavaScript event listener for the search box to submit the form when the user hits "enter". Be sure to use the selector attribute to limit the listener to the search box and not all input text boxes on the page! Any valid jQuery selector will do ...

At this point, re-compile and re-deploy these changes (at the command-line: seam clean reexplode). A search for "zinfandel" results in:
Search results for zinfandel
Hopefully, you get something similar in your environment ;-) Yes, those images are terribly ugly right now ... They were pulled dynamically from Freebase; if this were a real production application, then I'd work on getting better images. Next, we'll add some more features to the search engine, such as query term highlighting, spell correction, more like this, and filters. These additional features are implemented in the org.vinwiki.search.FeatureRichSearch class.

Highlighting Query Terms in Results

Highlighting query terms in results is a common and very useful feature of most search engines. The contrib/highlighter module in Lucene makes this feature easy to implement. There are two aspects to consider when displaying results to the user. First, we want to display the best "fragment" from the document text for the specific query. In other words, rather than just showing the first X characters of the description, we should show the best section of the description matching the user's query. Second, highlight the query terms in the chosen fragment. However, most of the description text for our Wine objects is short enough to display the full description for each result. Based on my current UI layout, 390 is a safe maximum length for fragments as it is long enough to provide useful information about each wine, yet short enough to keep from cluttering up the screen with text. This is something you'll have to work out for your application.
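
The two aspects described above can be illustrated without any Lucene dependency. The following is only a sketch under simplifying assumptions: FragmentSketch, bestFragment, and highlight are hypothetical names, terms are matched as exact lowercase words (no stemming), and a fragment is scored simply by how many query terms it contains. The contrib/highlighter module does considerably more than this.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of "pick the best fragment, then highlight terms".
final class FragmentSketch {

    // Returns the word-aligned window of at most maxLen characters with the most term hits.
    static String bestFragment(String text, Set<String> terms, int maxLen) {
        if (text.length() <= maxLen) {
            return text; // short descriptions are shown in full
        }
        String[] words = text.trim().split("\\s+");
        int bestStart = 0;
        int bestScore = -1;
        for (int start = 0; start < words.length; start++) {
            int len = 0;
            int score = 0;
            for (int i = start; i < words.length; i++) {
                int added = words[i].length() + (i > start ? 1 : 0);
                if (len + added > maxLen) {
                    break;
                }
                len += added;
                if (terms.contains(normalize(words[i]))) {
                    score++;
                }
            }
            if (score > bestScore) { // strict comparison keeps the earliest best window
                bestScore = score;
                bestStart = start;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = bestStart; i < words.length; i++) {
            int added = words[i].length() + (i > bestStart ? 1 : 0);
            if (sb.length() + added > maxLen) {
                break;
            }
            if (i > bestStart) {
                sb.append(' ');
            }
            sb.append(words[i]);
        }
        return sb.toString();
    }

    // Wraps every word that matches a query term in a CSS-styled span.
    static String highlight(String fragment, Set<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String word : fragment.split(" ")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (terms.contains(normalize(word))) {
                sb.append("<span class='termHL'>").append(word).append("</span>");
            } else {
                sb.append(word);
            }
        }
        return sb.toString();
    }

    // Strips punctuation and lowercases so "Tannins," matches the term "tannins".
    static String normalize(String word) {
        return word.replaceAll("[^\\p{L}\\p{Nd}]", "").toLowerCase(Locale.ENGLISH);
    }
}
```

With a 30-character limit and the terms "velvety" and "tannins", the sketch skips the opening of a longer description and returns the trailing window that contains both terms, which is the behavior we want from a fragmenter.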

Highlighting Fragments with Hibernate Search

Recall that we stuff all the searchable text information about a Wine into a single searchable field "content". While good for searching, it's probably not something you want to display to your users directly. Instead, I've chosen to highlight terms in the wine description only. Here is how to construct a Highlighter (from the Lucene org.apache.lucene.search.highlight package):

// Pass the Lucene Query to a Scorer that extracts matching spans for the query
    // and then uses these spans to score each fragment    
    QueryScorer scorer = new QueryScorer(luceneQuery, Wine.DEFAULT_SEARCH_FIELD);
    // Highlight using a CSS style    
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class='termHL'>", "</span>");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, Nav.MAX_FRAGMENT_LEN));

Also notice that I get a reference to the "wine_stem_en" Analyzer using:

Analyzer analyzer = searchFactory.getAnalyzer("wine_stem_en");

While iterating over the results, we pass the Analyzer and the actual description text to the Highlighter. The observant reader will notice that I didn't "stem" the description text during indexing, but now I am stemming the description text for highlighting. You'll see why I'm taking this approach in the next section when I add spell correction.

highlightedText = highlighter.getBestFragment(analyzer, Wine.SPELL_CHECK_SEARCH_FIELD, description);

FeatureRichSearch saves the dynamically highlighted fragments in a Map because fragments are specific to each query. If the query changes, then so must the fragments for the results. The getResultDisplayText method on the nav component interfaces with the FeatureRichSearch dataFetcher to get fragments for search results.

Spell Correction

For spell correction, I'm using the contrib/spellchecker module in Lucene (see http://wiki.apache.org/lucene-java/SpellChecker), which is a good place to start, but you should realize that handling spelling mistakes in search is a complex problem so this is by no means the "best" solution.

Spell Correction Dictionary

To begin, you need a dictionary of correctly spelled terms from which to base spell corrections on. The spellchecker module builds a supplementary index from the terms in your main index using fields based on n-grams of each term. Internally, SpellChecker creates several n-gram fields for each word depending on word length. Here is a screenshot of the word "zinfandel" in the SpellChecker index courtesy of Luke:
SpellChecker Index in Luke
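
To get a feel for why n-grams are useful here, consider this small sketch. It is not the SpellChecker's internal code, and it does not reproduce the field names or gram sizes Lucene chooses; NGramSketch and its methods are hypothetical names used only to show that a misspelled word still shares most of its character n-grams with the intended word.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper showing character n-grams; not Lucene's implementation.
final class NGramSketch {

    // Returns every contiguous substring of length n, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Counts the distinct n-grams that two words have in common.
    static int sharedGrams(String a, String b, int n) {
        Set<String> shared = new HashSet<String>(ngrams(a, n));
        shared.retainAll(ngrams(b, n));
        return shared.size();
    }
}
```

For example, ngrams("zinfandel", 3) produces zin, inf, nfa, fan, and, nde, del, and "strawbarry" shares five of its eight trigrams with "strawberry". A query against the n-gram fields can therefore retrieve "strawberry" as a candidate even though the two words never match as whole terms.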

Recall that we're stemming terms in our default search field "content". If you build the spell checker index from this field, the dictionary will only contain stemmed terms. This has the undesired side effect that the corrected term will look misspelled to the user. For example, if you search for "strawbarry", the spell checker will probably recommend "strawberri", which is good, but we really want to show the user "Did you mean 'strawberry'?". Thus, we need to base our spell checker index off non-stemmed text. When we ask the spell checker to suggest a term for "strawbarry", then it will return "strawberry". When we query the search index, we need to query for "strawberri", which is why I pass the "wine_stem_en" Analyzer to the QueryParser after applying the spell correction process.

Hibernate Search Automatic Indexing and Spell Correction

Lucene's SpellChecker builds a supplementary index from the terms in your main index. If your main index changes, e.g. after adding a new entity, then you need to update the spell correction index. From discussing this within the HS community, there's nothing built into HS to help you determine when to update the spell checker index, especially if you're using hibernate.search.worker.execution=async. In other words, you don't know when Hibernate Search is finished updating the Lucene index. You have a couple of options to consider here: 1) update the spell index incrementally as new content is added (or updated), or 2) update the index periodically in a background job. This depends on your requirements and how much you trust the source of the changes to the main index. For now, I'm using Seam Events to incrementally update the spell index after a new Wine is added or an existing Wine is updated (see src/hot/org/vinwiki/search/IndexHelper.java for details).

Edit Distance as Measure of Similarity Between Terms

Lucene's SpellChecker uses the notion of "edit distance" to measure the similarity between terms. When you call spellChecker.suggestSimilar(wordToRespell, ...), the checker consults an instance of org.apache.lucene.search.spell.StringDistance to score hits from the spell checker index. Here's how a SpellChecker is constructed:

Directory dir = FSDirectory.open(new java.io.File("lucene_index/spellcheck"));
    SpellChecker spell = new SpellChecker(dir);
    spell.setStringDistance(new LevensteinDistance());
    spell.setAccuracy(0.8f);
    String[] suggestions = spell.suggestSimilar(wordToRespell, 10,
                             indexReader, Wine.SPELL_CHECK_SEARCH_FIELD, true);

The LevensteinDistance class implements StringDistance by calculating the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
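
For readers who have not seen the algorithm, here is a self-contained sketch of the classic dynamic-programming edit distance. EditDistanceSketch is a hypothetical class, not the Lucene source. The similarity method assumes the score is normalized to the range 0 to 1 by dividing the distance by the length of the longer string, which is my understanding of how a threshold such as setAccuracy(0.8f) is applied; treat that normalization as an assumption.

```java
// Hypothetical sketch of Levenshtein edit distance; not the Lucene implementation.
final class EditDistanceSketch {

    // Minimum number of single-character insertions, deletions, or substitutions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0f : 1.0f - (float) distance(a, b) / max;
    }
}
```

Under this normalization, "strawbarry" and "strawberry" are one substitution apart over ten characters, so the similarity is 0.9, which clears a 0.8 threshold. Short words are penalized more heavily: a single edit in a four-letter word already drops the similarity to 0.75.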

Spell Correction Heuristics

The spellchecker module does a fine job of suggesting possible terms, but we still need to decide how to handle these suggestions depending on the structure of the user's query. To keep things simple, our first heuristic applies to single term queries where the mis-spelled term does not exist in our spell checker index. In this case, we simply re-issue the search using the best suggestion from the spell checker. Here is a screen shot of how we present the results to the user:
Spell correction for single term

However, we can't be sure that a term is mis-spelled if it exists in the spell checker index. Thus, if a single mis-spelled term exists in the spell correction index, then you need to decide whether to "OR" the mis-spelled and suggested terms together or let the user decide (it seems like Google doesn't always do one or the other, so their implementation is a bit more advanced than the one I'll provide here).

In the screenshot above, the user queried for "velvety tanins"; both terms exist in the spell correction index, but "tanins" is mis-spelled. During query processing, the SpellChecker suggested "tannins" for the mis-spelled word "tanins", which is correct. Thus, we search for "velvety tanins", but also suggest a spell-corrected query "velvety tannins", allowing the user to click on the suggested correction to see better results (hopefully). Please refer to the checkSpelling method in the src/hot/org/vinwiki/search/FeatureRichSearch.java for details on how these simple heuristics are implemented.
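
The decision logic behind these two heuristics can be sketched in a few lines of plain Java. This is not the checkSpelling method from the project; SpellHeuristicSketch, Decision, and decide are hypothetical names, the dictionary Set stands in for the spell checker index, and the suggestions Map stands in for whatever the SpellChecker would return for each term.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two heuristics; the real logic lives in FeatureRichSearch.
final class SpellHeuristicSketch {

    // Which query to execute, plus an optional "Did you mean" suggestion (null if none).
    static final class Decision {
        final String queryToRun;
        final String didYouMean;

        Decision(String queryToRun, String didYouMean) {
            this.queryToRun = queryToRun;
            this.didYouMean = didYouMean;
        }
    }

    static Decision decide(String userQuery, Set<String> dictionary, Map<String, String> suggestions) {
        String[] terms = userQuery.trim().toLowerCase(Locale.ENGLISH).split("\\s+");
        StringBuilder corrected = new StringBuilder();
        boolean changed = false;
        for (String term : terms) {
            String suggestion = suggestions.get(term);
            if (corrected.length() > 0) {
                corrected.append(' ');
            }
            if (suggestion != null && !suggestion.equals(term)) {
                corrected.append(suggestion);
                changed = true;
            } else {
                corrected.append(term);
            }
        }
        if (!changed) {
            return new Decision(userQuery, null); // nothing to correct
        }
        // Heuristic 1: a single term unknown to the dictionary is replaced outright.
        if (terms.length == 1 && !dictionary.contains(terms[0])) {
            return new Decision(corrected.toString(), null);
        }
        // Heuristic 2: otherwise run the user's query and offer the correction as a link.
        return new Decision(userQuery, corrected.toString());
    }
}
```

With "tanins" present in the dictionary, decide("velvety tanins", ...) keeps the original query and proposes "velvety tannins", while decide("strawbarry", ...) silently switches to "strawberry" because the term does not exist in the dictionary at all.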

Type-ahead Search Suggestions

Most users appreciate when an application spares them un-necessary effort, such as typing common phrases into the search box. Thus, we'll use RichFaces <rich:suggestionbox> tag to provide a drop-down suggestion list of known search terms after the user types 3 or more characters. While trivial, this feature can help users be more productive with your search interface and helps minimize spelling mistakes.
Type-ahead Suggestion List

In /WEB-INF/facelets/searchControls.xhtml, we attach the <rich:suggestionbox> to the search input text field using:

<rich:suggestionbox id="suggestionBox" for="basicSearchQuery" ignoreDupResponses="true"
                    immediate="true" limitToList="true" height="100" width="250" minChars="3"
                    usingSuggestObjects="true" suggestionAction="#{nav.suggestSearchTerms}"
                    selfRendered="true" first="0" var="suggestion">
  <h:column><h:outputText value="#{suggestion}"/></h:column>
</rich:suggestionbox>

In the first posting, I discussed queue and traffic flood protection using RichFaces' Global Default Queue. You should definitely activate an AJAX request queue if you are using type-ahead suggestions. I've also added an animated GIF and loading message using the <a4j:status> tag.

On the Java side, I've resorted to a simple LIKE query to the database, but you can also query a field analyzed with Lucene's org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter in your full-text index.
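
If you would rather not hit the database on every keystroke, a third option is to keep the known terms in memory. The sketch below is an assumption-laden alternative, not what nav.suggestSearchTerms does: TypeAheadSketch is a hypothetical class that holds terms in a sorted set and answers prefix lookups, honoring the same three-character minimum as the minChars attribute.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical in-memory prefix lookup; the post itself uses a LIKE query instead.
final class TypeAheadSketch {

    private static final int MIN_CHARS = 3; // mirrors minChars="3" on the suggestion box

    private final TreeSet<String> terms = new TreeSet<String>();

    TypeAheadSketch(Collection<String> knownTerms) {
        for (String term : knownTerms) {
            terms.add(term.toLowerCase(Locale.ENGLISH));
        }
    }

    // Returns up to limit known terms that start with what the user has typed so far.
    List<String> suggest(String typed, int limit) {
        List<String> matches = new ArrayList<String>();
        if (typed == null) {
            return matches;
        }
        String prefix = typed.trim().toLowerCase(Locale.ENGLISH);
        if (prefix.length() < MIN_CHARS) {
            return matches;
        }
        // In a sorted set, all terms sharing a prefix are adjacent, starting at the prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || matches.size() >= limit) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }
}
```

Because matching terms sit next to each other in sorted order, the lookup stops as soon as it sees a term without the prefix. This only makes sense while the term list fits comfortably in memory.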

Advanced Search

Sometimes a single search field isn't enough to pinpoint the information you're looking for; advanced search addresses this need, albeit an uncommon one, for users to formulate complex queries. The advanced search form is application specific, but most allow the user to construct a query composed of AND, OR, exact phrase, and NOT clauses, as shown in the following screen shot:

As mentioned above, the org.vinwiki.search.SearchCriteria class manages the advanced search form data. I'll refer you to the source code for further details about advanced search. Notice that I'm using a RangeQuery to implement the Date field on the search form.
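
One simple way to turn such a form into something Lucene's QueryParser understands is to assemble a query string using its +, -, OR, and quoted-phrase syntax. The sketch below is not the SearchCriteria implementation; AdvancedQuerySketch and its four parameters are hypothetical, and it deliberately skips escaping of special characters, which a real form handler would need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical query-string builder; SearchCriteria in the project may work differently.
final class AdvancedQuerySketch {

    // Each argument is the raw text of one form field and may be null or blank.
    static String build(String allWords, String anyWords, String exactPhrase, String noneWords) {
        List<String> clauses = new ArrayList<String>();
        for (String word : words(allWords)) {
            clauses.add("+" + word); // AND: every word is required
        }
        List<String> any = words(anyWords);
        if (!any.isEmpty()) {
            clauses.add("+(" + String.join(" OR ", any) + ")"); // OR: at least one must match
        }
        if (exactPhrase != null && !exactPhrase.trim().isEmpty()) {
            clauses.add("+\"" + exactPhrase.trim() + "\""); // exact phrase
        }
        for (String word : words(noneWords)) {
            clauses.add("-" + word); // NOT: word is prohibited
        }
        return String.join(" ", clauses);
    }

    // Splits a form field on whitespace, ignoring null and blank input.
    private static List<String> words(String input) {
        List<String> out = new ArrayList<String>();
        if (input == null) {
            return out;
        }
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }
}
```

For instance, build("velvety tannins", "zinfandel syrah", "dry creek", "oak") returns +velvety +tannins +(zinfandel OR syrah) +"dry creek" -oak, which can then be parsed with the same Analyzer used for basic search.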

Testing Search

Be careful when building tests involving Lucene because lib/test/thirdparty-all.jar contains an older version of Lucene. To remedy this, I added the following list of JARs to the top of the test.path path in build.xml:

<fileset dir="${lib.dir}">
    <include name="lucene-core-2.9.2.jar"/>
    <include name="lucene-highlighter-2.9.2.jar"/>
    <include name="lucene-misc-2.9.2.jar"/>
    <include name="lucene-snowball-2.9.2.jar"/>
    <include name="lucene-spatial-2.9.2.jar"/>
    <include name="lucene-spellchecker-2.9.2.jar"/>
    <include name="lucene-memory-2.9.2.jar"/>
    <include name="solr-core-1.4.0.jar"/>
    <include name="solr-solrj-1.4.0.jar"/>
  </fileset>

A new test case was added for testing search, see src/test/org/vinwiki/test/SearchTest.java. While fairly trivial, this class helped me root out a few issues (like updating the spell correction index after an update) while developing the code for this post. So test-driven development shows its worth once again!

Future Enhancements to the Search Engine

There are still a few search-related features I think this application needs, including tagging, phrase handling, and synonyms. Specifically, users should be able to add tags like "jammy" or "flabby" when rating wines. The application should be able to render a tag cloud from these user-supplied tags as another form of navigation. User tags should also be fed into the search index (using caution of course). Phrase detection complements user-supplied tagging by recognizing multi-term tags during text analysis. How you handle stop words also affects exact phrase matching. The contrib/shingles module helps speed up phrase searches involving common terms, so I'd definitely like to investigate its applicability for this application. Lastly, synonyms help supply the search index (and/or search queries) with additional terms that mean the same thing as other words in your documents. If time allows, I'll try to add a fifth post dealing with tags, phrases, and synonyms. I'd also like to hear from the community on other features that might be helpful.

TODO: Evaluating Search Quality with contrib/benchmark

The contrib/benchmark module is one extension all Lucene developers should be familiar with; benchmark allows you to run repeatable performance tests against your index. I'll refer you to the JavaDoc for using benchmark for performance testing. In the future, I'd like to use benchmark's quality package to measure relevance of search results. The classes in the quality package allow us to measure precision and recall for our search engine. Precision is a measure of "exactness" that tells us the proportion of the top N results that are relevant to the search. Recall is a measure of "completeness" that tells us the proportion of all relevant documents included in the top results. In other words, if there are 100 relevant documents for a query and the results include 80 of them, then recall is 0.8. Sometimes, the two metrics conflict with each other in that as you return more results (to increase recall), you can introduce irrelevant documents, which decreases precision. The quality package will give us a way to benchmark precision and recall and then tune Lucene to improve these as much as possible. If I get time, then I'll post my results (and source).
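
The two definitions translate directly into code. The sketch below is independent of the benchmark quality package; RelevanceMetricsSketch is a hypothetical class, results are identified by plain String ids, and the set of relevant ids is assumed to come from manual judgments.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical precision/recall helpers; not part of Lucene's benchmark module.
final class RelevanceMetricsSketch {

    // Fraction of the top n results that are relevant (uses fewer if fewer were returned).
    static double precisionAtN(List<String> ranked, Set<String> relevant, int n) {
        int top = Math.min(n, ranked.size());
        if (top == 0) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < top; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / top;
    }

    // Fraction of all relevant documents that appear anywhere in the results.
    static double recall(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String id : new HashSet<String>(ranked)) { // de-duplicate before counting
            if (relevant.contains(id)) {
                hits++;
            }
        }
        return (double) hits / relevant.size();
    }
}
```

Suppose a query returns w1 through w5 and the relevant wines are w1, w3, w8, and w9. Two of the five results are relevant, so precision at 5 is 0.4, and two of the four relevant wines were found, so recall is 0.5.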

A last word about scalability

While Lucene is very fast, you can run into search performance issues if you are constantly updating the index, which requires Hibernate Search to close and re-open its shared IndexReader. Hibernate Search has built-in support for a Master / Slave architecture where updates occur on a single master server and searches are executed on replicated, read-only slave indexes in your cluster. I use this configuration in my day job and it scales well. However, you'll need a JMS queue and Message-driven EJB to allow slave nodes to send updates to the master node, so you'll have to deploy your application as an EAR instead of a WAR (to deploy the MDB). Please refer to the online documentation provided with Hibernate Search for a good discussion on how to set up a master / slave cluster.

What's Next?

In the next post, I'll add authentication and registration using Facebook Connect.

Monday, May 24, 2010

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

Introduction

This is the first post in a four part series about a wine rating and recommendation Web application, named VinWiki, built using open source technology. The purpose of this series is to document key design and implementation decisions, which may be of interest to anyone wanting to build an intelligent Web application using Java technologies. The end result will not be a 100% functioning Web application, but will have enough functionality to prove the concepts. Specifically, here is a basic roadmap of the concepts covered in each post:

Introduction (May 24, 2010): Covers project setup, primary domain objects, and basic UI constructs such as server-side pagination, dynamic menus, and bookmarkable URLs with JSF / RichFaces / Facelets.
Full-text Search (June 14, 2010): Implements full-text search using Hibernate Search with Lucene 2.9.2.
Integrating with the Web (July 8, 2010): Authentication and registration using Facebook Connect and sharing/liking bookmarkable URLs on Facebook.
Recommendations (Sept 23, 2010): Provide wine recommendations using Apache Mahout.

The idea for VinWiki was born out of two interests in my life: wine and collective intelligence. I'm a wine connoisseur and love red wines from Piedmont (Italy) and the Dry Creek and Alexander Valleys in Sonoma. I don't, however, drink expensive wine. Rather, I'm always on the lookout for that $10-15/bottle wine that just tastes good (who isn't, right?). My strategy is to know a lot about a little. For example, I know the major varietals in Piedmont, their characteristics, which years were good and which were not, etc. I know these things because I want to be able to drink great inexpensive wines and help my friends do the same. Wouldn't it be nice if you had access to all this knowledge in my head the next time you go out for Italian food in North Beach?

One of my other interests is figuring out how to harvest intelligence from the interactions and contributions of users on a Web site and then apply that intelligence to improve the personal experience with the application as well as the application as a whole. The science behind this is called Collective Intelligence. To keep things simple, I consider a Web application to be "intelligent" if it has the following key ingredients:

automatically improves its capabilities as more users contribute to it,
scalable (machine learning algorithms do better with large amounts of data),
mashable (simple Web services that expose data and services to other applications), and
doesn't pretend to be more than it is!

There's a wealth of knowledge available about how to implement the first ingredient using machine learning. And, as we'll see in the 4th post in this series, there is a wonderful open source project named Mahout that provides scalable implementations of many popular machine learning algorithms, such as clustering, classification, and recommendations. However, please keep in mind that it takes rigorous testing and experimentation to ensure these algorithms produce good results for your data. Don't assume that because an algorithm sounds sophisticated it will produce good results in your application. Hence, the last key ingredient! In this application, intelligence comes in the form of making recommendations for wines from previous ratings using collaborative filtering. Presumably, as more users contribute wines and ratings, the application will improve for all users.

Toolset

The solution is built using the following open-source technologies:

JBoss AS 4.2.3
Seam 2.2.0
Hibernate (Core, EntityManager, Annotations, Validator & Search)
JSF 1.2 / RichFaces 3.3.2 SR1 / Facelets
Lucene 2.9.2
Mahout 0.3 / Hadoop 0.20.2

Recommended Reading List

Here are a few excellent books that introduce you to the subject of intelligent Web applications:

Algorithms of the Intelligent Web
Collective Intelligence in Action
Mahout in Action
Programming Collective Intelligence (code samples in Python, but still a great read and Python is a fun scripting language to know anyway ;-)

If you are interested in full-text search, the following two books are invaluable assets:

Basic Requirements

There are three primary activities you can do with this application:

Rate wines you've already tried
Find and read ratings for wines you may like to try
Receive recommendations of new wines you should try based on your previous ratings

These activities equate to the following use cases:

Register New User Account
Basic Navigation: Browse Wine by Region, Date, Tag, or Popularity
View detailed information for a specific wine
Add Rating for an Existing Wine
Add New Wine
Edit Wine
Search for Wine
Share wines and ratings with Facebook friends
View Recommendations

Here is a screen shot of the home page (note: best I could do with the image quality since blogger limits the width to 800).

Domain Objects

The following UML Class Diagram depicts the primary domain objects in our model.
UML Class Diagram for the org.vinwiki.model package

JPA Entity	Description
Region	Encapsulates information about a geographical area that produces wine, such as Bordeaux. Implements the org.vinwiki.model.Composite interface to represent a hierarchical tree of regions and sub-regions.
Varietal	Encapsulates information about a grape variety used to make wine, such as Chardonnay.
VarietalPct	Represents the percentage of a varietal in a specific wine.
Winery	Encapsulates information about a wine producer, such as Robert Mondavi. A Winery may have a latitude and longitude specified, which would allow us to show the winery on a map of wineries in a region.
Wine	Encapsulates information about a specific wine, uniquely identified by name, winery, and vintage (typically the year the grapes were harvested).
User	Encapsulates information about a registered user in the application.
Preferences	Encapsulates user-supplied settings used to personalize the application.
Tag	Holds information for a keyword created by a user for rating a wine.
UserTag	Associates a tag with a user's rating.
Rating	Represents a specific user's rating (score and comments) of a specific wine.

Relational Database Model

Here is the Hibernate model translated into a MySQL database model:
Relational Database Model

Seed Data

Since an application like this isn't very interesting without some real data, I extracted some wine-related entities from Freebase using their RESTful Web Services API, along with some manual tweaking to better fit the desired data model. Specifically, the seed data contains:

104 wine regions / sub-regions
487 grape varietals
772 wineries
2,025 wines

So this should get us started with some real data, but as we'll see below, users will also be able to add new wines as needed.

Persistence Annotations

Most of the JPA annotations in the domain model are straight-forward, so I'll refer you to the source code for further study. However, I'm using a few annotations that warrant further discussion, including:

Hierarchical Regions (composite pattern)

A wine Region can have sub-regions; for example, France has Bordeaux and Alsace. JPA makes modeling composites really easy using @ManyToOne with @JoinColumn:

@ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "parent_region", nullable = true)
    public Region getParentRegion() {
        return this.parentRegion;
    }

Yep! It really is that easy ...
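
Stripped of the JPA annotations, the composite boils down to a parent reference plus a list of children. The following plain-Java sketch is not the Region entity from the project; RegionSketch, path, and countDescendants are hypothetical names chosen to show how a self-referencing parent link is enough to walk the tree in either direction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, annotation-free version of a self-referencing region tree.
final class RegionSketch {

    private final String name;
    private final RegionSketch parent; // null for a top-level region
    private final List<RegionSketch> children = new ArrayList<RegionSketch>();

    RegionSketch(String name, RegionSketch parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this); // keep both sides of the relationship in sync
        }
    }

    // Walks up the parent links to build a breadcrumb such as "France > Bordeaux".
    String path() {
        return parent == null ? name : parent.path() + " > " + name;
    }

    // Walks down the child links to count all sub-regions at any depth.
    int countDescendants() {
        int count = 0;
        for (RegionSketch child : children) {
            count += 1 + child.countDescendants();
        }
        return count;
    }
}
```

Building France with sub-regions Bordeaux and Alsace, and Medoc under Bordeaux, gives the path "France > Bordeaux > Medoc" for the innermost region and a descendant count of 3 for France. The same traversal is what a hierarchical menu needs when it renders regions and sub-regions.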

@javax.persistence.MappedSuperclass

Entities typically share a common set of fields, such as ID, name, description so it is common to encapsulate these common fields into a base class. The javax.persistence.MappedSuperclass annotation designates a class whose mapping information is applied to the entities that inherit from it. As you can see from the class diagram, the org.vinwiki.model.AbstractItemBase class serves as the @MappedSuperclass in our model.

@org.hibernate.annotations.Cache

If you think about it, most of the wine-related entities in this model are primarily read-only in nature. The main user activity in this application is to add ratings to existing wine objects. Thus, the Wine, Winery, Varietal, and Region entities are good candidates for caching in what is called the second-level cache. We use the @org.hibernate.annotations.Cache annotation to define the cache concurrency strategy for each entity we want to cache. I'm using org.hibernate.annotations.CacheConcurrencyStrategy.NONSTRICT_READ_WRITE because the application may need to update these entities occasionally, but in most cases they are used as read-only objects. For more information about cache concurrency strategy, please refer to The Second Level Cache section in the Hibernate Core documentation.

Of course, we also need to enable the second-level cache in persistence.xml. I found it easiest to use Ehcache initially but you should research the other providers, e.g. JBoss Cache, to determine the most appropriate solution for your application. I'll revisit this decision when I tackle clustering this application in the cloud. For now, here are the salient properties specified in persistence.xml:

<property name="hibernate.cache.provider_class" value="net.sf.ehcache.hibernate.EhCacheProvider"/>
  <property name="hibernate.cache.use_query_cache" value="true"/>
  <property name="hibernate.cache.use_second_level_cache" value="true"/>
  <property name="hibernate.cache.region_prefix" value=""/>
  <property name="hibernate.generate_statistics" value="true"/>
  <property name="hibernate.session_factory_name" value="SessionFactories/vinwikiSF"/>

Notice that I'm enabling Hibernate Statistics using the hibernate.generate_statistics property. During startup, I deploy the Hibernate StatisticsService MBean using:

hibernateMBeanName = new ObjectName("Hibernate:type=statistics,application=vinwiki");
    StatisticsService mBean = new StatisticsService();
    mBean.setSessionFactoryJNDIName("SessionFactories/vinwikiSF");
    ManagementFactory.getPlatformMBeanServer().registerMBean(mBean, hibernateMBeanName);

I also use Ehcache as the Seam cache provider, in resources/WEB-INF/components.xml:

<cache:eh-cache-provider/>

This will come in handy when we start displaying tag clouds and other expensive data structures to our users.

Be sure to include ehcache.jar in your deployed-jars.list file!

Solutions

In this section, I walk you through the solutions to each use case described above.

For the first use case, registering a new user account, I leveraged the solution from my previous blog post. We'll see in post #3 in this series how to allow Facebook users to automatically register and authenticate using their Facebook account. However, I did make a few minor improvements to the handling of user preferences so be sure to review the code in the org.vinwiki.user package after reading the aforementioned blog post.

Basic Navigation

There are a number of ways a user can browse for wines in the application, including by region, most recent, best rated, as well as search. Moreover, a user can switch between these mechanisms at any time. The nav component (org.vinwiki.action.Nav) deployed in session scope supports user navigation.

The nav component can be in one of two states during its lifecycle: A) guest mode when #{identity.loggedIn} is false or B) authenticated user mode when #{identity.loggedIn} is true. The same instance passes through both states because Seam does not create a new session component after a user is authenticated.

When a new session is created, we need to generate a default view of the wine data; a list of the Most Recent Wines added to the application seems like a good choice. I use Seam's @Factory annotation to provide the default view (the Seam documentation calls this pull-style MVC):

@Factory("mainMenu")
public void initMainMenu() {
    if (mainMenu == null) {
        mainMenu = buildMainMenu();
    }
    // Ensure the current PagedDataFetcher is in-sync with the current request
    syncDataFetcherWithRequest();
}

To handle the switch from state A to B, nav observes the Identity.EVENT_POST_AUTHENTICATE event:

@Observer(Identity.EVENT_POST_AUTHENTICATE)
public void postAuthenticate(Identity identity) {
    cleanupDataProvider();
    // after login, display the user menu instead of the guest menu
    mainMenu = getMainMenu();
    syncDataFetcherWithRequest();
}

Dynamic RichFaces PanelMenu

One method of navigating is to browse wines by region. Recall that Region implements the org.vinwiki.model.Composite interface so we can compose a hierarchical tree structure of regions. I chose the RichFaces PanelMenu for this example, but you could also adapt the code to build another type of menu. Menu items are dynamic so we need to programmatically build an org.richfaces.component.html.HtmlPanelMenu using the RichFaces API. Because the menu is a form of navigation, the nav component builds the menu. On the home page, we have a <rich:panelMenu> tag with the binding set to "mainMenu".

<a4j:form prependId="false">
  <rich:panelMenu id="mainMenu" binding="#{mainMenu}"/>
</a4j:form>

Seam invokes the initMainMenu method of our nav component to resolve the binding. All guests share the same menu, but each user can have their own menu. You could imagine allowing users to hide specific regions they are not interested in, which will impact the rendering of the main menu. One caveat is that you should manually set the ID for all the UI components your binding creates or you will most likely encounter duplicate ID exceptions, especially after hot deployments.
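
To illustrate the ID caveat, here is a minimal sketch in plain Java that walks a region tree and derives a unique, repeatable component ID for every menu entry from the region's database ID. RegionNode and MenuIdSketch are hypothetical stand-ins, not classes from the VinWiki source, and the ID format is my own assumption. In the real nav component you would pass each derived ID to setId on the RichFaces component created for that region.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Region in the Composite hierarchy.
class RegionNode {
    final long id;
    final String name;
    final List<RegionNode> children = new ArrayList<RegionNode>();

    RegionNode(long id, String name) {
        this.id = id;
        this.name = name;
    }

    // Adds a child and returns this node so calls can be chained.
    RegionNode add(RegionNode child) {
        children.add(child);
        return this;
    }
}

class MenuIdSketch {
    // Regions with children become menu groups, leaves become menu items.
    static String componentId(RegionNode node) {
        String kind = node.children.isEmpty() ? "item" : "group";
        return kind + "_region_" + node.id;
    }

    // Depth-first walk that lists "componentId=label" for the whole tree.
    static List<String> describe(RegionNode root) {
        List<String> out = new ArrayList<String>();
        walk(root, out);
        return out;
    }

    private static void walk(RegionNode node, List<String> out) {
        out.add(componentId(node) + "=" + node.name);
        for (RegionNode child : node.children) {
            walk(child, out);
        }
    }
}
```

Because each ID depends only on the entity ID and on whether the region has children, rebuilding the menu after login or a hot deployment produces the same IDs every time.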

Bookmarkable Navigation URLs

It would be nice if the application allowed users to bookmark a specific filter like "Most Recent" or region like "California". For example, in the screen shot below, notice that the URL in the browser address bar reflects the fact that the user clicked on California in the main menu. The user can bookmark this page in their browser to instantly view the list of Californian wines available in VinWiki.
Bookmarkable URL

Seam makes this possible using URL re-writing and page parameters. My solution is largely based on the Blog example provided in the Seam documentation. When the menu was constructed, I set the action for the California region menu item to be /region.xhtml?r=California. In pages.xml, I link the "r" request parameter to the region property of an event-scoped component bookmarkable, see org.vinwiki.action.BookmarkableRequestInfo. In addition, I use a rewrite rule to make the URL more intuitive (/region/California instead of /region.seam?r=California):

<page view-id="/region.xhtml">
  <rewrite pattern="/region/{r}"/>
  <param name="r" value="#{bookmarkable.region}"/>
</page>

This is where the aforementioned syncDataFetcherWithRequest method comes into play; this method uses the injected bookmarkable component to determine which data to show to the user, which in this case is the list of Californian wines.
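
The decision made by syncDataFetcherWithRequest can be pictured with the following sketch. It is not the actual method from org.vinwiki.action.Nav; RequestInfoSketch, ViewSelector, the filter property, and the returned view keys are all assumptions made for illustration. The idea is simply that a region in the request wins, then a named filter, and otherwise the default Most Recent view applies.

```java
// Hypothetical stand-in for the event-scoped bookmarkable component.
class RequestInfoSketch {
    final String region; // e.g. "California", or null when absent
    final String filter; // e.g. "bestRated", or null when absent

    RequestInfoSketch(String region, String filter) {
        this.region = region;
        this.filter = filter;
    }
}

class ViewSelector {
    static final String DEFAULT_VIEW = "filter:mostRecent";

    // Returns a key naming the data set to show for this request.
    static String select(RequestInfoSketch info) {
        if (info != null && hasText(info.region)) {
            return "region:" + info.region.trim();
        }
        if (info != null && hasText(info.filter)) {
            return "filter:" + info.filter.trim();
        }
        return DEFAULT_VIEW;
    }

    private static boolean hasText(String s) {
        return s != null && s.trim().length() > 0;
    }
}
```

In the application itself, the selected key would correspond to choosing one of the PagedDataFetcher implementations rather than returning a string.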

One drawback to allowing navigation requests to be bookmarked in this manner is that I had to duplicate the contents of view/home.xhtml into filter.xhtml because it seems that adding multiple rewrite patterns on a single page leads to some weird URLs. I'm sure this could be overcome with some work, but since I'm using Facelets, the amount of duplication is minimal (and of course the UI is not final so the filter page may end up being different anyway). In the next posting, I'll show you how to make search results bookmarkable as well.

Server-side Pagination

Server-side pagination is essential for gracefully handling a large number of objects. The basic idea is to display only a small subset of the total results to the user at one time. The key, of course, is to extend the subset concept to the server and load only a small subset of data from the database at a time. Client-side pagination alone will become a major performance issue as the number of objects in your database increases. Thankfully, RichFaces does most of the work for us! Here is a screenshot of VinWiki's scrollable data table backed by server-side pagination:
scrollable data table screenshot

And the corresponding JSF syntax (from view/WEB-INF/facelets/itemTable.xhtml):
scrollable data table JSF syntax

My solution is based largely on the sample code provided with RichFaces (see org.richfaces.demo.extendeddatamodel.AuctionDataModel). Essentially, my <rich:dataTable> interacts with an event-scoped component navDataModel that extends org.ajax4jsf.model.SerializableDataModel. Under the covers, the DataModel component (see org.vinwiki.action.PagedDataModelBase) does everything needed to support AJAX-driven pagination except the actual loading of data. To load data, the DataModel delegates to an org.vinwiki.action.PagedDataProvider, whose implementation is a session-scoped component that loads and caches items from the database.
The number of rows displayed to the user per page comes from the user's preferences.
The RichFaces <rich:datascroller> provides the paging mechanism for our paged data table.
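
To make the paging arithmetic concrete, here is a small sketch of how a page number and a rows-per-page preference translate into a starting row and a page count. PageMath is a hypothetical helper, not part of RichFaces or the VinWiki source; RichFaces performs the equivalent bookkeeping for you inside the data scroller.

```java
// Hypothetical helper showing the arithmetic behind server-side paging.
class PageMath {
    // Zero-based index of the first row shown on a one-based page number.
    static int firstRow(int page, int rowsPerPage) {
        if (page < 1 || rowsPerPage < 1) {
            throw new IllegalArgumentException("page and rowsPerPage must be >= 1");
        }
        return (page - 1) * rowsPerPage;
    }

    // Number of pages needed to show totalRows, rounding up.
    static int pageCount(int totalRows, int rowsPerPage) {
        if (totalRows < 0 || rowsPerPage < 1) {
            throw new IllegalArgumentException("totalRows must be >= 0 and rowsPerPage >= 1");
        }
        return (totalRows + rowsPerPage - 1) / rowsPerPage;
    }
}
```

With 25 rows per page, page 10 starts at row 225, and 101 wines need 5 pages. Only the 25 rows starting at that index have to be loaded from the database.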

Here is a UML class diagram showing the relationship between the DataModel and DataProvider (most of these classes can be used in your project as they have nothing to do with Wine):
UML Class Diagram of server-side pagination

As you might expect, the concrete implementation of org.vinwiki.action.PagedDataProvider is our handy nav component which extends org.vinwiki.action.PagedDataProviderBase. PagedDataProviderBase does most of the work except the actual fetching of a page of data from the database, which is provided by an org.vinwiki.action.PagedDataFetcher. There is a concrete implementation of org.vinwiki.action.PagedDataFetcher for each type of navigation. For example, the org.vinwiki.action.FetchRegion class provides Wine objects for a specific region to the data provider. Notice that my current PagedDataFetcher implementations are *not* Seam components and expect the EntityManager and User ID to be passed to them. I chose this approach to make it really easy to build new types of navigation queries; all you have to do is provide a query to return a page of data and a query to return the total number of items available to the current user.
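
The two-query contract described above might look roughly like the following. This is a sketch under stated assumptions: PageFetcherSketch and InMemoryFetcher are hypothetical names, the method signatures are not those of org.vinwiki.action.PagedDataFetcher, and an in-memory list stands in for the database so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical contract: one call for a page of rows, one for the total count.
interface PageFetcherSketch<T> {
    List<T> fetchPage(int firstRow, int maxRows);
    int countAll();
}

// Stand-in implementation backed by a list. A database-backed fetcher would
// instead run a JPA query using setFirstResult(firstRow) and
// setMaxResults(maxRows), plus a separate count query.
class InMemoryFetcher<T> implements PageFetcherSketch<T> {
    private final List<T> rows;

    InMemoryFetcher(List<T> rows) {
        this.rows = new ArrayList<T>(rows);
    }

    public List<T> fetchPage(int firstRow, int maxRows) {
        if (firstRow < 0 || maxRows <= 0 || firstRow >= rows.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(firstRow + maxRows, rows.size());
        return new ArrayList<T>(rows.subList(firstRow, end));
    }

    public int countAll() {
        return rows.size();
    }
}
```

Keeping the contract this small is what makes a new navigation type cheap to add: each one only has to supply its own page query and count query.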

The PagedDataProviderBase maps each starting index to a List of Item objects. A Map is a good solution if you are using the <rich:datascroller> because the user can jump from page 1 to 10 without going through pages 2-9 first. If you're using a sequential paging mechanism then a simple List should suffice. Of course the Map could grow very big if the user visits many pages in the scroller. I'll leave it as an exercise for the reader to solve using SoftReferences.
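
For readers who want a head start on that exercise, here is one possible shape for it. This is a minimal sketch, not code from the VinWiki project; SoftPageCache and its methods are hypothetical. Each page is held through a java.lang.ref.SoftReference, so the JVM is free to reclaim cached pages when memory runs low, and a reclaimed page simply looks like a cache miss.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: caches pages by starting row index, holding each page
// through a SoftReference so the JVM may reclaim it under memory pressure.
// Not thread-safe; add synchronization if the cache is shared across threads.
class SoftPageCache<T> {
    private final Map<Integer, SoftReference<List<T>>> pages =
            new HashMap<Integer, SoftReference<List<T>>>();

    // Store a page under the index of its first row.
    void put(int startIndex, List<T> page) {
        pages.put(startIndex, new SoftReference<List<T>>(page));
    }

    // Returns the cached page, or null if it was never cached or has been
    // reclaimed by the garbage collector; the caller then re-fetches it.
    List<T> get(int startIndex) {
        SoftReference<List<T>> ref = pages.get(startIndex);
        if (ref == null) {
            return null;
        }
        List<T> page = ref.get();
        if (page == null) {
            pages.remove(startIndex); // drop the stale entry
        }
        return page;
    }

    // Discard everything, e.g. when the user switches to another view.
    void clear() {
        pages.clear();
    }
}
```

The data provider would call get first and fall back to the fetcher on null, storing the freshly loaded page with put.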

View Wine Details

Assuming the user finds a wine of interest while browsing the application, they may want to view more information for the wine as well as scroll through other users' ratings.
wine details screenshot

You didn't know the Rat Pack liked wine and wrote Latin did you ;-)

There are a number of possible activities the wine details view can offer the user, including:

Add / edit rating
View a list of similar wines (MoreLikeThis)
Contextual display Ads
Navigation controls to view the next wine in the results
Edit information about the wine itself

Consequently, it makes sense to implement a new page for viewing wine details ( view/wine.xhtml ). From the home page, we take the user to the wine details page using a Seam <s:link> tag, which allows us to pass the Wine ID as a request parameter:

<s:link view="/wine.xhtml" value="#{item.fullName}" style="font-size:14px;">
  <f:param name="wid" value="#{item.id}"/>
</s:link>

In pages.xml, I map a request parameter to the wineId property of the viewWine component:

<page view-id="/wine.xhtml">
  <rewrite pattern="/wine/{wid}"/>
  <param name="wid" value="#{viewWine.wineId}"/>
</page>

Notice that I'm also re-writing the URL. Linking the "wid" parameter to #{viewWine.wineId} begins a new conversation for viewing the wine:

@Begin(flushMode = FlushModeType.MANUAL, join = true)
public void setWineId(Long wineId) {
    ...
}

There are a number of ways to begin a conversation so be sure to research the best approach for your application in the Seam documentation.

The wine details page can be bookmarked as well. In this case, however, I'm using the push-style MVC approach discussed in the Seam documentation. I had to get clever with the Home link on the details page because if the user comes through a bookmark, history.go(-1) won't work for returning the user to the VinWiki home page. Thus, I rely on an AJAX call to end the conversation and then invoke the backToHome JavaScript function:

<a4j:commandLink value="#{messages.backToHome}" action="#{viewWine.endViewWine()}" oncomplete="backToHome()" style="font-size:14px;"/>

From view/layout/template.xhtml:

function backToHome() {
    var currentHost = document.location.protocol+"//"+document.location.host;
    if (document.referrer && document.referrer.indexOf(currentHost) === 0) {
        history.back();
    } else {
        window.location.replace("/vinwiki/");
    }
}

Add Rating for Existing Wine

There are two ways a user can add a rating:

lookup a wine on-the-fly from the home page, or
view details for a specific wine and then add the rating from the wine details view.

Rating with on-the-fly Lookup

For authenticated users, the header menu on the home page includes a link to "Add Rating". Clicking on the Add Rating link triggers the #{ratingHome.beginRatingWine()} action using AJAX. When the action completes, the addRatingPanel is shown to the user.
Add rating from Home Page, requires lookup of wine on-the-fly.

From view/WEB-INF/facelets/headerControls.xhtml

<a4j:commandLink value="Add Rating" action="#{ratingHome.beginRatingWine()}" reRender="addRatingPanel"
            oncomplete="#{rich:component('addRatingPanel')}.show()" styleClass="hdrLink"/>

The conversational ratingHome component (org.vinwiki.action.RatingHome based on Seam's Application Framework) manages adding and editing Rating objects by users.

So now let's see how on-the-fly lookup works ... from view/WEB-INF/facelets/addRating.xhtml
JSF syntax for auto-complete on wine field.

Decorate the form field as a required field using Seam's <s:decorate> tag and a Facelets template. If a validation error occurs, Seam will decorate the field with error information.
When the "onblur" event fires on the input field, send an AJAX request to the server to determine if the user selected a known wine.
Attach a RichFaces suggestion box <rich:suggestionbox> to the input field to auto-complete the wine name as the user types. On the server, the RichFaces suggestion box invokes the #{ratingHome.autoCompleteWine} action. I'll discuss the RichFaces request queue in more detail below, but for now, you should realize that it is dangerous to have an auto-complete component without some sort of request flood control in place.

So assuming the user has successfully filled-in the rating form, let's see what happens when the form is submitted:
JSF syntax to submit the add rating form.

There's a fair amount of complex machinery going on in this one tag! Let's analyze it step-by-step:

A RichFaces <a4j:commandButton> submits the form using AJAX to invoke the #{ratingHome.saveRating} action within the same conversation started when the user clicked on Add Rating.
The reRender attribute tells RichFaces to re-render the addRatingPanel and itemTable components in the component tree and update them in the browser DOM after the AJAX response is completed. This ensures that the panel will display any errors that occur on the server.
If there are no errors, close the modal panel. Notice that I'm calling the hasJsfErrMsg JavaScript function, which returns true if there are any error messages queued in the FacesContext. The hasJsfErrMsg function is defined in my main Facelets layout template (view/layout/template.xhtml) and relies on an <a4j:outputPanel> to update the value of a hidden field with ID 'jsfMsgMaxSev' on every AJAX request. The value for this field is pulled from the FacesContext.getMaximumSeverity() method.
```
<a4j:outputPanel ajaxRendered="true">
  <a4j:form prependId="false">
    <h:inputHidden id="jsfMsgMaxSev" value="#{jsf.maxMsgSev}"/>
  </a4j:form>
</a4j:outputPanel>
```

Rating from Details

When a user views a wine, a new conversation is started with the viewWine component.
Add rating from wine details, wine to rate is already selected.

Recall that we are already in a conversation when viewing the wine. Thus, the Add Rating operation will occur in a nested conversation using the ratingHome component.

@Begin(nested=true, flushMode=FlushModeType.MANUAL)
public void beginRatingWine(Wine currentWine) {
    // attach the objects needed to create a rating to the
    // extended PersistenceContext for this conversation
    user = getEntityManager().merge(currentUser);
    wine = getEntityManager().find(Wine.class, currentWine.getId());
    updateStateForWine();
    info("User {0} has started rating wine {1} in NESTED conversation {2}.", user.getUserName(), wine.getFullName(), Conversation.instance().getId());
}

Upon starting the nested conversation, you need to "attach" the User and Wine objects to the extended PersistenceContext (entityManager).

Add New Wine

What type of wiki would this be if users could not add new content on-the-fly? Of course, this application is not a full-featured wiki (see the wiki project in the Seam examples), but any authenticated user can add a new Wine (and dependent objects like Winery and Region) to the application. The wineHome component does most of the heavy lifting for adding a new Wine to the system, see org.vinwiki.action.WineHome. WineHome provides persistence operations for Wine entities by extending org.jboss.seam.framework.EntityHome from the Seam Application Framework. The implementation is mostly uninteresting from a re-use perspective, with the exception of being a @Factory for a component named wine.

@Factory("wine")
public Wine initWine() {
    return getInstance();
}

This allows EL expressions in /admin.xhtml to refer to the new Wine object simply as wine, such as #{wine.type}. The wine is managed in conversation scope along with wineHome.

We'll need to address the threat of spam and cross-site scripting (XSS) hacks before we can release this application on the Web, which I'll address in post #3.

Edit Wine

Edit wine also uses the wineHome component and /admin.xhtml.

So that about covers the specific use cases for this posting. Now, let's look at some specific UI implementation details that will help you build better apps with Seam and RichFaces.

Other Useful UI Implementation Details

Facelets

I'm using Facelets for templating and UI component re-use. The main layout template is view/layout/template.xhtml. Each view references this template in the root Facelets <ui:composition ... template="layout/template.xhtml"> element.

RichFaces Semantic Layouts

I've started experimenting with the semantic layouts support provided by RichFaces (see view/layout/template.xhtml). I think the layouts help reduce the CSS you need to manage for the application, so they're definitely worth checking out.

RichFaces ModalPanel, Forms, and Seam Conversations

The RichFaces ModalPanel component is an excellent tool for allowing the user to perform a quick operation without losing their current context with your application. For example, the Add Rating ModalPanel allows the user to add a rating to the wine they are currently viewing without going to a separate page. Unfortunately, ModalPanel can also be a major source of frustration if not handled correctly, especially when working with an AJAX-driven form on the panel. In this section, I hope to spare you some of that frustration by tackling some of the troublesome areas you may encounter when working with ModalPanels.

Opening the ModalPanel (and beginning a Conversation)

For the complete code for this section, please see view/WEB-INF/facelets/userPreferences.xhtml.

JSF syntax to open a ModalPanel.

Here are some tips to keep in mind when opening a ModalPanel within a conversation:

The reRender attribute should include the ID of the panel you are opening. This ensures the UI components in the panel reflect the state of the conversation.
When the action completes, use JavaScript to show the panel via oncomplete="#{rich:component('prefPanel')}.show()"

In the action handler on the server, be sure to load any entities needed by the ModalPanel into the extended PersistenceContext for the conversation:

// Out-ject the User object that we're updating, vs. the one that came in ...
@Out protected User updatingUser = null;

// Just a handy reference to the preferences object we're updating ...
@Out protected Preferences updatingPrefs = null;

@Begin(flushMode = FlushModeType.MANUAL, join = true)
public void beginEditPreferences() {
    // load the User entity to edit preferences for into the extended PC for this conversation
    updatingUser = getEntityManager().find(User.class, currentUser.getId());
    updatingPrefs = updatingUser.getPreferences();
    initDob();
}

Processing the Form on the ModalPanel (and ending the Conversation)

So now you have a visible ModalPanel with a form displayed, within a Seam conversation. When the user submits the form, we need to validate the input and display any errors that occur. If no errors occur, then close the panel when the action completes.

JSF syntax to process a form on ModalPanel.

The action handler should end the conversation only if it completes successfully. So be careful with using @End with AJAX action handlers of type void. I prefer to use Conversation.instance().end() when dealing with AJAX actions as it makes it more explicit when the conversation ends.
Be sure to re-render the form after submit so validation errors are displayed correctly.
As mentioned previously, use a simple JavaScript function to determine if there are FacesMessages with severity ERROR or greater queued in the FacesContext; close the panel if there are no errors.

Queue and Traffic Flood Protection using the Global Default Queue

I've enabled the global queue for this application in web.xml:

<context-param>
  <param-name>org.richfaces.queue.global.enabled</param-name>
  <param-value>true</param-value>
</context-param>

Project Setup

Building

You can download the project from here.
The project should build with minimal effort using Seam 2.2.0 with Java 6 from the command-line or Eclipse (I used v. 3.4.2). Once deployed to your JBoss 4.2.3 server, you can register or login as "vinwiki" with password "S00ner$1". However, you should also register your own account to see how the registration system works.

Database Setup

I've included a mysqldump file (vinwiki.sql) in the root directory which contains some wine related objects pulled from Freebase. To install this database in your MySQL 5.1.x database, do the following:

mysql> create database vinwiki;
mysql> grant all privileges on vinwiki.* to 'vinwiki'@'localhost' identified by 'vinwiki';
mysql> use vinwiki;
mysql> source vinwiki.sql;

Testing

The tests are configured to run against a MySQL database named vinwiki_test so that tests do not affect the development database. The JDBC connection parameters are specified in: bootstrap/deploy/hsqldb-ds.xml. From the MySQL command-line, do the following:

mysql> create database vinwiki_test;
mysql> grant all privileges on vinwiki_test.* to 'vinwiki'@'localhost' identified by 'vinwiki';

What's Next?

So by now you should have a good understanding of the basic structure of this application and be able to build it and run the unit tests. Please continue on to the second post which covers Hibernate Search and Lucene to help users find wines of interest.
