Monday, June 14, 2010

VinWiki Part 2: Full-text Search with Hibernate Search and Lucene

Introduction

This is the second post in a four part series about a wine recommendation Web application built using open source Java technology. The purpose of this series is to document key design and implementation decisions that can be applied to other Web applications. Please read the first post for an overview of the project. You can download the project (with source) from here.

In this posting, I leverage Hibernate Search and Lucene to provide full-text search for wines of interest. This is post is merely an attempt to complement the already existing documentation on Lucene and Hibernate Search. If you are familiar with Hibernate and are new to Lucene, then I recommend starting out with the online manual for Hibernate Search 3.1.1 GA. In addition, I highly recommend reading Hibernate Search in Action and Lucene in Action (Second Edition); both are extremely well-written books by experts in each field.

Basic Requirements
Full-text search allows users to find wines using simple keywords, such as "zinfandel" or "dry creek valley". Users have come to expect certain features from a full-text search engine. The following table summarizes the key features our wine search engine will support:
FeatureDescription
Paginated search results
(see first post for solution)
Return the first 5-10 results, ordered by relevance on the first page of the search results. Allow users to retrieve more results by advancing the page navigator.
Allows users to set a bounded preference for how many items to show on earch page.
Query term highlighting
(view solution)
Highlight query terms in search results to allow users to quickly scan the results for the most relevant items.
Spell correction
(view solution)
Ability to detect and correct for misspelled terms in the query.
Type-ahead suggestions
(view solution)
Show a drop-down selection box containing suggested search terms after the user has typed at least 3 characters.
Advanced search form
(view solution)
Allow users to fine-tune the search with an advanced search form.

Setup
From the previous post, you'll recall that I'm using Hibernate to manage my persistent objects. As such, I've chosen to leverage Hibernate Search (referred to as HS hereafter) to integrate my Hibernate-based object model with Lucene. For now, I'm using Hibernate Search v. 3.1.1 GA with Hibernate Core 3.3.1 GA. On the Lucene side, I'm using version 2.9.2 and a few classes from Solr 1.4. In the last posting in this series, we'll upgrade to the latest Hibernate Search and Core, which rely on JPA 2 and thus will require a move up to JBoss 6.x. As far as I know, the latest HS and Core classes have trouble on JBoss 4.2 and 5.1 because of their dependency on JPA 2 (let me know if you get it working).

Why not use the database to search?
Some may be wondering why I'm using Lucene instead of doing full-text search in my database? While possible, I feel the database is not the best tool for full-text searching. For an excellent treatment of the mis-match between relational databases and full-text search, I recommend reading Hibernate Search in Action. The first chapter provides a strong case for using Lucene instead of your database for full-text search. Just in terms of scalability, the database is the most expensive and complex layer to scale in most applications. Offloading searches to a local Lucene running on each node in your cluster reduces the amount of work your database is performing and ensures searches return quickly. Moreover, distributing a Lucene index across a cluster of app servers is almost trivial since Hibernate Search has built-in clustering support.

Project Configuration Changes
The following JAR files need to be included in your Web application's LIB directory (/WEB-INF/lib):
hibernate-search.jar
    lucene-core-2.9.2.jar
    lucene-highlighter-2.9.2.jar
    lucene-misc-2.9.2.jar
    lucene-snowball-2.9.2.jar
    lucene-spatial-2.9.2.jar
    lucene-spellchecker-2.9.2.jar
    lucene-memory-2.9.2.jar
    solr-core-1.4.0.jar
    solr-solrj-1.4.0.jar

How does Seam know to use Hibernate Search's FullTextEntityManager?
Seam's org.jboss.seam.persistence.HibernatePersistenceProvider automatically detects if Hibernate Search is available on the classpath. If HS is available, then Seam uses the org.jboss.seam.persistence.FullTextEntityManagerProxy instead of the default EntityManagerProxy, meaning that you will have access to a FullTextEntityManager wherever you have a Seam in-jected EntityManager. You also need to add a few more properties to your persistence deployment descriptor (resources/META-INF/persistence-*.xml):
<property name="hibernate.search.default.directory_provider" value="org.hibernate.search.store.FSDirectoryProvider"/>
  <property name="hibernate.search.default.indexBase" value="lucene_index"/>
  <property name="hibernate.search.reader.strategy" value="shared"/>
  <property name="hibernate.search.worker.execution" value="sync"/>
We'll make some adjustments to these settings as the project progresses, but these will suffice for now.

Indexing
The first step in providing full-text search capabilities is to index the content you want to search. In most cases, our users will want to find Wine objects and to a lesser extent Winery objects. Let's start by indexing Wine objects.

Indexing Wine Objects
I'll tackle indexing Wine objects in a few passes, progressively adding features, so let's start with the basics. First, we tell HS to index Wine objects using the @Indexed annotation on the class. As for what to search, I tend to favor using a single field to hold all searchable text for each object because it simplifies working with other Lucene extensions, such as the More Like This and term highlighting features.
@Field(name=DEFAULT_SEARCH_FIELD, index=Index.TOKENIZED)
    @Analyzer(definition="wine_stem_en")
    public String getContent() {
        // return one string containing all searchable text for this object
    }
Notice that the content field is Index.TOKENIZED during indexing. This means that the String value returned by the getContent method will be broken into a stream of tokens using a Lucene Analyzer; each token returned by the Analyzer will be searchable. Your choice of Analyzer depends on the type of text you are processing, which in our case is English text provided by our users when creating Wine objects. Hibernate Search leverages the text analysis framework provided by Solr. You can specify your Analyzer using the @Analyzer annotation. Behind the scenes, Hibernate Search creates the Analyzer using the Factory defined by an @AnalyzerDef class annotation. In our case
@AnalyzerDef(name = "wine_stem_en",
        tokenizer = @TokenizerDef(factory = org.vinwiki.search.Lucene29StandardTokenizerFactory.class),
        filters = {
          @TokenFilterDef(factory = StandardFilterFactory.class),
          @TokenFilterDef(factory = LowerCaseFilterFactory.class),
          @TokenFilterDef(factory = StopFilterFactory.class),
          @TokenFilterDef(factory = EnglishPorterFilterFactory.class)
    })
This looks more complicated than it is ... Let's take a closer look at the @AnalyzerDef annotation:
  name = "wine_stem_en"
This gives our analyzer definition a name so we can refer to it in the @Analyzer annotation. This will also come in handy when we start parsing user queries using Lucene's QueryParser, which also requires an Analyzer.
  tokenizer = 
  @TokenizerDef(factory = org.vinwiki.search.Lucene29StandardTokenizerFactory.class)
Specify a factory for Tokenizer instances. In this case, I'm supplying a custom factory class that creates a Lucene 2.9.x StandardTokenizer as the default factory provided by Solr creates a Lucene 2.4 StandardTokenizer (most likely for backwards compatibility).
  filters = {
    @TokenFilterDef(factory = StandardFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = StopFilterFactory.class),
    @TokenFilterDef(factory = EnglishPorterFilterFactory.class)
  }
A tokenizer can pass tokens through a filter-chain to perform various transformations on each token. In this case, we're passing the tokens through four filters, in the order listed above.
  1. StandardFilter: Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e., those produced by StandardTokenizer or equivalent.
  2. LowerCaseFilter: Lowercases letters in tokens
  3. StopFilter: Discards common English words like "an" "the" and "of".
  4. EnglishPorterFilter: Extracts the stem for each token using Porter Stemmer implemented by the Lucene Snowball extension.
You can index other fields, such as the wine type and style using the @Field annotation.
@Field(name="type",index=Index.UN_TOKENIZED,store=Store.YES)    
    public WineType getType() {
        return this.type;
    }

Building the Index
If you worked through the first post, then you should already have a database containing 2,025 wines. Hibernate Search will automatically index new Wine objects for us when the insert transaction commits, but we need to manually index the existing wines. During startup, the org.vinwiki.search.IndexHelper component counts the number of documents in the index and re-builds the index from the objects in the database if needed. During startup, you should see log output similar to:
[IndexHelper] Observed event org.vinwiki.event.REBUILD_INDEX from Thread QuartzScheduler1_Worker-8
[IndexHelper] Re-building Wine index for 2025 objects.
[IndexHelper] Flushed index update 300 from Thread Quartz ...
[IndexHelper] Flushed index update 600 from Thread Quartz ...
[IndexHelper] Flushed index update 900 from Thread Quartz ...
[IndexHelper] Flushed index update 1200 from Thread Quartz ...
[IndexHelper] Flushed index update 1500 from Thread Quartz ...
[IndexHelper] Flushed index update 1800 from Thread Quartz ...
[IndexHelper] Took 4094 (ms) to re-build the index containing 2025 documents.
Alright, with a few lines of code and some clever annotations, we now have a full-text search index for Wine objects. Let's do some searching!

Basic Search Box
From the first post, we already have a server-side pagination framework in place for displaying Wine objects. To integrate full-text search capabilities, we simply need to implement the org.vinwiki.action.PagedDataFetcher interface in terms of Hibernate Search. Here is a simple implementation (we'll add more features later):
Search.java source listing
This is sufficient to get us searching quickly.

Next, we need to pass the user's query from the search box to our nav component. In view/WEB-INF/facelets/searchControls.xhtml, update the JSF inputText tag to bind the user's query to an Event-scoped component named searchCriteria:

<h:inputText id="basicSearchQuery" styleClass="searchField" value="#{searchCriteria.basicSearchQuery}"/>
The searchCriteria component is in-jected into nav, which passes it on to an instance of org.vinwiki.search.Search which implements the PagedDataFetcher interface. You may be thinking that having a separate component for a single input field is over-kill, but the searchCriteria component will also come in very handy once we add an advanced search form. Here is what happens in nav:
@In(required=false) private SearchCriteria searchCriteria;
    ...
    public void doSearch() {
        cleanup();
        dataFetcher = new FeatureRichSearch(log, searchCriteria);
        initFirstPageOfItems(); // fetches the first page of Items
    }

Execute Search on Enter
There is the obligatory Search button next to my search input field, but most user's will just hit enter to execute the search. RichFaces makes this trivial to support using the <rich:hotKey> tag:
<rich:hotKey key="return" selector="#basicSearchQuery" 
          handler="#{rich:element('searchBtn')}.onclick();return false;"/>
The <rich:hotKey> tag binds a JavaScript event listener for the search box to submit the form when the user hits "enter". Be sure to use the selector attribute to limit the listener to the search box and not all input text boxes on the page! Any valid jQuery selector will do ...

At this point, re-compile and re-deploy these changes (at the command-line: seam clean reexplode). A search for "zinfandel" results in:
Search results for zinfandel
Hopefully, you get something similar in your environment ;-) Yes, those images are terribly ugly right now ... They were pulled dynamically from Freebase; if this were a real production application, then I'd work on getting better images. Next, we'll add some more features to the search engine, such as query term highlighting, spell correction, more like this, and filters. These additional features are implemented in the org.vinwiki.search.FeatureRichSearch class.

 

Highlighting Query Terms in Results
Highlighting query terms in results is a common and very useful feature of most search engines. The contrib/highlighter module in Lucene makes this feature easy to implement. There are two aspects to consider when displaying results to the user. First, we want to display the best "fragment" from the document text for the specific query. In other words, rather than just showing the first X characters of the description, we should show the best section of the description matching the user's query. Second, highlight the query terms in the chosen fragment. However, most of the description text for our Wine objects is short enough to display the full description for each result. Based on my current UI layout, 390 is a safe maximum length for fragments as it is long enough to provide useful information about each wine, yet short enough to keep from cluttering up the screen with text. This is something you'll have to work out for your application.

Highlighting Fragments with Hibernate Search
Recall that we stuff all the searchable text information about a Wine into a single searchable field "content". While good for searching, it's probably not something you want to display to your users directly. Instead, I've chosen to highlight terms in the wine description only. Here is how to construct a Highlighter (from the Lucene org.apache.lucene.search.highlight package):
// Pass the Lucene Query to a Scorer that extracts matching spans for the query
    // and then uses these spans to score each fragment    
    QueryScorer scorer = new QueryScorer(luceneQuery, Wine.DEFAULT_SEARCH_FIELD);
    // Highlight using a CSS style    
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class='termHL'>", "</span>");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, Nav.MAX_FRAGMENT_LEN));
Also notice that I get a reference to the "wine_stem_en" Analzyer using:
Analyzer analyzer = searchFactory.getAnalyzer("wine_stem_en");
While iterating over the results, we pass the Analyzer and the actual description text to the Highlighter. The observant reader will notice that I didn't "stem" the description text during indexing, but now I am stemming the description text for highlighting. You'll see why I'm taking this approach in the next section when I add spell correction.
highlightedText = highlighter.getBestFragment(analyzer, Wine.SPELL_CHECK_SEARCH_FIELD, description);
FeatureRichSearch saves the dynamically highlighted fragments in a Map because fragments are specific to each query. If the query changes, then so must the fragments for the results. The getResultDisplayText method on the nav component interfaces with the FeatureRichSearch dataFetcher to get fragments for search results.

 

Spell Correction
For spell correction, I'm using the contrib/spellchecker module in Lucene (see http://wiki.apache.org/lucene-java/SpellChecker), which is a good place to start, but you should realize that handling spelling mistakes in search is a complex problem so this is by no means the "best" solution.

Spell Correction Dictionary
To begin, you need a dictionary of correctly spelled terms from which to base spell corrections on. The spellchecker module builds a supplementary index from the terms in your main index using fields based on n-grams of each term. Internally, SpellChecker creates several n-gram fields for each word depending on word length. Here is a screenshot of the word "zinfandel" in the SpellChecker index courtesy of Luke:
SpellChecker Index in Luke
Recall that we're stemming terms in our default search field "content". If you build the spell checker index from this field, the dictionary will only contain stemmed terms. This has un-desired side-effect that the corrected term will look misspelled to the user. For example, if you search for "strawbarry", the spell checker will probably recommend "strawberri", which is good, but we really want to show the user "Did you mean 'strawberry'?". Thus, we need to base our spell checker index off non-stemmed text. When we ask the spell checker to suggest a term for "strawbarry", then it will return "strawberry". When we query the search index, we need to query for "strawberri", which is why I pass the "wine_stem_en" Analyzer to the QueryParser after applying the spell correction process.

Hibernate Search Automatic Indexing and Spell Correction
Lucene's SpellChecker builds a supplementary index from the terms in your main index. If your main index changes, e.g. after adding a new entity, then you need to update the spell correction index. From discussing this within the HS community, there's nothing built into HS to help you determine when to update the spell checker index, especially if you're using hibernate.search.worker.execution=async. In other words, you don't know when Hibernate Search is finished updating the Lucene index. You have a couple of options to consider here: 1) update the spell index incrementally as new content is added (or updated), or 2) update the index periodically in a background job. This depends on your requirements and how much you trust the source of the changes to the main index. For now, I'm using Seam Events to incrementally update the spell index after a new Wine is added or an existing Wine is updated (see src/hot/org/vinwiki/search/IndexHelper.java for details).

Edit Distance as Measure of Similarity Between Terms
Lucene's SpellChecker uses the notion of "edit distance" to measure the similarity between terms. When you call spellChecker.suggestSimilar(wordToRespell, ...), the checker consults an instance of org.apache.lucene.search.spell.StringDistance to score hits from the spell checker index. Here's how a SpellChecker is constructed:
Directory dir = FSDirectory.open(new java.io.File("lucene_index/spellcheck"));
    SpellChecker spell = new SpellChecker(dir);
    spell.setStringDistance(new LevensteinDistance());
    spell.setAccuracy(0.8f);
    String[] suggestions = spell.suggestSimilar(wordToRespell, 10,
                             indexReader, Wine.SPELL_CHECK_SEARCH_FIELD, true);
The LevensteinDistance class implements StringDistance by calculating the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Spell Correction Heuristics
The spellchecker module does a fine job of suggesting possible terms, but we still need to decide how to handle these suggestions depending on the structure of the user's query. To keep things simple, our first heuristic applies to single term queries where the mis-spelled term does not exist in our spell checker index. In this case, we simply re-issue the search using the best suggestion from the spell checker. Here is a screen shot of how we present the results to the user:
Spell correction for single term
However, we can't be sure that a term is mis-spelled if it exists in the spell checker index. Thus, if a single mis-spelled term exists in the spell correction index, then you need to decide if you are going to just "OR" the term mis-spelled and suggested terms together or let the user decide (it seems like Google doesn't always do one or the other, so their implementation is a bit more advanced than the one I'll provide here).

In the screenshot above, the user queried for "velvety tanins"; both terms exist in the spell correction index, but "tanins" is mis-spelled. During query processing, the SpellChecker suggested "tannins" for the mis-spelled word "tanins", which is correct. Thus, we search for "velvety tanins", but also suggest a spell-corrected query "velvety tannins", allowing the user to click on the suggested correction to see better results (hopefully). Please refer to the checkSpelling method in the src/hot/org/vinwiki/search/FeatureRichSearch.java for details on how these simple heuristics are implemented.

 

Type-ahead Search Suggestions
Most users appreciate when an application spares them un-necessary effort, such as typing common phrases into the search box. Thus, we'll use RichFaces <rich:suggestionbox> tag to provide a drop-down suggestion list of known search terms after the user types 3 or more characters. While trivial, this feature can help users be more productive with your search interface and helps minimize spelling mistakes.
Type-ahead Suggestion List
In /WEB-INF/facelets/searchControls.xhtml, we attach the <rich:suggestionbox> to the search input text field using:
<rich:suggestionbox id="suggestionBox" for="basicSearchQuery" ignoreDupResponses="true"
                    immediate="true" limitToList="true" height="100" width="250" minChars="3"
                    usingSuggestObjects="true" suggestionAction="#{nav.suggestSearchTerms}"
                    selfRendered="true" first="0" var="suggestion">
  <h:column><h:outputText value="#{suggestion}"/></h:column>
</rich:suggestionbox>
In the first posting, I discussed queue and traffic flood protection using RichFace's Global Default Queue. You should definitely active an AJAX request queue if you are using type-ahead suggestions. I've also added an animated GIF and loading message using the <a4j:status> tag.

On the Java side, I've resorted to a simple LIKE query to the database, but you can also query a field analyzed with Lucene's org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter in your full-text index.

 

Advanced Search
Sometimes a single search field isn't enough to pin-point the information you're looking for; advanced search addresses this, albeit uncommon, need for users to formalize complex queries. The advanced search form is application specific but most allow the user construct a query composed of AND, OR, exact phrase, and NOT clauses, as shown in the following screen shot:
Advanced Search Form
As mentioned above, the org.vinwiki.search.SearchCriteria class manages the advanced search form data. I'll refer you to the source code for further details about advanced search. Notice that I'm using a RangeQuery to implement the Date field on the search form.

Testing Search
Be careful when building tests involving Lucene because lib/test/thirdparty-all.jar contains an older version of Lucene. To remedy this, I added the following list of JARs to the top of the test.path path in build.xml:
<fileset dir="${lib.dir}">
    <include name="lucene-core-2.9.2.jar"/>
    <include name="lucene-highlighter-2.9.2.jar"/>
    <include name="lucene-misc-2.9.2.jar"/>
    <include name="lucene-snowball-2.9.2.jar"/>
    <include name="lucene-spatial-2.9.2.jar"/>
    <include name="lucene-spellchecker-2.9.2.jar"/>
    <include name="lucene-memory-2.9.2.jar"/>
    <include name="solr-core-1.4.0.jar"/>
    <include name="solr-solrj-1.4.0.jar"/>
  </fileset>
A new test case was added for testing search, see src/test/org/vinwiki/test/SearchTest.java. While fairly trivial, this class helped me root out a few issues (like updating the spell correction index after an update) while developing the code for this post. So test-driven development shows its worth once again!

Future Enhancements to the Search Engine
There are still a few search-related features I think this application needs, including tagging, phrase handling, and synonyms. Specifically, user's should be able to add tags like "jammy" or "flabby" when rating wines. The application should be able to render a tag cloud from these user-supplied tags as another form of navigation. User tags should also be fed into the search index (using caution of course). Phrase detection complements user-supplied tagging by recognizing multi-term tags during text analysis. How you handle stop words also affects exact phrase matching. The contrib/shingles module helps speed up phrase searches involving common terms, so I'd definitely like to investigate its applicability for this application. Lastly, synonyms help supply the search index (and/or search queries) with additional terms that mean the same thing as other words in your documents. If time allows, I'll try to add a fifth post dealing with tags, phrases, and synonyms. I'd also like to hear from the community on other features that might be helpful.

TODO: Evaluating Search Quality with contrib/benchmark
The contrib/benchmark module is one extension all Lucene developers should be familiar with; benchmark allows you to run repeatable performance tests against your index. I'll refer you to the JavaDoc for using benchmark for performance testing. In the future, I'd like to use benchmark's quality package to measure relevance of search results. The classes in the quality package allow us to measure precision and recall for our search engine. Precision is a measure of "exactness" that tells us the proportion of the top N results that are relevant to the search. Recall is a measure of "completeness" that tells us the proportion of all relevant documents included in the top results. In other words, if there are 100 relevant documents for a query and the results return 80, then recall is 0.8. Sometimes, the two metrics conflict with each other in that as you return more results (to increase recall), you can introduce irrelevant documents, which decreases precision. The quality package will give us a way to benchmark precision and recall and then tune Lucene to improve these as much as possible. If I get time, then I'll post my results (and source).

A last words about scalability
While Lucene is very fast, you can run into search performance issues if you are constantly updating the index, which requires Hibernate Search to close and re-open its shared IndexReader. Hibernate Search has built-in support for a Master / Slave architecture where updates occur on a single master server and searches are executed on replicated, read-only slave indexes in your cluster. I use this configuration in my day job and it scales well. However, you'll need a JMS queue and Message-driven EJB to allow slave nodes to send updates to the master node, so you'll have to deploy your application as an EAR instead of a WAR (to deploy the MDB). Please refer to the online documentation provided with Hibernate Search for a good discussion on how to setup a master / slave cluster.

What's Next?
In the next post, I'll add authentication and registration using Facebook Connect.

5 comments:

  1. A very nice tutorial, thank you!

    ReplyDelete
  2. I do really like your tutorials...you are providing us with top best approaches..But please I have a request: can you provide us with a solution on how to use Facelets and Spring Web Flow in the login register application (the first one not using Seam Framework)..

    ReplyDelete
  3. In response to copernicus ... I've not worked with Spring in over a year, so I'm not the best resource for developing a Facelets + Spring Web Flow example.

    ReplyDelete
  4. This is an excellent tutorial series. I can't wait to see the rest. It's been a great introduction into things I've only scratched the surface on. I can't wait to see your post on how you integrated Mahout. Should be very interesting.

    ReplyDelete
  5. Who are you Guy? I would like to see it in SPRING framework. Anyway, its so fantastic !! I have got the early article about spring authentication and used in my own projects.

    Gilson D.

    ReplyDelete