Search

New Stemmers

Design overview of the search mechanism.

DocBook Mobile currently uses webhelpindexer.jar for indexing. The plan is to replace the webhelpindexer storage method with a Web SQL database.

Search is implemented entirely on the client side; no server is involved. User queries are processed by JavaScript inside the browser, which finds matching results by comparing the query against a simplified 'index' that also resides in JavaScript. The search mechanism has two main parts.

Indexing: First we need to traverse the content in the docs/content folder and index the words in it. This is done by webhelpindexer.jar in the xsl/extensions/ folder. You can invoke it with the ant index command from the root of the mobile directory.

The source of webhelpindexer has moved to its own location at trunk/xsl-webhelpindexer/. Check out the DocBook trunk svn directory to get this source. Then make your changes and recompile by simply running the ant command. I assume it can be opened in the NetBeans IDE with one click. If you are using IntelliJ IDEA, you can create a new project from existing sources.

The indexer supports word scoring, stemming of words, and the languages English, German, and French. For CJK (Chinese, Japanese, Korean) languages, it uses bi-gram tokenizing to break up the words, since CJK languages do not put spaces between words.
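
The bi-gram tokenizing mentioned above can be illustrated with a minimal sketch. The function name bigrams is hypothetical and this is not the indexer's actual code (the indexer itself is a Java program); it only shows the general idea of splitting a run of characters into overlapping two-character tokens.

```javascript
// Hypothetical helper, not part of webhelpindexer: split a run of CJK text
// into overlapping two-character tokens (bi-grams).
function bigrams(text) {
  // Array.from iterates by code point, so surrogate pairs stay intact.
  const chars = Array.from(text);
  if (chars.length < 2) {
    // Nothing to pair up: return the single character, or no tokens at all.
    return chars;
  }
  const tokens = [];
  for (let i = 0; i < chars.length - 1; i++) {
    tokens.push(chars[i] + chars[i + 1]);
  }
  return tokens;
}

console.log(bigrams("日本語検索")); // [ '日本', '本語', '語検', '検索' ]
```

Because the tokens overlap, a query that is tokenized the same way can match a word regardless of where it starts in the text.
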
When ant index is run, it generates five output files:
- htmlFileList.js - This contains an array named fl, which stores details of all the files indexed by the indexer. In addition, the doStem setting in it defines whether stemming should be used. It defaults to false.
- htmlFileInfoList.js - This includes some metadata about the indexed files in an array named fil: the file name, the (HTML) file title, and a summary of the content. The format looks like this: fil["4"]= "ch03.html@@@Developer Docs@@@This chapter provides an overview of how mobile is implemented.";
- index-*.js (three index files) - These three files store the actual index of the content. The index is added to an array named w.
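
As a rough illustration of the fil entry format shown above, the following sketch splits one entry into its three fields. The helper name parseFileInfo and the returned field names are assumptions made for this example; the actual search scripts may read the entries differently.

```javascript
// Sample entry copied from the format description above.
const fil = [];
fil["4"] = "ch03.html@@@Developer Docs@@@This chapter provides an overview of how mobile is implemented.";

// Hypothetical helper: split one "@@@"-separated entry into named fields.
function parseFileInfo(entry) {
  const [file, title, ...rest] = entry.split("@@@");
  // Rejoin the remainder in case the summary itself contains the separator.
  return { file, title, summary: rest.join("@@@") };
}

const info = parseFileInfo(fil["4"]);
console.log(info.file);  // ch03.html
console.log(info.title); // Developer Docs
```
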
Querying: Query processing happens entirely on the client side. The following JavaScript files handle it.
- nwSearchFnt.js - This handles the user query and returns the search results. It tokenizes the query words, drops unnecessary punctuation and common words, applies stemming if the DocBook language supports it, and so on.
- {$indexer-language-code}_stemmer.js - This contains the stemming library. nwSearchFnt.js calls the stemmer method in this file for stemming. For example: var stem = stemmer(foobar);
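
The query steps listed above (tokenizing, dropping punctuation and common words, optional stemming) can be sketched roughly as follows. This is not the code in nwSearchFnt.js: the names prepareQuery, STOP_WORDS, and toyStemmer are hypothetical, the stop-word list is an assumed sample, and toyStemmer is a deliberately trivial stand-in for the real stemmer provided by the language-specific stemmer file.

```javascript
// Assumed sample of common words to drop; the real list may differ.
const STOP_WORDS = new Set(["a", "an", "and", "is", "of", "the", "to"]);

// Placeholder stemmer for illustration only: strips one trailing "s".
// The real stemmer(...) comes from the language-specific stemmer file.
function toyStemmer(word) {
  return word.length > 3 && word.endsWith("s") ? word.slice(0, -1) : word;
}

// Hypothetical helper: turn a raw query string into a list of search terms.
function prepareQuery(query, doStem, stem = toyStemmer) {
  return query
    .toLowerCase()
    // Replace anything that is not a letter, digit, or whitespace with a space.
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .split(/\s+/)
    // Drop empty fragments and common words.
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w))
    // Stem only when stemming is enabled (compare the doStem setting).
    .map((w) => (doStem ? stem(w) : w));
}

console.log(prepareQuery("The indexing of documents!", true));
// [ 'indexing', 'document' ]
```

The resulting terms would then be looked up in the index array w. Whatever stemming is applied to the query has to match the stemming applied at indexing time, which is why the doStem setting is written out alongside the file list.
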