A maze of twisty little passages

In the previous two posts I have introduced both the concept of stemming and Mitopia’s framework for combining multiple stemming algorithms with backing dictionaries in order to improve stemmer accuracy in any given language.  In this post I want to describe the basic concepts of the Fairweather Stemmer (referred to hereafter as the ‘new stemmer’), a patented new stemming algorithm that:

  • Can be applied to any language or scripting system held as UTF-8 text.
  • Can be trained by normal people to attain close to 100% accuracy in a language (standard algorithms max out at 30% or less).  A complete GUI environment for training (not detailed in this post) is provided.
  • Unlike today’s algorithms, can generate multiple roots for any given word.  Handles multi-root words, suffix and prefix stripping & mapping.
  • Generates true linguistic (human readable) roots.
  • Maps/conflates non-English stemmed roots to English equivalents to allow accurate cross-language and multi-language search.
  • Is fast enough to be applied in all standard stemming applications.
  • Runs within the previously described stemming framework and so can take advantage of high speed specialized dictionaries (people, places, proper names, acronyms, etc.).

Essentially this universal stemming approach, based as it is on shortest path techniques rather than linguistic principles (which of course are different for each language), has the potential to replace all other stemming algorithms while vastly improving stemming accuracy and utility, thus greatly improving text search regardless of language.  The technique is fundamental to Mitopia’s cross language querying capabilities, but can be applied in any other search context through use of the available stand-alone library.

Overview

Now that we have described the stemming framework itself and how it is integrated with any existing database, it is possible to describe the shortest-path based stemming algorithm in isolation from any surrounding context. As discussed previously, the new stemmer algorithm is based on the premise that in essentially all languages, it is possible to construct all word variations from the following basic pattern:

[ Prefix ] Root { [ Infix ] Root } [ Suffix ]

The algorithm requires an extremely fast lexical analyzer lookup in order to allow it to recognize all the various word portions above.  In the preceding description of the basic stemmer context structure, the fields  ‘fRoots’, ‘fPrefix’, ‘fSuffix’, and ‘fInfix’ represent the various recognizers required by the lexical analyzer LX_Lex() in order to recognize all allowed string fragments that make up any of the legal values for the root, prefix, suffix, and infix sequences in the language concerned.  These dictionaries are loaded automatically during initialization (within “makeStemmerContext”) by the framework if the corresponding files are present in the directory for the language concerned.  Given these dictionaries, the algorithm for the stemmer comprises a main routine and a recursive subroutine, where the main routine essentially handles exploring all possible combinations of prefix and suffix (including of course no prefix and no suffix), while the recursive inner subroutine is responsible for exploring all sequences of root(s) and infix fragments that could make up the central portion of the word being stemmed.  The inner subroutine must be recursive (or equivalent) in order to handle arbitrary numbers of roots within a word as implied by the phrase “Root { [ Infix ] Root }”.
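For orientation, the fields just mentioned can be pictured roughly as follows.  This is a sketch only, assuming the field names quoted above; the real public declaration is the type ET_BasicStemCont, and the types shown here are guesses:

    typedef void *DictRef;     /* stand-in for a lexical analyzer recognizer/dictionary */

    /* Guessed cost calculator signature; see "Cost Calculations" below */
    typedef long (*CostCalcFn)(long partKind, long length, long bytesPerChar);

    typedef struct {             /* sketch of the basic stemmer context            */
        DictRef    fPrefix;      /* legal prefix fragments (may map output)        */
        DictRef    fRoots;       /* legal root fragments and their mappings        */
        DictRef    fInfix;       /* legal infix fragments                          */
        DictRef    fSuffix;      /* legal suffixes, loaded with kFlipCharOrder     */
        CostCalcFn fCostCalc;    /* cost calculator (default rarely overridden)    */
        long       bytesPerChar; /* average UTF-8 bytes per character, 1 by default*/
        /* ...other framework fields omitted... */
    } BasicStemContextSketch;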

If there is more than one prefix or suffix involved, the algorithm sorts them in increasing order of length and then appends the no prefix or no suffix case to the end of this sorted list.  The prefix and suffix loops in the main routine then start from the no prefix/suffix case and proceed backwards from the longest prefix/suffix to shorter prefixes and suffixes.  This strange ordering is chosen for a number of important reasons, viz:

  1. By starting with the no prefix, no suffix case, the common case of recognizing a word that is already a complete root word (i.e., has no prefix or suffix) is examined first, and if found, the loops are immediately terminated thus ensuring that this common case results in minimal execution time.
  2. Sometimes, as a result of the vagaries of the cost curves discussed below, a situation arises where the minimum cost path through a word does not result in the correct stem, and this cannot be fixed by any means other than adding a longer root word fragment mapped to the same root output to cause a new lower cost path.  The fact that the no affix (i.e., no suffix and no prefix) case occurs first allows this technique to be used as a last resort to solve any otherwise unsolvable problem with cost calculations, since, as can be seen, the routine will stop immediately if it recognizes a naked root word.
  3. After the no suffix/prefix case, the fact that the loop then continues in decreasing order of affix size is taken advantage of by the inner recursive function in order to differentially favor longer trailing roots in the word over longer suffixes, and this logic significantly reduces the amount of dictionary tweaking necessary to overcome suffix/trailing root ambiguities.
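To make the ordering concrete, the following is a minimal sketch of the list construction and traversal just described (the types and the tryAffix() placeholder are invented for illustration):

    #include <stdlib.h>
    #include <string.h>

    extern void tryAffix(const char *affix);      /* placeholder for the real work */

    static int byLength(const void *a, const void *b)
    {   /* ascending length comparator */
        return (int)strlen(*(const char *const *)a) -
               (int)strlen(*(const char *const *)b);
    }

    /* 'affixes' must have room for n + 1 entries */
    static void walkAffixes(const char *affixes[], int n)
    {
        qsort(affixes, (size_t)n, sizeof *affixes, byLength);
        affixes[n] = "";                          /* append the no-affix case   */
        for (int i = n; i >= 0; i--)              /* no affix first, then       */
            tryAffix(affixes[i]);                 /* longest down to shortest   */
    }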

When constructing the list of suffixes for any given word, the function first inverts the order of the characters in the word, and then uses a variant of LX_Lex() to obtain a set of suffixes.  It was noted earlier that the custom option ‘kFlipCharOrder’ when passed to the dictionary load routine, causes it to construct a recognizer for the dictionary entries with the character order swapped.  Since this option is always specified when suffix dictionaries are loaded, the recognizer is already set up to recognize the suffixes that start the flipped word to be stemmed, and thus the algorithm avoids a great deal of time and complexity that would otherwise be required in order to obtain a list of all potential suffixes.
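The flip itself is simple enough to sketch; the routine below (an illustration, not the framework’s actual code) reverses the character order of a UTF-8 string while keeping the bytes within each multi-byte character intact, which is the form a word must take before being fed to a recognizer built with ‘kFlipCharOrder’:

    #include <string.h>

    /* Reverse character (not byte) order; 'out' must hold strlen(in) + 1 bytes. */
    void flipCharOrder(const char *in, char *out)
    {
        size_t len = strlen(in), pos = len;
        for (size_t i = 0; i < len; ) {
            size_t n = 1;                         /* count bytes in this character */
            while (i + n < len && (in[i + n] & 0xC0) == 0x80)
                n++;                              /* 10xxxxxx = continuation byte  */
            pos -= n;
            memcpy(out + pos, in + i, n);         /* copy the whole character      */
            i += n;
        }
        out[len] = '\0';
    }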

Cost Calculations

The algorithm obtains a cost measure for both the prefix and suffix as well as roots and infixes by calling the cost calculator function set up in the basic context field ‘fCostCalc’, and passing it the word part involved, the number of characters involved, and the average number of bytes per character for the language (set up by the “makeStemmerContext” call in the basic context field ‘bytesPerChar’, which is 1 by default).  It is actually possible to alter this calculation for a given language by overwriting the default value placed in this field by QX_InitBasicStemContext().  Thus far however, and somewhat surprisingly, we have found that the default cost calculator function appears to work well for all languages, regardless of script system, and for this reason it is recommended that you do not change the cost computation for your language unless it is clear that some unique feature of the language makes this necessary.

If we plot the cost curve for each word part against the normalized number of characters involved (i.e., divided by the average number of UTF-8 bytes per character in the language) we obtain the following curves:

  • root: cost = 2N - N/2 + 1
  • prefix: cost = 2N + 1
  • suffix: cost = 2N - 2
  • infix: cost = 2N + 2
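In code form the default curves amount to something like the sketch below.  Note that the worked costs quoted in this post (“contraction” as 17 vs. 20, and the 38-cost path later on) are reproduced only if N/2 is computed with integer division, so that is assumed here; the real fCostCalc signature is not given in this post:

    enum { kPrefixPart, kRootPart, kInfixPart, kSuffixPart };   /* illustrative */

    static long defaultCost(int partKind, long length, long bytesPerChar)
    {
        long n = length / bytesPerChar;     /* normalized character count */
        switch (partKind) {
            case kRootPart:   return 2*n - n/2 + 1;
            case kPrefixPart: return 2*n + 1;
            case kSuffixPart: return 2*n - 2;
            case kInfixPart:  return 2*n + 2;
            default:          return 0;
        }
    }

For example, the root “contract” (N = 8) costs 13 and the suffix “ion” (N = 3) costs 4, giving the total of 17 discussed below.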

From the cost curves we can see a number of significant points that are crucial in order to minimize the number of special cases that must be explicitly handled in the dictionaries, thereby minimizing dictionary size and reducing the amount of work necessary to get correct results in all cases.

The first point is that at the low end of the character axis, the cost of suffixes is significantly less than that of roots, whereas as the root fragments become longer they become significantly cheaper than the corresponding suffix.  This behavior serves two main purposes.  The first is that the algorithm tends to favor small suffixes at the end of a word rather than finding a path by stringing together multiple roots.  As an example, consider the English word “contraction” which can be parsed either as two roots as in “contract” and “ion” for a cost of 20, or as a single root “contract” followed by the suffix “ion” for a cost of 17.  This behavior causes the ambiguity between small trailing root words like “ion” and suffixes to be resolved in favor of the suffix in most cases, and it turns out that this is normally the appropriate behavior.  As the length of the suffix increases, however, this behavior becomes inappropriate, since long suffixes might cause significant trailing roots to be lost.  An example of this is the English word “rationalistically”, which can be parsed in two ways, namely as the root “ration” followed by the suffix “alistically” (cost 30), or the root “rational” followed by the suffix “istically” (cost 29).  The increased cost associated with the long suffix “alistically” is what causes the longer root “rational” to be chosen over the shorter root “ration”.  We can see that there is a grey area in the curves for root and suffix sizes of 4 or 5 characters, and it is this grey area that is the reason why the algorithm proceeds from the longest suffix to the shortest, since this allows the inner recursive loop to detect a situation where the costs are identical and choose in favor of the longer root when this occurs.  Because the longer root will be examined after the shorter one, the algorithm has all the information necessary to make the right choice.

The higher costs of prefix fragments over roots ensures that prefix sequences are only chosen over alternate root fragments when appropriate, and the various words beginning with “ab” in Table 1 show examples of this behavior.  The fact that infix sequences are far more expensive than any other component is designed to prevent the possibility of multi-root words being chosen over longer roots as a result of being split by small one or two letter infix sequences.

The plus 1 term for root words is critical in ensuring that any word that can be interpreted either as two adjacent roots or as a single longer root is resolved in favor of the longer root.  An example in English might be the word “beestings” which has nothing to do with bees or their stings.  In addition, the inner recursive routine keeps track of its recursion depth, which corresponds to the number of roots that have been chained together in the path, and it also adds this depth value to the cost for any root word fragments.  This results in paths constructed by chaining multiple small roots becoming progressively more expensive as the number of roots increases.  This again has the beneficial effect of further favoring longer roots.  As mentioned above, experience has shown that these curves appear to work quite well regardless of language; however, the framework allows alternate cost curves to be substituted for any language if the defaults do not result in easy dictionary management.

Mapping Fragments to Stemmed Output

Another aspect of the outer stemming algorithm that still needs to be explained is its behavior in copying the results of the stemming process back to the stemmed output buffer.  The inner recursive algorithm, which evaluates root and infix sequences, returns the result of the inner parse into a buffer (which may contain one or more words as a result of root word mapping) to the outer routine.  Since both prefix and suffix fragments can optionally be mapped to an output word, the outer routine must assemble a sequence of words in the right order to represent the extracted meaning.  This behavior of outputting multiple words in response to stemming is unique to the new stemmer and yet it is critical to improving search results, since without it, prefix stripping (at least in English) becomes impossible, and the dictionaries must be much larger and also less expressive.  Examples of the need for this are shown in Figure 1 (earlier post) where the words “abnormally” and “abaxial” are broken into two output words with a leading “not” that is generated as a result of prefix mapping.  In English, if the stemmer failed to do this, and the search was for the word “normal”, then no hits would be returned for documents containing the word “abnormal”, which is incorrect behavior.  Note however that with the new stemmer this is not the case.  Also note that searches for “abnormal” will return not only hits containing words derived from “abnormal”, but more importantly, if the author chose to say “not normal” instead of “abnormal”, these documents will also be returned as a result of stemming on the query terms.

The mapped fragments for the prefix and suffix, if relevant (suffixes in English rarely result in a mapping, whereas prefixes usually do, but the situation is different in other languages), are fetched from the appropriate dictionaries and prepended by default to the output text.  Note however that the algorithm allows the dictionary entry to start with a “+” character, and this instead results in appending the word after the word(s) resulting from root mapping.  In other languages, due to the differing ordering of words, it is often more correct to append rather than prepend affix mappings, and clearly this is fully supported by the algorithm.  The multi-root words in Table 1 such as “abdominothoracic” and “abdominoanterior” are also examples of how existing stemming approaches fail to provide accurate search results, since if the search was on the word “abdomen”, neither of these words would result in hits in a classic stemming system.  Similarly a search on “thorax” would fail to return documents containing the word “abdominothoracic”.  In certain specialized fields such as medicine or chemistry, where words are commonly constructed from multiple significant roots (the word divinylbenzene in Table 2 is an example), this shortcoming in today’s text search systems can be critical, and hence the new stemming algorithm can provide considerable benefit.

One complexity that has been glossed over up until now is the fact that certain prefix and suffix fragments may well be identical to a root fragment (examples are the prefix “arch” or the suffix “less”).  In these cases the algorithm will probably choose the root over the affix because of the cost curves, whereas we may instead wish to re-interpret (or suppress) the meaning if the root is being used as an affix rather than a root.  For example, in the English word “clueless” we may wish to map the trailing root “less” so as to cause the stemmed output to become “without clue”, which is the correct effect of the suffix “less”.  Clearly we cannot do this if the “less” is recognized as a root since its meaning as a root is simply “less”.  In fact the situation can be considerably more complex, an example being the trailing suffix/root “able”, “ably”, “ability”, “abilities”, “ableness” etc.  If we wish to recognize these words as suffixes and strip them off to produce no output (i.e., “teachable” -> “teach” and not “teach” “able”), we must make sure that for the root “able”, all possible suffix forms are recognized by the root dictionary, and then in the suffix dictionary we must also enter each suffix form of “able” and map the output to “-” alone.  A mapped output starting with a ‘-’ character forces the affix to be output before the root words, but this is redundant since this is the default behavior, and so a mapping of just “-” with no additional text can be used to suppress output altogether.  Similar cases occur with roots used as prefixes; for example, when the root “arch” occurs as a prefix (as in “archangel”), its meaning on output must be re-mapped to “chief” rather than “arch”.  The logic associated with detecting and performing these essential re-mapping behaviors is quite complex and is invoked within the algorithm as required.  Without these behaviors, certain words in many languages cannot be stemmed to their true meanings.

Some of the same problems occur with certain root fragments independent of affixes.  For example, consider the English word “sexennial”, the leading portion of which may be recognized as a fragment mapping to the root word “six”, but the remaining “ennial” cannot realistically be associated with the word “year” since the spelling in this case is not derived from “annual”.  In this case we want to preserve the hidden root “ennial” as “year” in our output, as in “six year”.  To do this we can enter the fragment itself under the root “six” using the form “sexennial+year”, and as with other occurrences of the ‘+’ character in dictionaries, this causes the text that follows the ‘+’ to be output after the root.  In fact in other languages the situation can be more complex than this, and the root dictionary fragments support any of the following forms:

  1. fragment — simply copy the corresponding root word to the output buffer
  2. fragment- — this means the root cannot be followed by other roots.  Handy for dealing with overlapping situations such as “super” and “superb” as in the word “superbitch” in order to force “super bitch” rather than “superb itch”.
  3. fragment-root2{ root2n} — copy root2 (and optionally any following roots) to output buffer, followed by space, followed by root word
  4. fragment+root3{ root3n} — copy the corresponding root word followed by a space, then root3 (and any following roots) to the output
  5. fragment-root2{ root2n}+root3{ root3n} — effectively the same as adding the two preceding forms allowing output both before and after the root
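Purely for illustration (the dictionary file layout is not shown in this post, and 'frag', 'alpha' and 'omega' are placeholder names), entries of the five forms and their effect on the output would look something like:

    teach               form 1: output the corresponding root word
    super-              form 2: output the root; no further root may follow it
    frag-alpha          form 3: output 'alpha', then the corresponding root
    sexennial+year      form 4: output the root ('six'), then 'year'
    frag-alpha+omega    form 5: output 'alpha', the root, then 'omega'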

Once again, the logic to handle these output mappings is complex and is broken out into a separate routine invoked by the main stemmer routines.  The dictionary option “kStopAtPlusMinus” discussed under the “makeStemmerContext” plug-in is required in order to support this behavior and by default is set for the “fRoots” dictionary.

Listing 1 – Pseudo-code for the outer stemmer function

Listing 2 – Pseudo-code for the inner recursive function

In this post we are only interested in describing the behavior of the algorithm at a sufficient level to allow understanding for the purposes of populating the stemmer dictionaries and resolving conflicts in the cost calculations when they occur. 

The top listing (1) above gives the high level pseudo-code for the outer function of the algorithm, while the lower listing (2) gives the pseudo-code for the inner recursive function.  For clarity, most of the details have been omitted from both listings.  The routine of (2) has two sections; the upper section handles recursive calls looking for the sequence “{ [ Infix ] Root }” while the lower section handles the initial “Root” term, which cannot be preceded by an infix sequence.  Obtaining a list of all possible infix sequences is handled in the same manner as obtaining a list of affixes in the outer algorithm, but uses the ‘fInfix’ dictionary.  Given this list, the routine loops over all possible infixes, computing the infix cost and then calling a variant of LX_Lex() to recognize any trailing root.  If a trailing root is found there are two possible conditions: either that root contains enough characters to complete the path, or it does not, in which case the routine must call itself recursively to look for another “[ Infix ] Root” path to complete the parse of the remaining characters.

As with the outer routine, the infix and root costs are compared with the minimum cost found so far in order to determine if a new lower cost path through the inner text has been found, and if so, the mapped output for this path is copied to the output buffer.  Note that the routine adds the recursion depth ‘depth’ to the cost for any root found, and, as discussed previously, this logic is intended to favor finding longer roots over sequences of smaller roots.  In the case where the current root is the last root required to complete the parse, the routine contains logic to determine if this root is longer than any previously found trailing root, and if so, to alter the root cost in order to cause the resulting path to be 1 cheaper than the previous path with the shorter trailing root.  As discussed previously, this logic is intended to favor longer trailing roots over longer suffixes.
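Because listings (1) and (2) appear above only as images, the following deliberately simplified, self-contained C sketch shows the same search in executable form.  Flat string arrays stand in for the lexical analyzer dictionaries, and output mapping, the affix ordering, cost editing and all of the optimizations are omitted; only the minimum path cost is returned.  It does, however, reproduce the costs quoted earlier:

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    static const char *prefixes[] = { "a", NULL };
    static const char *suffixes[] = { "ion", "istically", "alistically", NULL };
    static const char *infixes[]  = { "s", NULL };
    static const char *roots[]    = { "contract", "ion", "ration", "rational", NULL };

    static int cost(char kind, int n)             /* the default cost curves */
    {
        switch (kind) {
            case 'R': return 2*n - n/2 + 1;
            case 'P': return 2*n + 1;
            case 'S': return 2*n - 2;
            default:  return 2*n + 2;             /* infix */
        }
    }

    /* Parse s[0..len) as Root { [Infix] Root }; depth counts roots so far. */
    static int inner(const char *s, int len, int depth)
    {
        if (len == 0) return depth ? 0 : INT_MAX; /* complete only after >= 1 root */
        int best = INT_MAX;
        for (int i = -1; i < 0 || infixes[i]; i++) {
            if (i >= 0 && depth == 0) break;      /* no infix before the first root */
            const char *ix = (i < 0) ? "" : infixes[i];
            int il = (int)strlen(ix);
            if (il > len || strncmp(s, ix, il) != 0) continue;
            for (int r = 0; roots[r]; r++) {
                int rl = (int)strlen(roots[r]);
                if (il + rl > len || strncmp(s + il, roots[r], rl) != 0) continue;
                int rest = inner(s + il + rl, len - il - rl, depth + 1);
                if (rest == INT_MAX) continue;    /* remainder cannot be parsed */
                int c = (i < 0 ? 0 : cost('I', il)) + cost('R', rl) + depth + rest;
                if (c < best) best = c;           /* depth term penalizes chains */
            }
        }
        return best;
    }

    /* Try every prefix/suffix combination (including none) around an inner parse. */
    static int stemCost(const char *word)
    {
        int wl = (int)strlen(word), best = INT_MAX;
        for (int p = -1; p < 0 || prefixes[p]; p++) {
            const char *pf = (p < 0) ? "" : prefixes[p];
            int pl = (int)strlen(pf);
            if (pl > wl || strncmp(word, pf, pl) != 0) continue;
            for (int s = -1; s < 0 || suffixes[s]; s++) {
                const char *sf = (s < 0) ? "" : suffixes[s];
                int sl = (int)strlen(sf);
                if (pl + sl >= wl || strcmp(word + wl - sl, sf) != 0) continue;
                int mid = inner(word + pl, wl - pl - sl, 0);
                if (mid == INT_MAX) continue;
                int c = (pl ? cost('P', pl) : 0) + (sl ? cost('S', sl) : 0) + mid;
                if (c < best) best = c;
            }
        }
        return best;
    }

    int main(void)
    {
        printf("%d\n", stemCost("contraction"));        /* 17: contract + -ion       */
        printf("%d\n", stemCost("rationalistically"));  /* 29: rational + -istically */
        return 0;
    }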

The actual code for these routines is considerably more complex, and is heavily instrumented to allow callers to obtain a detailed insight into the paths explored by the algorithm and the cost calculations involved.  This fact is taken advantage of extensively by the development GUI to allow the dictionary developer to understand how the algorithm is operating in any given stemming sequence.

Stemming Compound Words, an Example

To give an example of the true power of the algorithm to take apart complex words, consider the English word “abetalipoproteinemia” which is a rare disorder affecting the absorption of fats and vitamins.  The disease is related to the affliction “lipoproteinemia”, which is in turn derived from the root words “lipid”, “protein”, and “emia” meaning disease or sickness.  The “abeta” portion of the word refers to a special group of lipids known as “beta” lipids and the preceding “a” implies that the root of the disorder is the inability to transport beta lipids.  Obviously any conventional stemming algorithm has absolutely no chance of maintaining the relationships of this word with any of the constituent parts of the word after stemming, and thus anyone searching on any of these concepts will be unable to directly find any reference to this rare disorder.


Figure 1 – Stemming the Word “abetalipoproteinemia”

Figure 1 above shows the debugging printout from the Mitopia® stemming development environment when asked to stem the word “abetalipoproteinemia”.  Each line of the printout represents a possible path to stemming the word explored by the algorithm.  Notice first that the stemmer output is “not beta lipid protein sick” which, amazingly enough, is exactly the output one would require in order to maintain searchability of this disease and all its related concepts.  The format of the “Paths” listing in Figure 1 will be described in detail later; however, from the listing we can see that the algorithm explored a total of 81 possible paths through this word, finding just 3 complete paths to account for the entire word.  Of the paths found, the lowest cost path (with a total cost of 38) was comprised of the following sequence of fragments:

Prefix:’a’, Root:’beta’, Root:’lipo’, Root:’protein’, Suffix:’emia’

After output mapping, this path results in the correct stemming output of “not beta lipid protein sick”.  Clearly anyone searching on the terms “lipoproteinemia”, “proteinemia”, “lipid”, “protein”, “beta lipid”, or any one of a number of other search terms, would have correctly found this reference given this stemmer output, and the same cannot be said of any other algorithm.
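For readers checking the arithmetic against the default cost curves given earlier (assuming integer division for N/2, and that the recursion-depth penalties 0, 1 and 2 fall on the three roots; the total is the same however they are assigned), the cost of 38 breaks down as:

    Prefix 'a'       (N=1): 2*1 + 1           =  3
    Root   'beta'    (N=4): 2*4 - 4/2 + 1 + 0 =  7
    Root   'lipo'    (N=4): 2*4 - 4/2 + 1 + 1 =  8
    Root   'protein' (N=7): 2*7 - 7/2 + 1 + 2 = 14   (integer division: 7/2 = 3)
    Suffix 'emia'    (N=4): 2*4 - 2           =  6
                                        total = 38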

To clarify the path searching nature of this algorithm, including the inner paths, Figure 2 below illustrates the complete set of paths explored by the algorithm for this one word as indicated by the path listing shown in Figure 1.  In Figure 2, dotted paths indicate a failed attempt to connect the node to the indicated suffix (an unconnected dotted path indicates failure to connect to all suffixes).  The chosen path is indicated in red while the two alternate (more costly) paths are indicated in blue.  The label “r=” indicates a fragment chosen from the “fRoots” root dictionary while “i=” indicates a fragment chosen from the “fInfix” infix dictionary.


Figure 2 – Paths Through the Word “abetalipoproteinemia” (Click to enlarge)

Understanding Stemmer Path Traces

The format of the shortest path search trace displayed in the paths display control is a series of one line entries comprised of the bracketed path segments that make up the path being explored as follows:

  1. [P:prefix | cost] – This is the prefix sequence in effect for the path concerned and also the cost for that prefix according to the cost-calculator function. If there is no text between the ‘:’ and the following ‘|’, no prefix was in effect. In most cases, if there is no prefix in effect, the “[P:…]” sequence is omitted entirely.
  2. [S:suffix | cost] – This is the suffix sequence in effect for the path concerned and also the cost for that suffix according to the cost-calculator function. If there is no text between the ‘:’ and the following ‘|’, no suffix was in effect. In most cases, if there is no suffix in effect, the “[S:…]” sequence is omitted entirely.
  3. [I:infix | cost] – This is the infix sequence in effect for the path concerned and also the cost for that infix according to the cost-calculator function. If there is no text between the ‘:’ and the following ‘|’, no infix was in effect. In most cases, if there is no infix in effect, the “[I:…]” sequence is omitted entirely.
  4. [R:root | cost] – This is the root sequence in effect for the path concerned and also the cost for that root according to the cost-calculator function.

Since there can be multiple roots in sequence with optional infix fragments between them, multi-root words may display a repeated series of infix and root blocks on any given line.  Figure 1 in the preceding section shows a good example of a large and complex path dump.

The order of the lines appearing in the output corresponds to the search order used by the algorithm to explore the paths.  If a path fails to complete a successful parse, the corresponding line terminates with the text “– No Match”.  If the algorithm determines that the cost of the path so far exceeds the current minimum cost, even if it has not completely explored the path, it aborts and the line is terminated with the text “– Abort” (see the screen shot to the right).  This optimization behavior is not detailed in the description of the algorithm since it is not relevant to the algorithm itself but is merely a timing optimization.  In addition, the algorithm contains a further optimization to not explore the recursive inner portion for the same prefix if, on a previous pass, the total number of characters processed is less than that implied by the selected suffix length, since it is clear without exploration that this would not lead to a positive result.  This optimization will cause the path line to be terminated by the string “– Not Tried”.  This optimization can significantly improve execution time, but is not described in this document to avoid obscuring the basic algorithm.

Occasionally for the root word cost, the form “cost<-cost” may be displayed as in the screen shot to the right (for the word “abbacies”).  This is an indication that the logic associated with favoring longer trailing roots has caused the root cost to be adjusted downward (from 9 to 8 in this case) in order to accomplish this feature.  You may also see cost editing indications for affixes which are the result of other specialized logic to try to force correct operation without the need for word-specific adjustments in the largest number of cases.

[Screen shots – path traces for the words “perigastrulation” and “formerness”]

As previously discussed, it is quite often the case that an affix sequence is also a valid root word, but when used as an affix we want it to output a different meaning (or possibly to suppress the root output entirely).  You will see cases in the paths listing where the techniques described to re-map affix outputs to change or suppress the root are shown as entries of the form “[R<-P:cost]” or “[R<-S:cost]”, depending on whether a prefix or suffix is involved.  The screen shots above for the words “perigastrulation” and “formerness” illustrate the appearance of these features.

In the prefix example, the word “peri” exists in both the root word and the prefix dictionaries, but it is mapped to the root word “around” in the prefix dictionary.  When “peri” is used as a prefix as part of a shortest path, the fact that it has a mapping to the root “around” is discovered, and the form “R<-P” indicates that this mapping has been made by the algorithm (with a corresponding cost reduction from 9 for the prefix “peri” to 7 for the root “peri”).  In the “formerness” case, the word “ness” is both a suffix (in which case it should be suppressed entirely, i.e., mapped to “-”) and a root.  In this case the recognition of the suffix “ness” as part of a shortest path results in mapping to the root form “ness”, which is suppressed on output by the “-”.  This mapping actually results in a cost increase from 6 for the suffix to 8 for the root form.  The use of these techniques to resolve stemming conflicts between affixes and roots is discussed later.  Be aware that the lines beginning with “Min COST:” in the listing indicate that one of the preceding lines contains a minimum cost path, though not necessarily the immediately preceding line.  This delay in printing the minimum cost indication is caused by the recursive nature of the algorithm, and it means that you may have to scan upwards in the listing to find the line that corresponds to the cost indicated.  Possible candidate lines will always end without a trailing text sequence such as “– No Match”.

The use of the root fragment mapping form “fragment-” (see earlier discussion) may result in output of a line ending in “– Inhibit” which implies that no further roots were examined because the earlier root fragment cannot be followed by other roots.

Because of the recursive nature of the stemming algorithm, the sequence in which the various affixes are tried determines the order in which the entries in the list are tried and printed out.  As discussed earlier, the outer function of the algorithm first tries the case where there is no prefix, and within that case it then loops over the suffix cases starting with no suffix, then from the longest suffix down to the shortest.  The outer function then proceeds to the longest available prefix and loops through all suffix forms as for the previous loop.  This process is repeated until the possibilities for the shortest available prefix have been fully explored.   Since this is the order the algorithm uses to explore the paths, this is also the order in which the path dump is listed.  The prefix and suffix portions of the list for each line are always displayed before the results produced by the inner recursive function of the algorithm for the infix/root sequences.

The paths listing will often contain lines that do not fit into the space available for the text control that contains the listing and these lines will be “wrapped” and continue onto the next line of the display.  You can increase the horizontal size of the text control by growing the window and also by enlarging the lower panel area allocated to the paths listing by use of the panel resize control.  See the GUI description for details.

Many of the lists displayed in auxiliary windows associated with the stemming development environment contain a list column entitled “Min Cost Path”.  This column contains the one line from the full path listing (as described in this section), that was chosen as the final minimum cost path for the successful stemming operation.  Understanding the contents of the cells in the “Min Cost Path” column is therefore just a subset of the problem of understanding the full path list described in this section.

Summary

As can be seen, in the case of this word, there are approximately 40 distinct ‘nodes’ in the diagram and the relationship between this algorithm’s approach to stemming words and the “traveling salesman” or shortest path problem is clear.  Shortest path problems are normally very compute intensive, however in this case the combination of the fast lookup using Mitopia’s lexical analyzer together with taking advantage of the directed nature of the paths, allow the algorithm to be implemented with no significant increase in execution time over traditional stemmer algorithms.

The most important thing to remember about stemming is that its purpose is to improve the accuracy of text search, not to improve the readability of the output text (which may never be viewed).  It is clear then that if the system is searched using the key word “abetalipoproteinemia”, it should return all instances of this word (and plurals etc.).  You might think that since the word was broken up into the phrase “not beta lipid protein sick” as it was indexed into the database, this would make it harder to find when the search term was “abetalipoproteinemia”.  However, the search terms themselves must obviously be stemmed in an identical manner before they are checked against the database index, and so a one word search for the word “abetalipoproteinemia” is turned into a search for the sequence of words “not beta lipid protein sick” immediately next to each other, and hence correctly returns all matching words in the text database.  This means that since this stemmer violates the almost universal assumption that the output of a stemming algorithm is just a single word, the database engine itself must be modified somewhat to be aware of the multi-word stemmer output, and to modify queries into sequence queries as necessary to account for it.  Once this has been done, we have a system that behaves just as a conventional one does in this simple case.

Of course this is only the tip of the iceberg as far as search is concerned, since each stemmed word in the phrase “not beta lipid protein sick” has been separately indexed into the database, and so it is possible to search on “lipoproteinemia” or just “proteinemia” and find hits for these diseases as well as “abetalipoproteinemia”.  This is because a search for “lipoproteinemia” will be turned into a sequence search for “lipid protein sick” while a search for “proteinemia” will be a sequence search on “protein sick”, and thus both key words will return hits for “abetalipoproteinemia” also.  This is as it should be obviously, but no existing system is capable of this kind of search due to limitations in today’s stemming algorithms.  Obviously, searches on “lipids”, “sicknesses” and a wide variety of other relevant terms will also correctly find references to this disease, which once again is a critical behavior for researchers in the medical field, and which is not possible with conventional search systems.

More importantly, as we will see later, the goal is to facilitate search across all languages without regard to the query language.  Compound words like this in any language will prevent any cross-language search for lack of a common root.  The new stemmer algorithm overcomes this problem by allowing words in all languages to be broken into simple and mappable root word sequences, and thus a multi-lingual search in Arabic using the Arabic word for sickness should correctly return a hit for “abetalipoproteinemia”!

Over and above the higher accuracy of the Fairweather stemmer (up to 100% vs. 30%) in conflating word variants that may not follow the normal rules of the language (e.g., teach and taught), thereby improving search accuracy, it is the algorithm’s ability to maintain the important relationships within a word by converting complex words to word sequences, as illustrated above, that is its most important benefit, and which holds the potential to revolutionize text search as it exists today.

A Stemming Framework

We have previously discussed the subject of stemming and its importance to search.  Now, for historical perspective, I want to explore Mitopia’s stemming framework, which allows a variety of stemming algorithms and techniques to be combined to achieve better results.  It should be noted however that this framework and its dictionaries were made largely irrelevant (though they are still in place for proper name identification) once Mitopia’s Fairweather stemming algorithm was developed.  This is because no other stemming algorithm can even approximate the same accuracy, and so there is little point in including them.

Overview

Mitopia’s stemming solution comprises a unique framework for combining stemming algorithms together in a multilingual environment in order to obtain improved stemming behavior over any individual algorithm, together with a new and unique language independent stemming algorithm based on shortest path techniques (the Fairweather stemmer).  The stemming framework provides the necessary logic to combine multiple stemmers in parallel and to merge their results to obtain the best behavior; it also provides mapping dictionaries to overcome the shortcomings of classical stemmers by handling such things as irregular plurals, tense, phrase mapping and entity recognition (to prevent name stemming), even though these dictionaries are largely unnecessary when using the Fairweather stemmer.

The Fairweather stemmer uses a dictionary driven approach to find and combine word parts, and is thus a hybrid stemmer combining aspects of the affix and brute-force approaches, while avoiding the problems with both by using a shortest path algorithm to select from one of a number of possible strategies to stem any given word.  The algorithm is far more powerful than any other existing approach, and is based on the premise that in essentially all languages, it is possible to construct all word variations from the following basic pattern:

[ Prefix ] Root { [ Infix ] Root } [ Suffix ]

Where “[…]” implies zero or one occurrence and “{…}” implies zero or more occurrences.  In this pattern, the following word components are defined:

  1. Prefix – is a prefix modifier that will be discarded (or converted to a separate root word) on output.  An example is the prefix “un” (meaning not) in the word “unbelievable”.  This will be converted to the form “not believe” as a result of prefix recognition.
  2. Root – is a root word fragment in the language.  An example is the word “teach” which is the root of “teaching”, “teachers”, “taught” etc.  The fragment may differ from the actual root and is mapped to the true root as a result of stemming.
  3. Infix – is an infix modifier occurring within a compound word which will be discarded (or converted to a separate root word) on output. Infixes are rare in English, but common in some other languages where words are assembled out of smaller fragments.  An example of an infix sequence in English might be “s” as in the dual root word “woodsman”.
  4. Suffix – is a suffix modifier that will be discarded (or converted to a separate root word) on output.  Examples in English are such endings as “ing”, “s”, “able”, etc.  Standard English stemming algorithms tend to focus on suffix stripping only.

Given this model, the stemmer essentially treats the stemming problem not as a linguistic problem, but as a simple instance of the shortest path problem (e.g., the traveling salesman problem), where the cost for each path can be computed from its word component and the number of characters in it.  The goal of the stemmer is to find the shortest (i.e., lowest cost) path to construct the entire word.  To do this it uses dynamic dictionaries, constructed as lexical analyzer state transition tables, to recognize the various allowable word parts (via a unique dynamic table driven lexical analyzer) for any given language in order to obtain maximum speed.

The state diagram in Figure 1 below shows all the possible paths the stemming algorithm can follow in order to stem a complete word, where each of the transitions is triggered by recognizing the next sequence of characters encountered in the word as being in either the ‘prefix’, ‘root’, ‘infix’, or ‘suffix’  part dictionaries.  If the stemmer finds a path from ‘Start’ to ‘End’ such that every letter of the word is accounted for by valid transitions in this state diagram, it has found one solution for stemming the word and it records the cost (as described later) for the entire path and then re-examines every other possible path permitted by all the fragments contained within the dictionaries provided for the language.  The algorithm always selects the lowest cost path found as the stemmed result.  If at any point in the process, the algorithm finds itself in any numeric state without an appropriate entry in the necessary part dictionary to get it to the next state, the path has failed and the algorithm falls back to examining other possible paths.  Note that it is quite possible for the same sequence of letters to appear in more than one part dictionary so the number of possible paths to examine can be quite large.


Figure 1 – State Transition Diagram for Fairweather Stemmer

To see in detail how this works, consider the word “abacterial” (meaning not bacterial).  The stemmer examines 32 possible paths to stem this word, finding five solutions to parse the complete word as follows:

[Paths listing – the five complete parses found for “abacterial”]

As a result, the algorithm picks the lowest cost path and then (as described later) maps the various fragments used to their final output, resulting in a stemming output of “not bacterium”.  As can be seen, in this case the stemmer has not only correctly identified and preserved a significant word prefix (i.e., negation), but it has also mapped the plural form bacterial into the correct singular form bacterium.  Other possible but discarded results of this stemming action given the solutions above are “abaction” and “not act”, both of which are obviously wrong since they have completely different meanings from the original word.  Note that the ‘suffix’ dictionary used by the algorithm is roughly equivalent to the behavior of the classical suffix stripping algorithms, and yields similar results if used in isolation.  In this case, without the ability to recognize prefixes (in this case “a”), the result of a classical stemming algorithm would be the equivalent of “abaction” and this would result in incorrect search results, and the loss of any connection between the word “abacterial” and bacteria in general.

This of course brings up the question of what the theoretical upper limit is on classical suffix stripping stemmer algorithm accuracy, given the fact that such algorithms can only recognize the two forms “root” and “root,suffix”; that is, their equivalent state diagram would be as shown to the right.  The answer can be determined by examining the confirmed universal stemmer outputs to see what fraction conforms to one of these two forms in the same 450,000 English word list.  The answer is that only 62% of all correct stems are comprised of these simple forms.  In addition 8% of root words end in valid suffixes, and would thus probably have been over-stemmed by a suffix stripper.

This means that the theoretical maximum accuracy of a suffix stripping algorithm is somewhere between 54% and 62%, regardless of how sophisticated it is.  For other languages, the theoretical limit would generally be significantly lower.  This accuracy is also the upper limit on search accuracy for any system using such a stemmer, and this is clearly far from ideal.  It is for this reason that MitoSystems developed the Fairweather stemmer algorithm, in order to ensure that Mitopia’s search accuracy was as close as possible to 100%, regardless of the language being used to search.

We will discuss in a lot more detail later exactly how the algorithm performs this stemming process, but before we examine the Fairweather stemmer algorithm in detail, we must first understand the function and operation of Mitopia’s stemming environment itself, for which the Fairweather stemmer is simply one of a number of possible stemming algorithms for use on a given language.

Detailed Description of Stemming Framework

Mitopia’s stemming framework is implemented as a plug-in architecture where each distinct language to be stemmed must register the necessary plug-ins in order to implement the complete suite of logical behaviors required by Mitopia® in order to support stemmed database query.  The stemming framework supports the language specific registered plug-ins listed below, which are sufficient to allow the framework to invoke all stemming behaviors that are language specific, while allowing all other logic to be handled by the Mitopia® stemming framework itself.  This greatly reduces the complexity of creating and connecting a stemmer for a new language.  These plug-ins are registered symbolically with the framework during program initialization by language name and plug-in name, where the plug-in symbolic names are as follows:

  1. “makeStemmerContext” – (typedef QX_MakeContextFn) This plug-in is called during framework initialization for each language registered.  The framework provides and maintains a language-specific stemmer context allocation which is passed to all plug-ins for that language when they are subsequently invoked.  This context structure can be used by the stemmer to store whatever information and dictionaries it needs as part of its operation.  Context allocations must start with a “basic” context structure containing information needed by the framework itself, and which is sufficient to support operation of the Fairweather stemmer, without the need for any additional context, in cases where no other stemming algorithm is available for the target language.
  2. “killStemmerContext” – (typedef QX_KillContextFn) This plug-in is called by the framework during framework termination for each language registered.  The purpose of this callback is to allow cleanup of whatever custom context and allocations were stored by the stemmer during the corresponding “makeStemmerContext” call.
  3. “invokeStemmer” – (typedef QX_InvokeStemmerFn) This plug-in actually invokes the stemmer(s) registered for the language by calling QX_UniversalStemmer(), and allows any necessary language specific pre- and post- processing to be performed on the text.  For UTF-8 text streams pre- and post- processing is rarely required.
  4. “scanWord” – (typedef ET_ScanWordFn) This plug-in is called by the framework in order to extract the next word in the language from the incoming text stream, in order to pass that word to the language-specific stemmer(s) for stemming.  Word scanning is preferably accomplished by a call to the lexical analyzer described later.  The only difference between one language and the next is essentially the range of valid UTF-8 characters that are allowed in the language involved.
  5. “prepareForIndexing” – (typedef ET_PrepareFn) This plug-in is called by the framework on a block of text to be stemmed prior to beginning the word-by-word stemming process.  The purpose of “prepareForIndexing” is to scan through the text converting it all to lower case (if appropriate in the script involved) while preserving and protecting from stemming any proper names, acronyms and other specialized words that should not be stemmed.  Words that are to be protected are wrapped in the source text by inserting enclosing delimiter sequences, thus causing the stemming algorithm to skip over the word without processing it.  This plug-in is designed to overcome the tendency of conventional algorithmic stemmers to stem words that should be left unaltered; for example, without this step the proper name “Hastings” will be stemmed by the “Porter” stemmer to “hast”, which will be conflated with words derived from the root “haste”, resulting in incorrect search results.
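To make the registration model concrete, a hypothetical initialization sequence for one language might look like the following; the registration function name and signature are invented for illustration, and only the plug-in symbolic names and typedefs above come from the framework:

    /* Hypothetical registration of the five plug-ins for one language */
    void registerFrenchStemmer(void)
    {
        QX_RegisterPlugIn("French", "makeStemmerContext", (void *)myMakeContext);
        QX_RegisterPlugIn("French", "killStemmerContext", (void *)myKillContext);
        QX_RegisterPlugIn("French", "invokeStemmer",      (void *)myInvokeStemmer);
        QX_RegisterPlugIn("French", "scanWord",           (void *)myScanWord);
        QX_RegisterPlugIn("French", "prepareForIndexing", (void *)myPrepare);
    }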

The “invokeStemmer” Plug-in

Given these plug-ins, which will be discussed in detail later, other than housekeeping functions there is just one key function that represents the primary interface to the Mitopia® stemming framework: QX_UniversalStemmer().  This function is invoked by the language specific “invokeStemmer” plug-ins in order to perform the actual stemming operation.  Regardless of language, “invokeStemmer” plug-ins are typically virtually identical, and would appear as shown in Listing 1 below.


Listing 1 – Form of a typical “invokeStemmer” framework plug-in

The additional parameters implied by the ellipsis parameter “…” in the function prototype above include such things as references to the inverted file index (if appropriate), and additional context information that is not relevant to describing the algorithm itself.  The function QX_MapRootsToEnglish() is provided by the framework and uses the root word mapping dictionary described below in order to convert any text in the output buffer to the equivalent English text wherever a root word mapping can be found from the native language stemmed root to an English root word(s).  The purpose of this logic is to implement the conflation of roots into a common cross-language root set, which in Mitopia® has been chosen to be the English roots.  The form of the “invokeStemmer” plug-in allows it to be called from a database subsystem during indexing, and furthermore allows the creation of both a native roots index file (to allow language specific search) as well as a mapped roots index file (to allow cross-language search) simply by calling the stemming plug-in twice with different parameters.  The plug-in is also called on query terms in order to convert the query words into the appropriate stemmed equivalents so that the database can efficiently search the inverted index file(s).  The code for all non-English “invokeStemmer” plug-ins is virtually identical, and for this reason the function QX_NonEnglishStemmer() has been supplied which performs all the basic necessary logic (as in Listing 1) including the call to QX_UniversalStemmer(), thereby reducing the code for any non-English stemmer to a single function call (with the appropriate parameters!).
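Since Listing 1 is reproduced here only as an image, its general shape is roughly as follows; the signatures are simplified guesses, and only QX_UniversalStemmer() and QX_MapRootsToEnglish() are names taken from this post:

    /* Sketch of a typical "invokeStemmer" plug-in: stem in the native
       language, then optionally map the roots to English so that the
       cross-language index can be built from the same call. */
    static int myInvokeStemmer(ET_BasicStemCont *ctx, const char *text,
                               char *stemmed, int mapToEnglish /* , ... */)
    {
        int err = QX_UniversalStemmer(ctx, text, stemmed /* , ... */);
        if (!err && mapToEnglish)
            err = QX_MapRootsToEnglish(ctx, stemmed /* , ... */);
        return err;
    }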

The QX_UniversalStemmer() Function 

The function QX_UniversalStemmer() performs the following logical steps in order to combine the results from one or more stemming algorithms for the language concerned, as well as handle basic dictionary mapping to improve current generation stemmer behavior (not necessary for the Fairweather stemmer algorithm):

  1. Check in the phrases dictionary (if present) and if recognized, replace by the phrase meaning text.
  2. Check in the tense dictionary and if recognized, replace by current tense root and continue.
  3. Check in the irregular plural form dictionary and if recognized, replace by the singular root and continue.
  4. Check in the fixup dictionary and if recognized, replace by the mapped form and continue.
  5. If any change made above, return.
  6. Temporarily truncate the input at the first word boundary (if more than one word present).
  7. Check the stop-word dictionary and if found, discard/ignore.
  8. If the word is less than ‘minLength’ letters, don’t stem it.
  9. Check if the word exists in the stemmed word dictionary and if it does, don’t stem it.
  10. Check against history as follows:
    1. If ‘cache’ supplied then
      1. If the word is already in the cache, return the previous cached result
    2. Else check if the word already exists in the stemmed inverted file(s), if it does, don’t stem it.
  11. Stem the word in parallel using each algorithm supplied in the ‘algorithms’ list and if a stemming action results (the acceptance logic is sketched after this list):
    1. If the algorithm is the “Fairweather” stemmer
      1. Accept the stem and return the result (this stemmer is reliable!)
    2. Else if a word list dictionary is present:
      1. if stem word in dictionary OR original word not in dictionary
        1. accept the potential stem
      2. else
        1. discard the potential stem
    3. Else accept the potential stem
  12. If no change or stemming occurred, done.
  13. Check the dictionary of root words for each of the stemmed results and if any is found, take the longest one.
  14. Otherwise take the shortest stem.
  15. If there is a stemming cache in effect update it with the results
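The acceptance logic of step 11 is compact enough to sketch; the types and helper names below are invented, and only the decision rules come from the list above:

    #include <stdbool.h>

    typedef struct Dict Dict;                     /* hypothetical lookup dictionary */
    extern bool inDict(const Dict *d, const char *word);

    /* Decide whether to accept the stem proposed by one algorithm (step 11) */
    static bool acceptStem(bool isFairweather, const char *word,
                           const char *stem, const Dict *wordList)
    {
        if (isFairweather)            /* 11.1: this stemmer is reliable, accept */
            return true;
        if (wordList)                 /* 11.2: sanity-check classical stems     */
            return inDict(wordList, stem) || !inDict(wordList, word);
        return true;                  /* 11.3: no word list, accept             */
    }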

The various dictionaries required by this logic are either passed to the routine as parameters, or they are passed through the basic stemmer context record described below as set up by the “makeStemmerContext” plug-in call.

Note the fact that the input is not internally truncated to a single word until after the three mapping dictionaries have been applied.  This allows for things like thesaurus compaction as well as dealing with the possibility of multi-word plural forms (e.g. “will be” becomes “is”).  The ‘charsConsumed’ result can be used by the caller to determine how many characters have actually been consumed from the input stream (see the description above for the “invokeStemmer” plug-in).  This function is completely script and language independent and merely performs dictionary and logical operations as part of the stemming framework, calling registered stemming algorithms as specified by an ‘algorithm’ list set up by the “makeStemmerContext” plug-in.  For these reasons, the function itself is completely multilingual and invariant across all languages.  The implementation and operation of the various dictionaries will be discussed later.  The beneficial effects provided by this federated approach to stemming include the following:

  1. The framework can handle irregular plural forms (e.g., men and man).
  2. The framework can handle irregular tense forms (e.g., teach and taught).
  3. The framework can handle phrase mapping (e.g., “when hell freezes over” becomes “never”).
  4. The framework handles caching results thereby improving performance of any algorithm regardless of algorithm speed.
  5. The framework chooses the best result from all available stemmers allowing multiple algorithms to be combined for improved results.  In Table 1, we see that the output from the Porter stemmer (not the Fairweather stemmer) was used for the word “abjurdicating” (which in this case is actually a spelling error) since the Fairweather stemmer did not produce a result.  This means that even if the dictionaries for the Fairweather stemmer are incomplete, reasonable stemming behavior is still possible by adding another stemmer algorithm as a backup that is only invoked when the primary algorithm fails.
  6. The framework handles lookup of words in the existing stemmed inverted index.  This is critical in English, for example, when a word starts a sentence and thus begins with an upper case letter, since this behavior allows the system to correctly determine that the word is identical to its lower case variant occurring elsewhere in any preceding sentence, and thus avoids stemming errors as a result.
  7. The framework can handle multi-word variants (e.g., “will be” becomes “is”).
  8. The framework can implement a “thesaurus” mapping which causes different words with similar meanings to be conflated.  This can be used to provide another level of meaning search so that for example a search for “chair” would also return hits for “sofa”, “recliner”, “stool” etc.  Such aggressive conflation may not be desirable in all cases, but in certain applications might provide considerably improved results.
  9. The framework can handle common abbreviations for example the sequence TTYL appearing in instant messaging exchanges can be mapped to “talk to you later”.

 

The execution time for the function QX_UniversalStemmer() is approximately 3 times that of the Porter algorithm itself, which means that the overhead to obtain this improved federated performance (even if not using the Fairweather stemmer) is small and can have a significant impact on search accuracy.

In a very real sense then, Mitopia’s stemming framework can be seen as a first generation effort to overcome the limitations of classical suffix stripping stemmer algorithms by providing the necessary dictionary lookups to handle the language exceptions, as well as allowing multiple algorithms to be executed in parallel and choosing the best results of all available algorithms.  This is much like the  “Hybrid Approach” stemming algorithms described in the introduction with the added ability to use multiple algorithms.

Later on in Mitopia’s history, the Fairweather stemming algorithm was added to the registered set for each language, and if a Fairweather stemmer is available for the language and fully trained (as described later), its accuracy is close to 100%.  Even if it is not fully trained, but has a complete list of suffixes, the algorithm can give results equal to those of a registered classical algorithm without the need to write any code!  This removes much of the motivation to set up and maintain some of the other dictionaries, with the notable exception of proper name dictionaries, and dictionaries that map multiple words into one such as the “phrases” dictionary.  The dictionaries associated with the Fairweather stemmer itself are quite capable of handling the functions provided by all the other framework dictionaries.  This means that if a new language is to be added and no classical stemming algorithm is available to be registered, the preferred approach is to simply develop and train a Fairweather stemmer for the language.  At the present time classical stemming algorithms are only registered for the English language, though at one time a classical Arabic stemmer was also registered.  It is probable therefore that going forward, there will be no need to register additional classical stemmers for any language.

Standard Dictionaries used by the Stemmer Framework

The purpose and need for the various standard dictionaries supported by the framework (which can be seen in the declaration of the public type ET_BasicStemCont) is summarized in Table 1 (below).

In this table, the “L/D” column indicates if the dictionary involved is simply a lookup recognizer “L” (i.e., a Boolean true/false) result, or is instead a dictionary “D” which maps any entry found to a corresponding output sequence (which may be empty).

In addition to these dictionaries, which are passed through the basic stemmer context, the function QX_UniversalStemmer(), which forms the basis of all language specific stemmers within Mitopia®, can be passed, via its parameter list, additional dictionaries alluded to in the logic sequences described above.  These dictionaries are as follows:


Table 1 – Basic Dictionaries Supported by the Stemming Framework

  1. “phrases” – This is a mapping dictionary used to map sequences of words into a single word prior to attempting any stemming algorithm.  The “phrases” dictionary can be critical in handling common phrases with known meanings, as well as cases where tense causes one word to be split into two (as in the phrase “will be” which is the future of the word “is”).  In the process of developing a stemmer, adding a phrase dictionary is probably something one does after all single word variants in the development corpus can be handled correctly.
  2. “tense” – This is also a mapping dictionary originally envisaged to handle the failure of classical stemming algorithms to deal with irregular tense forms (e.g. teach and taught, run and ran, etc.).  If the Fairweather stemmer is being used, this dictionary is probably not needed since the algorithm itself can handle this problem quite effectively.
  3. “plurals” – Like the “tense” dictionary, this mapping dictionary was originally intended to overcome the limitations of classical stemming algorithms when faced with irregular plural forms.  As with the “tense” dictionary, if a Fairweather stemmer is in use, this dictionary is not required.
  4. “fixup” – Once again this mapping dictionary was intended to overcome classical stemmer shortcomings that did not fall into the categories described above.  As with the other parameter based dictionaries described above, it may be unnecessary if the Fairweather stemmer is used.
  5. “rootsH” – This simple lookup dictionary was intended to overcome the tendency of classical stemming algorithms to “over-stem” root words that end in common suffixes (English root words that end in ‘s’ are a classic case where suffix strippers invariably over-stem).  In addition, the logic of the framework uses this root word list to distinguish between different stemming results for classical algorithms by favoring results that match a known root.  Once again, if the Fairweather stemmer is being used, this dictionary may be unnecessary.
  6. “wordListH” – As with the “rootsH” dictionary, this simple lookup dictionary is used by the framework to check various stemmer outputs against each other and favor outputs that are in the word list over those that are not.  As part of training the Fairweather stemmer, it is likely that a word list is available to use in training and so it is advisable to pass this into the framework even though it may not currently be used, since it may be used to provide additional capabilities in the future, particularly as part of spell checking applications.
  7. “stopWordsH” – This is a simple lookup dictionary for the words in the language that are essentially meaningless connector words (e.g., the, a, and etc.) and which should simply be eliminated entirely from the stemmed output stream in order to save space.  The use of stop words is still a subject of debate. There is no definite list of stop words which all natural language processing (NLP) tools incorporate. Not all NLP tools use a stop list, and some tools specifically avoid using them to support phrase searching. In Mitopia® the issue is somewhat more complicated due to the ability to search across multiple languages using a single query in any supported language.  In addition, Mitopia® creates three distinct inverted indexes for all data, an “un-stemmed” index which allows exact match search (and thus can overcome specific stop word and other issues), a native stemmed index in the language concerned, and a mapped and stemmed index in English.  This changes the tradeoffs for stop words somewhat.  The set of stop words in each language can be quite distinct from other languages and the word involved often has no equivalent when mapped to English.  Within Mitopia® each language should define a set of stop words that are specific to that language, and since when doing cross-language queries, the text is first stemmed in its native language, then root word mapped to English, and then stemmed in English, the end result will be that any English stop words generated as a result of the root mapping will also be eliminated prior to multilingual indexing.  This means that the developer of stemmer dictionaries for languages other than English does not need to include English stop words in their consideration but instead can concentrate only on those words that are language specific.

It is perhaps important to describe how this stemming framework is integrated with any database subsystem that uses it.  There are essentially just two points where stemming must be performed: the first is during the creation of the inverted index file used by the database to search text, and the second is on any text query terms passed as the result of a user query to the database.  We will gloss over the details of how this connection is made for now since it is not relevant to this post.  We will also gloss over the details of implementing the various framework plug-ins described above, since they too are not needed to reach our goal of discussing the Fairweather stemming algorithm itself, which must wait till a future post.