Open Source Java Suggester.
1. What is the Suggester software?
The Suggester is an open source Java program that provides suggestions for unknown (misspelt) words based on a custom dictionary.
In its basic form the Suggester can serve as a spellchecker. In this case all words have the same weight.
It includes a high-speed suggestion engine, based on a fast edit-distance calculation algorithm enhanced with Lawrence Philips' Metaphone algorithm and an in-house fuzzy-matching algorithm.
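As a rough illustration of the edit-distance idea, here is a minimal sketch in modern Java. It is not taken from the Suggester source: the class and method names are hypothetical, it uses the textbook Levenshtein distance rather than the Suggester's own faster algorithm, and it leaves out Metaphone and fuzzy matching entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the Suggester code base.
final class EditDistanceSketch {

    // Textbook Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary words within maxDistance of word,
    // closest first, ties broken alphabetically.
    static List<String> suggest(String word, List<String> dictionary, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String candidate : dictionary) {
            if (distance(word, candidate) <= maxDistance) {
                result.add(candidate);
            }
        }
        result.sort((x, y) -> {
            int byDistance = Integer.compare(distance(word, x), distance(word, y));
            return byDistance != 0 ? byDistance : x.compareTo(y);
        });
        return result;
    }

    public static void main(String[] args) {
        System.out.println(distance("mispelt", "misspelt"));
        System.out.println(suggest("recieve",
                java.util.Arrays.asList("receive", "recipe", "deceive", "banana"), 2));
    }
}
```

Scanning the whole dictionary like this is only practical for small word lists. A real engine would search an index instead.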
2. How can I use the Suggester software?
- Search engines: suggestions based on your custom word list.
- Human resources departments: create an index of your employee names to suggest the proper spelling.
- The medical field: for example, drug name suggestions.
- Misspelt word suggestions in any other field that requires custom dictionaries.
High dictionary compression:
The word dictionary is compressed not only on the hard drive, but also in memory.
The basic UK English dictionary contains about 57,000 words and is about 90 KB in size.
The full English dictionary contains about 200,000 words (including names, abbreviations, geographic places, etc.). It takes a 236 KB file on the hard drive and about 2 MB of memory.
Other languages compress even better.
For example, the full Russian dictionary contains more than 1,300,000 words (including variants). It takes a 315 KB file on the hard drive and again about 2 MB of memory.
Compared with the original word list file in UTF-8 format, which is more than 30 MB, the compressed file is close to 1% of the original size.
High dictionary search and suggestion selection speed:
A case-dependent / case-independent dictionary look-up takes about 0.002 / 0.005 ms per word, which corresponds to about 500,000 / 200,000 words per second. A suggestion search averages about 40 ms per set of suggestions for each unknown word on a Pentium M 1.4 GHz (with high-quality suggestions).
The Suggester software is written entirely in Java 1.2.
It runs on any Java platform: Windows, Mac OS, Unix, Linux. Tested on JRE 1.2 and up.
Dictionary retains original word list:
The dictionary internal structure supports UTF-8 encoding and keeps all original words in a case sensitive format.
Did we mention that the Basic Suggester is free? Yes it is.
4. Where to get it?
The home page for the Suggester project can be found on the SoftCorporation LLC
web site: http://www.softcorporation.com/products/suggester.
There you can also find out how to download the latest release, as
well as any other information you might need regarding this project.
Click here to Download Free Basic Suggester.
o A Java 1.2 or later compatible virtual machine for your operating system.
o To run the Index Builder you may need up to 512 MB (or more) of virtual memory.
6. Basic, Advanced and Enterprise versions of Suggester software
There are 3 different versions of Suggester software:
o Basic Suggester - (free open source) uses one dictionary, where all words have the same weight. The Suggester Spell Check uses Basic Suggester.
o Advanced Suggester - (commercial) can use multiple dictionaries with different weights assigned to each dictionary and each word. It also supports multiple languages.
o Enterprise Suggester - (not ready for distribution) has all the features of the Advanced Suggester and can also compress information at a much higher rate than the Advanced Suggester.
This is achieved by removing repeated segments of the trie that stores the dictionary information. As a result, each trie segment of the Enterprise Suggester dictionary is unique.
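The idea of keeping only unique trie segments can be sketched as follows. This is an illustration of the general technique (merging identical subtrees, as done when minimizing a trie into a directed acyclic word graph), not the Enterprise Suggester's actual implementation. All class and method names are hypothetical, and the code uses modern Java for brevity.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration; not part of the Suggester code base.
final class SharedTrieSketch {

    static final class Node {
        boolean endOfWord;
        final TreeMap<Character, Node> children = new TreeMap<>();
    }

    // Builds a plain trie: one node per distinct prefix.
    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.endOfWord = true;
        }
        return root;
    }

    // Rewrites the trie in place so that identical subtrees are stored once.
    static Node share(Node root) {
        return share(root, new HashMap<>(), new IdentityHashMap<>());
    }

    private static Node share(Node node, Map<String, Node> registry, Map<Node, Integer> ids) {
        // The signature describes the node by its end-of-word flag and by the
        // identities of its (already shared) children, visited in key order.
        StringBuilder sig = new StringBuilder(node.endOfWord ? "1" : "0");
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Node canonical = share(e.getValue(), registry, ids);
            e.setValue(canonical);
            sig.append(e.getKey()).append(ids.get(canonical)).append(';');
        }
        Node existing = registry.putIfAbsent(sig.toString(), node);
        if (existing != null) {
            return existing; // an identical segment already exists, reuse it
        }
        ids.put(node, ids.size());
        return node;
    }

    // Counts distinct node objects reachable from root.
    static int countNodes(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (seen.add(n)) {
                n.children.values().forEach(stack::push);
            }
        }
        return seen.size();
    }

    static boolean contains(Node root, String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.get(c);
            if (n == null) {
                return false;
            }
        }
        return n.endOfWord;
    }

    public static void main(String[] args) {
        Node root = build("walking", "talking", "walked", "talked");
        System.out.println("nodes before sharing: " + countNodes(root));
        System.out.println("nodes after sharing:  " + countNodes(share(root)));
    }
}
```

Look-ups work the same way before and after sharing, because only the storage of repeated segments changes, not the paths through the structure.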
7. What is the Index Builder?
The Index Builder creates a compressed index from your word list.
In the past the Index Builder was excluded from the Basic Suggester package. Not any more! You can build your own index from your list of words or phrases.
Note that the Index Builder uses significantly more memory than the classes that provide suggestions.
However, this is not a significant limitation considering the amount of RAM computers have these days.
For example, to compile the Polish dictionary, which contains more than 3 million words, the Index Builder uses about 300 MB of memory.
If the word list is sorted, this requirement goes down significantly.
The index compilation itself is fast. For example, on a laptop (Pentium 1.5 GHz) compiling the Polish dictionary takes less than 5 sec.
However, reading the word file, converting it to UTF-8 encoding and sorting all the words takes more than 20 sec.
(Figure: Polish dictionary compilation)
8. What are the Suggester Configuration files?
The Suggester can be configured to fit your requirements.
a) BasicSuggester Configuration file:
By default the file is located at the classpath: com/softcorporation/suggester/basicSuggester.config
LENGTH_MIN_ED_1 - minimum word length to apply edit distance = 1.
LENGTH_MIN_ED_2 - minimum word length to apply edit distance = 2.
LENGTH_MIN_ED_3 - minimum word length to apply edit distance = 3.
LENGTH_MIN_ED_4 - minimum word length to apply edit distance = 4.
WEIGHT_EDIT_DISTANCE - edit distance weight for results sorting.
WEIGHT_SOUNDEX - soundex or metaphone weight for results sorting.
WEIGHT_LENGTH - different word length. The weight for results sorting.
WEIGHT_LAST_CHAR - last character is different. The weight for results sorting.
WEIGHT_FIRST_CHAR - first character is different. The weight for results sorting.
WEIGHT_FIRST_CHAR_UPPER - first character is not in upper case. The weight for results sorting.
WEIGHT_FIRST_CHAR_LOWER - first character is not in lower case. The weight for results sorting.
WEIGHT_ADD_REM_CHAR - characters are added or removed. The weight for results sorting.
WEIGHT_FUZZY_PHON - Fuzzy matching. The weight for results sorting.
WEIGHT_JOINED_WORD - Joined word. The weight for results sorting.
SEARCH_JOINED - search for joined words.
REMOVE_JOINED_VARIATIONS - remove joined variations.
JOINED_WORD_LENGTH_MIN - minimum joined word length.
JOINED_WORD_LENGTH_EDT - minimum joined word length to consider edit distance = 1.
CLOSE_WORDS_CUT - remove unrelated suggestions.
DELIMITERS - word delimiters.
DELIMITERS_JOINED - joined words delimiters.
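One possible reading of the LENGTH_MIN_ED_n parameters is that a word must be at least that long before suggestions at edit distance n are considered. The sketch below shows that reading only. The class name, the method name and the threshold values are hypothetical; they are not taken from basicSuggester.config or from the Suggester source.

```java
// Hypothetical illustration; not part of the Suggester code base.
final class EditDistanceLimitSketch {

    // minLengths[n - 1] plays the role of LENGTH_MIN_ED_n.
    // Returns the largest edit distance n whose minimum word length is met,
    // or 0 if the word is too short for any edit distance.
    static int maxEditDistance(int wordLength, int[] minLengths) {
        int allowed = 0;
        for (int n = 1; n <= minLengths.length; n++) {
            if (wordLength >= minLengths[n - 1]) {
                allowed = n;
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Made-up thresholds for LENGTH_MIN_ED_1 .. LENGTH_MIN_ED_4.
        int[] minLengths = {2, 5, 9, 13};
        for (String word : new String[] {"a", "word", "suggest", "dictionary"}) {
            System.out.println(word + " -> " + maxEditDistance(word.length(), minLengths));
        }
    }
}
```

Under this reading, raising a threshold makes the search stricter for short words, where a large edit distance would match almost anything.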
b) Language Configuration files:
The Fuzzy matching algorithm uses these files to select the best suggestion for the language.
The file name should follow format: LANGUAGE.config.
By creating your own language files, you can add more languages to the Suggester.
The files are located at the classpath: com/softcorporation/suggester/language/
LANGUAGE - the language identifier.
S1=S2:80[,Sn:##] - the relation (here it is 80) between strings S1 and S2, usually representing letters.
The strongest relation = 100 (default).
All language letters should be listed in the file, even if one letter has no relations to others.
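To make the relation format concrete, here is a small sketch that parses one line of the form S1=S2:80[,Sn:##]. It is not the Suggester's own parser, and the class and method names are hypothetical. It also assumes that a relation written without a number gets the default value of 100, which is only one way to read the rule above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the Suggester code base.
final class RelationLineSketch {

    // Assumption: a relation without an explicit number uses the strongest value.
    static final int DEFAULT_RELATION = 100;

    // Returns the left-hand string S1 of a line such as "a=e:80,o:60".
    static String parseSource(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("missing '=': " + line);
        }
        return line.substring(0, eq).trim();
    }

    // Returns the related strings and their relation values, in file order.
    // For "a=e:80,o:60" the result maps e to 80 and o to 60.
    static Map<String, Integer> parseRelations(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("missing '=': " + line);
        }
        Map<String, Integer> relations = new LinkedHashMap<>();
        String rhs = line.substring(eq + 1).trim();
        if (rhs.isEmpty()) {
            return relations; // a letter listed without any relations
        }
        for (String part : rhs.split(",")) {
            String item = part.trim();
            int colon = item.lastIndexOf(':');
            if (colon < 0) {
                relations.put(item, DEFAULT_RELATION);
            } else {
                int value = Integer.parseInt(item.substring(colon + 1).trim());
                relations.put(item.substring(0, colon).trim(), value);
            }
        }
        return relations;
    }

    public static void main(String[] args) {
        String line = "a=e:80,o:60"; // made-up example line
        System.out.println(parseSource(line) + " -> " + parseRelations(line));
    }
}
```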
9. Open source.
The Basic Suggester source code is published here.
The documentation is available for the Advanced and Enterprise versions and is included in the "doc" directory of the download package.
Here is the Suggester Manual compatible with Basic Suggester version.
11. Java Code Samples
Java code samples are included in the download package. Click on a link for more information on How to use the Suggester.
12. Web Examples
The Advanced and Enterprise versions of the Suggester software allow you to create a context-sensitive spell-checker, which you can test here:
English Spell Check test
Russian Spell Check test
Virtual Keyboard for Smart TV
Wikipedia People Instant Fuzzy Search
Dictionaries are included with free spell-checker, which you download from here:
14. Release Notes
10 Jun, 2006. Initial release 1.0.0.
21 Oct, 2007. Release 1.1.2.
01 Feb, 2008. Release 1.1.3. Language configuration files update.
17 Aug, 2013. Open Source 1.1.2 Release.
15. Licensing and Legal Issues
For legal and licensing issues, please read the LICENSE.TXT file.
Basically, there are no limitations on using or redistributing the code, other than providing a reference to the original developer: SoftCorporation LLC.
Java (TM) is a trademark of Oracle Corporation.