Java Suggester. More than just a spell checker.
1. What is the Suggester software?
In short, it is a Java program that provides suggestions for unknown (misspelt) words based on a custom dictionary. A system administrator can create a list of preferred words and assign a higher weight to that list. In its basic implementation, Suggester can serve as a spellchecker; in this case all words have the same weight. But don't be misled by the words "basic implementation": it includes a high-speed suggestion engine based on a fast edit-distance calculation algorithm, enhanced with Lawrence Philips' Metaphone algorithm and a proprietary fuzzy-matching algorithm.
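To make the term "edit distance" concrete, here is a minimal sketch of the textbook Levenshtein distance. It is not the Suggester's own algorithm (which is not published); the class and method names are hypothetical and chosen only for illustration.

```java
public class EditDistanceSketch {

    // Classic Levenshtein distance computed with two rolling rows.
    // Illustrative only: this is NOT the Suggester's internal algorithm.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mispelt", "misspelt")); // prints 1
        System.out.println(distance("kitten", "sitting"));   // prints 3
    }
}
```

A suggestion engine would typically keep only dictionary words within a small distance of the unknown word and then rank them by further criteria such as phonetic similarity.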
2. How can I use the Suggester software?
- Spellchecker.
- Search engine suggestions, based on your custom word list.
- Human resources departments: create an index of your employee names to suggest proper spelling.
- Medical field: for example, drug name suggestions.
- Misspelt word suggestions in any other field that requires custom dictionaries.
3. Advantages
High dictionary compression:
The word dictionary is compressed not only on the hard drive, but also in memory.
The basic UK English dictionary contains about 57,000 words and has a size of about 90 KB.
The full English dictionary contains about 200,000 words (including names, abbreviations, geographic places, etc.); it takes a 236 KB file on the hard drive and about 2 MB of memory. Other languages compress even better.
For example, the full Russian dictionary contains more than 1,300,000 words (including variants); it takes a 315 KB file on the hard drive and, again, about 2 MB of memory. Compared with the original word list file in UTF-8 format, which is more than 30 MB in size, the compression ratio is close to 1% (1.04% to be precise)!
High dictionary search and suggestion selection speed:
Case-dependent / case-independent dictionary look-up takes about 0.002 / 0.005 ms per word, which comes to about 500,000 / 200,000 words per second. Suggestion search averages about 40 ms per set of suggestions for each unknown word on a Pentium M 1.4 GHz (with high-quality suggestions).
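The conversion from per-word latency to throughput quoted above is a simple reciprocal. The sketch below only reproduces that arithmetic; it does not measure anything, and the class and method names are hypothetical.

```java
public class LookupThroughput {

    // Converts an average latency in milliseconds per operation
    // into operations per second (1 second = 1000 ms).
    public static long perSecond(double millisPerOperation) {
        return Math.round(1000.0 / millisPerOperation);
    }

    public static void main(String[] args) {
        System.out.println(perSecond(0.002)); // case-dependent look-up: prints 500000
        System.out.println(perSecond(0.005)); // case-independent look-up: prints 200000
        System.out.println(perSecond(40.0));  // suggestion sets: prints 25
    }
}
```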
Portability:
The Suggester software is written entirely in Java 1.2.
Runs on any Java® platform: Windows®, Mac OS®, Unix, Linux. Tested on JRE 1.2, 1.3, 1.4, 1.5.
Dictionary retains original word list:
The dictionary's internal structure supports UTF-8 encoding and keeps all original words in case-sensitive form.
Did we mention that the Basic Suggester is free? Yes it is.
4. Where to get it?
The home page for the Suggester project can be found on the SoftCorporation LLC web site: http://www.softcorporation.com/products/suggester.
There you can also find information on how to download the latest release, as well as all other information you might need regarding this project.
Click here to Download Free Basic Suggester.
5. Requirements
o A Java 1.2 or later compatible virtual machine for your operating system.
o To run the Index Builder you may need up to 512 MB (or more) of virtual memory.
6. Basic, Advanced and Enterprise versions of Suggester software
There are 3 different versions of Suggester software:
o Basic Suggester - (free) uses one dictionary, where all words have the same weight.
The Suggester Spell Check uses Basic Suggester.
o Advanced Suggester - (commercial) can use multiple dictionaries with different weights assigned to each dictionary. Also supports multiple languages.
o Enterprise Suggester - (not ready for distribution) includes all features of Advanced Suggester and also provides content-dependent suggestions.
7. What is the Index Builder?
The Index Builder (part of Advanced Suggester) creates a compressed index from your word list.
It uses significantly more memory than the part that provides suggestions.
For example, to compile a Polish dictionary containing more than 3 million words, the Index Builder uses about 300 MB of memory.
If the word list is already sorted, this requirement goes down significantly.
Index compilation itself is fast. For example, on a laptop (Pentium 1.5 GHz), compiling the Polish dictionary takes less than 5 sec. However, reading the words file, converting it to UTF-8 encoding and sorting all the words takes more than 20 sec.
[Figure: Polish dictionary compilation]
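Since a sorted word list reduces the Index Builder's memory requirement, it can be worth preparing the list ahead of time. The following is a minimal sketch of such a preprocessing step, not part of the Suggester itself: it reads one word per line in a given source encoding, sorts the words, and writes them out as UTF-8. The class name, the one-word-per-line layout, and the use of natural String ordering are assumptions; check which sort order the Index Builder actually expects.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WordListPreSort {

    // Reads one word per line in the given source encoding, drops blank lines,
    // sorts the words, and writes them to the output file as UTF-8.
    // Case is preserved; nothing is lower-cased or de-duplicated.
    public static List<String> sortToUtf8(Path in, Charset inCharset, Path out)
            throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(in, inCharset)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        Collections.sort(words); // natural String order (an assumption)
        Files.write(out, words, StandardCharsets.UTF_8);
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WordListPreSort <input-file> <input-charset> <output-file>
        List<String> sorted = sortToUtf8(
                Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
        System.out.println("Wrote " + sorted.size() + " words");
    }
}
```

For very large lists, reading everything into memory may itself be costly; an external sort would be the usual alternative.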
8. What are the Suggester Configuration files?
The Suggester can be configured to fit your requirements.
a) BasicSuggester Configuration file:
By default the file is located on the classpath at: com/softcorporation/suggester/basicSuggester.config
Parameters:
LENGTH_MIN_ED_1 - minimum word length to apply edit distance = 1.
LENGTH_MIN_ED_2 - minimum word length to apply edit distance = 2.
LENGTH_MIN_ED_3 - minimum word length to apply edit distance = 3.
LENGTH_MIN_ED_4 - minimum word length to apply edit distance = 4.
WEIGHT_EDIT_DISTANCE - edit distance weight for results sorting.
WEIGHT_SOUNDEX - soundex or metaphone weight for results sorting.
WEIGHT_LENGTH - weight for results sorting when the word length is different.
WEIGHT_LAST_CHAR - weight for results sorting when the last character is different.
WEIGHT_FIRST_CHAR - weight for results sorting when the first character is different.
WEIGHT_FIRST_CHAR_UPPER - weight for results sorting when the first character is not in upper case.
WEIGHT_FIRST_CHAR_LOWER - weight for results sorting when the first character is not in lower case.
WEIGHT_ADD_REM_CHAR - weight for results sorting when characters are added or removed.
WEIGHT_FUZZY_PHON - fuzzy matching weight for results sorting.
WEIGHT_JOINED_WORD - joined word weight for results sorting.
SEARCH_JOINED - search for joined words.
REMOVE_JOINED_VARIATIONS - remove joined variations.
JOINED_WORD_LENGTH_MIN - minimum joined word length.
JOINED_WORD_LENGTH_EDT - minimum joined word length to consider edit distance = 1.
CLOSE_WORDS_CUT - remove unrelated suggestions.
DELIMITERS - word delimiters.
DELIMITERS_JOINED - joined words delimiters.
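One plausible reading of the LENGTH_MIN_ED_n parameters is that a word must be at least LENGTH_MIN_ED_n characters long before candidates at edit distance n are considered. The sketch below illustrates that reading only; it is an assumption about the semantics, not the Suggester's code, and the threshold values used in main are hypothetical rather than the shipped defaults.

```java
public class EditDistanceLimits {

    // lengthMinEd[n - 1] holds the value of LENGTH_MIN_ED_n.
    private final int[] lengthMinEd;

    public EditDistanceLimits(int ed1, int ed2, int ed3, int ed4) {
        this.lengthMinEd = new int[] {ed1, ed2, ed3, ed4};
    }

    // Returns the largest edit distance n whose minimum word length
    // is satisfied by the given word, or 0 if even n = 1 is not allowed.
    public int maxEditDistance(String word) {
        int max = 0;
        for (int n = 1; n <= lengthMinEd.length; n++) {
            if (word.length() >= lengthMinEd[n - 1]) {
                max = n;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical thresholds, chosen only for this example.
        EditDistanceLimits limits = new EditDistanceLimits(2, 5, 9, 13);
        System.out.println(limits.maxEditDistance("a"));          // prints 0
        System.out.println(limits.maxEditDistance("cat"));        // prints 1
        System.out.println(limits.maxEditDistance("hello"));      // prints 2
        System.out.println(limits.maxEditDistance("dictionary")); // prints 3
    }
}
```

Raising the thresholds keeps short words from matching many unrelated candidates, at the cost of fewer suggestions for them.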
b) Language Configuration files:
The Fuzzy matching algorithm uses these files to select the best suggestion for the language.
The file name should follow the format: LANGUAGE.config.
By creating your own language files, you can add more languages to the Suggester.
The files are located on the classpath at: com/softcorporation/suggester/language/
Parameters:
LANGUAGE - the language identifier.
S1=S2:80[,Sn:##] - the relation (here it is 80) between strings S1 and S2, usually representing letters.
The strongest relation = 100 (default).
All letters of the language should be listed in the file, even if a letter has no relation to any other.
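To illustrate the relation line format S1=S2:80[,Sn:##], here is a minimal parsing sketch. It is not the Suggester's own parser. The class name and the sample line are hypothetical, and two behaviours are assumptions: a target listed without a number gets the default relation of 100, and a line with nothing after "=" (or no "=" at all) declares a letter with no relations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelationLineParser {

    public static final int DEFAULT_RELATION = 100; // strongest relation

    // Returns the left-hand string S1 of a relation line.
    public static String source(String line) {
        int eq = line.indexOf('=');
        return (eq < 0 ? line : line.substring(0, eq)).trim();
    }

    // Returns the related strings and their relation values, in file order.
    public static Map<String, Integer> relations(String line) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int eq = line.indexOf('=');
        if (eq < 0) {
            return result; // assumed: a letter listed with no relations
        }
        String rhs = line.substring(eq + 1).trim();
        if (rhs.isEmpty()) {
            return result;
        }
        for (String part : rhs.split(",")) {
            String item = part.trim();
            int colon = item.lastIndexOf(':');
            if (colon < 0) {
                result.put(item, DEFAULT_RELATION); // assumed default
            } else {
                result.put(item.substring(0, colon).trim(),
                        Integer.parseInt(item.substring(colon + 1).trim()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "c=k:80,s:60"; // hypothetical sample, not from a shipped file
        System.out.println(source(line));    // prints c
        System.out.println(relations(line)); // prints {k=80, s=60}
    }
}
```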
9. Why is there no source code?
The Suggester source code is not published. This may change in the future, but not for now. There are several reasons. First, the code is not ready for publishing and contains little explanation of how it works, so it might raise more questions than answers. Second, it is not that difficult a program, and if you try to write your own, you may end up with even better software, which will benefit all of us. Last (unfortunately, we have encountered this in the past), there are individuals who will slightly modify the code and publish it under their own copyright, without even mentioning Softcorporation as the original author.
This is why one of the license requirements is not to decompile the code from the class files.
We would like to add that Softcorporation warrants that the code contains no tracking components. It does not attempt to connect back to the Softcorporation web site, and it does not use any advertisements or other commercial tricks. If you are using the Suggester software, the only way for us to know about it is your email.
10. Java Code Samples
Java code samples are included in the download package. Click on a link for more information on How to use the Suggester.
11. HTML Examples
The Suggester software includes a free spell-checker, which you can test here:
Click here to run English Spell Check test.
Click here to run Russian Spell Check test.
12. Dictionaries
Dictionaries are included with the free spell-checker, which you can download from here:
Suggester Spellcheck.
13. Release Notes
- 10 Jun, 2006. Initial release 1.0.0.
- 21 Oct, 2007. Release 1.1.2.
- 01 Feb, 2008. Release 1.1.3. Language configuration files update.
14. Licensing and Legal Issues
For legal and licensing issues, please read the LICENSE.TXT file.
Java (TM) is a trademark of Sun Microsystems, Inc.