If you have lots of digital documents, you may already know how many duplicates or near-duplicates exist on your hard drive or on your network storage devices.
To reclaim that space, you first need to find these duplicates. Here we focus specifically on text-based documents, such as HTML, Microsoft Word, and PDF.
Some documents are exact copies (or archived exact copies), and these are usually easy to find: just calculate a good checksum for each file and compare it with the others.
But if you are involved in anything related to the document life cycle (like project development),
then many of your archived documents will be copies made at different points in that life cycle,
in other words, different versions of the same document. Usually the situation is worse than that: you may also have the same document in different formats,
such as a document created in Microsoft Word and later converted to PDF.
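The exact-copy case mentioned above can be sketched in a few lines of Java. This is only an illustration of the checksum approach, not part of the Near Duplicates Finder: the class name ExactDuplicates, the choice of SHA-256 as the "good checksum", and the use of Java 8 APIs are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the Near Duplicates Finder.
final class ExactDuplicates {

    // Returns the SHA-256 digest of a file as a lowercase hex string.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the stream is what updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Groups all regular files under root by checksum and keeps only
    // the groups that contain two or more files (the exact copies).
    static Map<String, List<Path>> findDuplicates(Path root)
            throws IOException, NoSuchAlgorithmException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        Map<String, List<Path>> byChecksum = new HashMap<>();
        for (Path file : files) {
            byChecksum.computeIfAbsent(sha256(file), k -> new ArrayList<>()).add(file);
        }
        byChecksum.values().removeIf(group -> group.size() < 2);
        return byChecksum;
    }

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findDuplicates(root).forEach((sum, group) -> System.out.println(sum + " " + group));
    }
}
```

A checksum like this only catches byte-for-byte copies. Two versions of the same document, or the same text saved as Word and as PDF, produce unrelated checksums, which is why near-duplicate detection needs a different kind of fingerprint.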
A similar problem exists for search engines, especially for global search engines like Google or Bing. The spider finds lots of documents
that are duplicates, near-duplicates, or different formats of the same document.
Each big search engine has tools to deal with this situation. Duplicate or near-duplicate documents can be discarded by the search engine,
or presented as a link to similar documents instead of littering the search results.
What makes matters even worse for the search engine is that it has to decide almost instantly whether a document has near-duplicates among millions or even billions
of other documents, as spiders keep crawling the web and new documents keep arriving.
The problem of Near-Duplicate Detection also relates to Plagiarism Analysis and Authorship Identification.
Near Duplicates Finder
The Near Duplicates Finder software is a Java program that finds duplicates and near-duplicates of text documents based on the internal text of each document and provides
a report for future action. For example, you can automatically delete the duplicates it finds. Another option is to run the Cluster Finder, which will report the clusters of discovered near-duplicate documents. Click here for more information about the Cluster Finder.
The Near Duplicates Finder works with different types of documents, including Plain Text, HTML, XML, PDF, Microsoft Office, OpenOffice, RTF, etc.
Applying a patented method, it extracts the text from documents and creates a fingerprint (or signature) for each document, which makes it possible to quickly find duplicates or near-duplicates of that document.
The program uses a database index to store the document fingerprints, so detecting near-duplicates in large collections should be very fast*.
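The patented fingerprinting method itself is not described here, so the following Java sketch does not reproduce it. Instead it shows a generic, well-known way to score how similar two texts are: split each text into overlapping phrases of n words (often called shingles) and compute the Jaccard similarity of the two phrase sets. The class and method names are hypothetical, and any resemblance between n and the -gram option, or between the result and the -score option, is only an assumed analogy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Generic word-shingle similarity for illustration only.
// This is NOT the patented method used by the Near Duplicates Finder.
final class ShingleSimilarity {

    // Lowercases the text, splits it into words on any run of characters
    // that are not letters or digits, and returns the set of n-word phrases.
    static Set<String> shingles(String text, int n) {
        List<String> words = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    // Two empty sets are treated as identical (similarity 1.0).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        String first = "The quick brown fox jumps over the lazy dog";
        String second = "The quick brown fox leaps over the lazy dog";
        double score = jaccard(shingles(first, 2), shingles(second, 2));
        System.out.println("similarity = " + score);
    }
}
```

Comparing every pair of documents this way does not scale to large collections, which is the reason a tool of this kind stores compact fingerprints in an indexed database rather than full phrase sets.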
The Near Duplicates Finder runs on the Java platform, version 1.5 or later, and can be used on Windows, Mac, UNIX, Linux, etc.
The home page for the Near Duplicates Finder software can be found on the SoftCorporation LLC
web site: http://www.softcorporation.com/products/neardup
There you can also find the latest release, as well as any other information you might need regarding this project.
Click here to Download the evaluation version of the Near Duplicates Finder.
It is a fully working version, and you can run it free of charge for evaluation purposes.
For commercial use of the software, please contact us by email: email@example.com.
To run the software, you need to download the jar file and the required third-party libraries listed here.
We recommend putting the third-party libraries in a "lib" directory located in the same folder as the neardup-x.x.x.jar file.
You need to have JVM 1.5 or later installed on this computer.
For Microsoft Windows you can also download the zipped run.bat file,
unzip it, and enter the command: run DIR_WITH_DOCS, where DIR_WITH_DOCS is the directory containing the documents
that you want to check for duplicates or near-duplicates.
If you start run.bat without parameters, it should produce the following output:
Invalid input parameters: Invalid number of arguments
Near Duplicates Finder v.0.1.1
Usage: java com.softcorporation.neardup.DuplicatesFinder parameters ...
Parameters format: -parameter [value]
-start filename[,filename] directory / file(s) to search for duplicates (mandatory)
-report filename report file (by default report goes to ./report.log file)
-score the score to report the duplicate (default is 0.6)
-onlynew find the duplicates only for new documents
-gram number of words in a phrase
-purge clear files list from past runs
-db location of db directory with files list
-delete criteria remove duplicates by criteria (old, new, small, large)
-deletepath pattern remove only matching pattern files (mandatory for delete)
-verbose display progress information (on standard output)
Example: Find duplicates from text files in directory 'docs' and save report in 'report.log'
java com.softcorporation.neardup.DuplicatesFinder -start docs -report report.log
For more information visit web site: http://www.softcorporation.com/products/neardup
Check the classpath if you do not get similar output (see the example classpath setting in the run.bat file).
The classpath should include all the jar files listed above. You also need to make sure the software has permission to write to the local directory,
where it will save the report and (more importantly) the database in the directory ./db.
The database is created automatically, and on every subsequent run the software uses the existing database and compares new documents with the documents processed in previous runs
and already stored in the database. You can simply delete the directory ./db to start the comparison process all over again.
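For platforms without run.bat, the classpath setup described above can be sketched as a small Java launcher. This is a hypothetical helper, not something shipped with the product: the class name NearDupLauncher is invented, the jar path is passed in as an argument because the version in neardup-x.x.x.jar is not fixed here, and it assumes that a java launcher is on the PATH and that the third-party jars sit in a "lib" directory under the current directory. The main class name and the -start and -report options are taken from the usage output above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical launcher for illustration; not shipped with the Near Duplicates Finder.
final class NearDupLauncher {

    // Main class name as printed in the program's usage message.
    static final String MAIN_CLASS = "com.softcorporation.neardup.DuplicatesFinder";

    // Builds a classpath string: the main jar first, then every *.jar in libDir
    // in sorted order, joined with the platform's path separator.
    static String buildClasspath(Path mainJar, Path libDir) throws IOException {
        List<String> entries = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                entries.add(jar.toString());
            }
        }
        Collections.sort(entries);
        entries.add(0, mainJar.toString());
        return String.join(File.pathSeparator, entries);
    }

    // Assembles the command line, using only options shown in the usage output.
    static List<String> buildCommand(String classpath, String docsDir, String reportFile) {
        return Arrays.asList("java", "-cp", classpath, MAIN_CLASS,
                "-start", docsDir, "-report", reportFile);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: NearDupLauncher <path to neardup jar> <docs directory>");
            return;
        }
        String classpath = buildClasspath(Paths.get(args[0]), Paths.get("lib"));
        // The child process inherits the working directory, so ./db and the
        // report are written relative to where this launcher is started.
        Process process = new ProcessBuilder(buildCommand(classpath, args[1], "report.log"))
                .inheritIO()
                .start();
        System.exit(process.waitFor());
    }
}
```

Because the database and the report are written relative to the working directory, it is simplest to start the launcher from a directory where you have write permission.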
Note: The current version of the Near Duplicates Finder was designed for relatively large documents and may not work very well with small documents (less than 1 KB in size).
This limitation can be easily removed.
For legal and licensing issues, please read the LICENSE.TXT file. This product uses Derby, Tika and Log4J Java Software developed by
The Apache Software Foundation (http://www.apache.org/).
See Apache License: LICENSE-3RDPARTY.TXT.
* This statement has not been verified for very large collections: so far the Near Duplicates Finder has been tested only with hundreds of thousands of documents.
However, we expect it to work with much larger numbers.
For more information, send a request to: firstname.lastname@example.org