Near-Duplicate Detection
Regular de-duplication techniques are very unforgiving. If two documents are nearly identical in content but differ by even one character, regular de-duplication will treat them as unique and will not associate the two in any way.
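To make the point concrete, here is a minimal sketch of why a one-character difference defeats exact matching. It assumes regular de-duplication works by comparing a cryptographic hash of each document's text, which is a common approach but not necessarily how any particular product does it. The function name and the sample documents are made up for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text (assumed exact-match key)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two hypothetical documents that differ only in the final character.
doc_a = "Dear client, please find our March price list attached."
doc_b = "Dear client, please find our March price list attached!"

# Exact de-duplication only groups documents whose fingerprints match.
are_duplicates = fingerprint(doc_a) == fingerprint(doc_b)
print(are_duplicates)  # False
```

Because the two fingerprints share nothing in common, an exact-match approach has no way to tell that these documents are related at all.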
Near-duplication techniques solve this problem by comparing every word in a document against the words in other documents, thereby detecting word similarities. For example, let’s say a sales employee of XYZ company likes to send template-based emails to all of his clients on a daily basis. Each email is almost identical, except that the recipient is different and some of the content has changed slightly. Regular de-duplication would never categorize these documents as duplicates, because technically they are not, but they are considered near-dupes since the content is almost exactly the same.
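The template-email scenario can be sketched with a simple word-overlap score. This is only an illustration of the general idea of comparing words between documents, using Jaccard similarity over word sets. It is not a description of our proprietary method, and the sample emails, the function names, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re

# Hypothetical cutoff: at or above this score, two documents are near-dupes.
THRESHOLD = 0.8


def words(text: str) -> set:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared distinct words divided by total distinct words."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0  # two empty documents are treated as identical
    return len(wa & wb) / len(wa | wb)


def is_near_duplicate(a: str, b: str) -> bool:
    return jaccard(a, b) >= THRESHOLD


# Two template-based emails with a different recipient and one extra word.
email_1 = "Hi Alice, our spring catalog is attached. Let me know if you have questions."
email_2 = "Hi Bob, our spring catalog is attached. Let me know if you have any questions."
unrelated = "The server maintenance window is scheduled for Saturday night."

print(is_near_duplicate(email_1, email_2))    # True
print(is_near_duplicate(email_1, unrelated))  # False
```

Comparing only the literal words, with no synonym expansion, keeps the score easy to reason about: the two template emails share almost all of their words, while the unrelated message shares almost none.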
Grouping near-duplicate documents can drastically speed up document review, because it enables a reviewer to easily compare like documents. Much like our other proprietary technologies, we built our near-duplication software from the ground up. We can compare billions of documents per day and provide ongoing near-duplicate detection for the life of a project. This means that documents processed in March will be analyzed and compared against everything processed in January and vice versa, thus keeping the near-duplicate grouping consistent. Unlike competing near-duplicate technologies, ours does not identify false positives, because we do not use synonym look-ups for every word compared. This results in clean and accurate near-duplication in a fraction of the time.
For more information about our near-duplicate technology, please send an email to questions@lksi.com.
Posted by Logik Systems
