Logik Labs

Want to see what’s cooking over at Logik Systems? Check out Logik Labs to see what interesting and new eDiscovery features and software we are developing for our clients.

While you are at it, send us any idea you have for anything eDiscovery-related. Most of our ideas come from our clients’ input, so please help us help you, and we can make the world of eDiscovery a better place!

Near-Duplicate Detection

Regular de-duplication techniques are very unforgiving. If one character is off between two documents that are almost 100% the same in content, regular de-duplication will treat those documents as unique and not associate the two in any way.

Near-duplication techniques solve this problem by comparing every word in a document against the words in other documents, thereby detecting word similarities. For example, let’s say a sales employee of XYZ company likes to send template-based emails to all of his clients on a daily basis. Each email sent is almost identical except the recipient is different and some of the content has slightly changed. Regular de-duplication would never categorize these documents as duplicate, because they technically are not duplicates, but they are considered near-dupes since the content is almost exactly the same.
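To make the idea concrete, here is a minimal sketch of word-based near-duplicate grouping. It is not our production algorithm; it assumes a simple Jaccard similarity over each document's set of words, and the function names and the 0.8 threshold are hypothetical choices for illustration only.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def similarity(a, b):
    """Jaccard similarity of the two documents' word sets (0.0 to 1.0)."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def group_near_duplicates(docs, threshold=0.8):
    """Greedy grouping: a document joins the first group whose first member
    it resembles closely enough; otherwise it starts a new group.
    The threshold is an illustrative assumption, not a tuned value."""
    groups = []  # each group is a list of document ids
    for doc_id, text in docs.items():
        for group in groups:
            if similarity(docs[group[0]], text) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups

docs = {
    "a": "Dear Bob, our spring sale starts Monday. Call me with any questions.",
    "b": "Dear Alice, our spring sale starts Monday. Call me with any questions.",
    "c": "Quarterly budget figures attached for review.",
}
groups = group_near_duplicates(docs)
```

In this toy data set the two template emails differ only in the recipient's name, so they land in the same group, while the unrelated budget email stays on its own.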

Grouping near-duplicate documents can drastically speed up and optimize document reviews, because it enables a reviewer to easily compare like documents. Much like our other proprietary technologies, we built our near-duplication software from the ground up. We can compare billions of documents per day and provide ongoing near-duplicate detection for the life of a project. This means that documents processed in March will be analyzed and compared against everything processed in January and vice versa, thus keeping the near-duplicate grouping consistent. Unlike competing near-duplicate technologies, ours does not identify false positives, because we do not use synonym look-ups for every word compared. This results in clean and accurate near-duplication in a fraction of the time.

For more information about our near-duplicate technology, please send an email to questions@lksi.com.

Posted by Andrew Wilson on December 5, 2007

Page Numbers in Text

At the request of numerous clients, we can now embed the page/Bates numbers of TIFF’d documents in the corresponding body text of each document. We do this WITHOUT performing OCR (optical character recognition) on the TIFF’d document, which means that the extracted text is completely preserved throughout the entire process. Having the page number of a document embedded in the body text can be quite useful when matching TIFF images with the corresponding extracted text, especially for 1,000+ page documents.
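As a rough illustration of what the embedded text can look like, the sketch below stamps each page's extracted text with its Bates number. It assumes the text has already been split by page; the function name, the bracketed marker format, and the zero-padded numbering are hypothetical and not a description of GridLogik's output.

```python
def embed_page_numbers(pages, prefix, start, width=7):
    """Join per-page text, placing each page's Bates number before its text.

    pages  : list of extracted text strings, one per page (assumed input)
    prefix : Bates prefix, e.g. "ABC"
    start  : Bates number of the first page
    width  : zero-padded digit count (illustrative default)
    """
    parts = []
    for offset, page_text in enumerate(pages):
        bates = f"{prefix}{start + offset:0{width}d}"
        parts.append(f"[{bates}]\n{page_text}")
    return "\n\n".join(parts)

stamped = embed_page_numbers(["first page text", "second page text"], "ABC", 1, width=4)
```

Because the marker is inserted next to the extracted text rather than produced by OCR, the original text of each page is left untouched.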

Posted by Andrew Wilson on October 1, 2007

Foreign Language Detection

Over the past year we have seen a rapid increase in the number of foreign-language projects. We aren’t entirely sure why this is, but it has given us another competitive advantage in our local market, since not many providers can accurately process foreign-language data sets.

Extracting foreign-language text is fairly simple, but detecting what language a document was written in is quite difficult. Over the past few months we have implemented and tested a very clever method to do this with almost 100% accuracy.

What does this mean for our customers? Now they can identify which documents are Korean, Chinese, Japanese, etc. and then give them to the appropriate person for review. This can dramatically speed up the review, thus saving our clients time and money.
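For readers curious how Korean, Chinese, and Japanese text can be told apart at all, here is a deliberately simple sketch. It is not the method we implemented; it assumes the text is already decoded to Unicode and just counts characters in a few well-known Unicode blocks. The function names are hypothetical.

```python
def script_of(ch):
    """Classify one character by Unicode block; None if not CJK."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "hangul"   # Hangul syllables and jamo (Korean)
    if 0x3040 <= cp <= 0x30FF:
        return "kana"     # Hiragana and Katakana (Japanese)
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"      # CJK Unified Ideographs (shared by all three)
    return None

def guess_cjk_language(text):
    """Heuristic guess: Hangul means Korean, kana means Japanese,
    ideographs alone are assumed to be Chinese."""
    counts = {"hangul": 0, "kana": 0, "han": 0}
    for ch in text:
        script = script_of(ch)
        if script:
            counts[script] += 1
    if sum(counts.values()) == 0:
        return "unknown"
    if counts["hangul"] > 0 and counts["hangul"] >= counts["kana"]:
        return "Korean"
    if counts["kana"] > 0:
        return "Japanese"
    return "Chinese"
```

A heuristic this small has obvious blind spots (a Japanese passage written only in kanji would be labeled Chinese, and it says nothing about non-CJK languages), which is part of why accurate detection is hard.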

Posted by Andrew Wilson on July 30, 2007

Foreign Characters in Concordance

Most of our customers use Concordance for document review. Concordance doesn’t officially support Unicode characters, but if it is properly configured you can easily push foreign characters into a Concordance database for your reviewing pleasure. You won’t be able to search the foreign characters, but you can at least read them.

Posted by Andrew Wilson on July 1, 2007

HTML Lotus Notes

Unlike Microsoft Outlook, Lotus Notes does NOT have individual “container” files like .MSG or .EML. This is a problem when processing Lotus email databases for native file eDiscovery purposes. Most vendors get around this issue by converting/migrating Lotus Notes databases (.nsf) to MS Outlook databases (.pst), so they can provide emails on an individual basis.

We do it differently. We extract the Lotus data natively (meaning directly from Lotus) into universally readable .HTML files, which means our customers can open the files in their favorite web browser. The HTML versions look almost exactly like the original Lotus messages, with some slight formatting differences. This process is FAST: 3-4 times faster than MS Outlook extraction, and it provides much better results.

Posted by Andrew Wilson on May 1, 2007

Summation eDII Files

You ask and you shall receive…eDII files. For our Summation customers, we can now provide a very easy-to-use eDII file for all of their eDiscovery databases. Our eDII files are class III compliant and, with all the custom fields we provide, can make your Summation database very powerful for review teams.

Posted by Andrew Wilson on April 1, 2007

OCR-On-The-Fly is a Must

We do our best to provide a very accurate and searchable database to our customers. This means giving them a lot of useful text, but some of that text is in the form of a picture, like a fax or scanned image. These documents need to be OCR’d (processed with optical character recognition) in order to be searchable. GridLogik now OCRs all picture-based documents on-the-fly during processing.

Posted by Andrew Wilson on February 1, 2006

Embedded File Extraction

Emails can have attachments, and so can office documents like MS Word or MS PowerPoint files. The information in these embedded attachments can be very valuable, which is why we extract them and group them with their parent documents.
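The email half of this idea can be sketched with Python's standard `email` package. This is only an illustration of parent/child grouping, not GridLogik's extraction code; the `extract_attachments` function, the dictionary layout, and the `DOC0001.1` style of child ID are assumptions made for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def extract_attachments(raw_bytes, parent_id):
    """Parse a raw RFC 5322 message and return its attachments,
    each tagged with the id of the parent email it came from."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    family = []
    for index, part in enumerate(msg.iter_attachments(), start=1):
        family.append({
            "parent_id": parent_id,              # link back to the email
            "doc_id": f"{parent_id}.{index}",    # hypothetical child id scheme
            "filename": part.get_filename(),
            "data": part.get_payload(decode=True),
        })
    return family

# Usage: build a small message with one attachment, then extract it.
sample = EmailMessage()
sample["Subject"] = "Figures"
sample.set_content("See attached.")
sample.add_attachment(b"a,b\n1,2\n", maintype="application",
                      subtype="octet-stream", filename="data.csv")
family = extract_attachments(sample.as_bytes(), "DOC0001")
```

Keeping the parent ID on every child record is what lets a reviewer see an attachment alongside the email that carried it.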

Posted by Andrew Wilson on January 1, 2005

PDF Processing

Most customers want TIFF Group IV images, but occasionally some customers need PDFs instead. GridLogik can now provide searchable PDF files instead of TIFFs. Just say the word.

Posted by Andrew Wilson on January 1, 2005

3D to 2D AutoCAD Processing

We see a lot of AutoCAD documents, such as .DWG files. These files are best viewed in an AutoCAD viewer due to their usual 3D nature, but most customers just want them converted to images. So we incorporated a way to cleanly process these file types into beautiful 2D black and white images.

Posted by Andrew Wilson on January 1, 2005

Formatted Text

What is formatted text? Formatted text is text that looks, well, formatted. Most vendors in the eDiscovery space get text from documents by “printing” the text to a print driver and then saving that text as a file or in a database. Some use third-party tools like dtSearch to extract text. Both of these methods result in poorly formatted text, which is a bad thing for the people (usually attorneys) reading the text.

So, we developed a way to pull the text directly from the application that created it. This method preserves the original format of the text with correct justification, line breaks, numbers, bullets, etc. It’s a fairly small innovation, but it can really help speed up the review process.

Posted by Andrew Wilson on July 30, 2005

File Type Investigation

What happens when a file lacks an extension or is mislabeled? We see it all the time: a seemingly innocuous .XLS (MS Excel extension) file posing as a spreadsheet when in fact it is a .PPT (MS PowerPoint) file. In order to process this file correctly you have to know what type of file it truly is. During indexing, the first step of our process, GridLogik analyzes the content of each file and determines what application created it, enabling us to accurately process these problematic files with ease.
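A simplified sketch of content-based detection is shown below. It is not how GridLogik works internally; it assumes two well-known facts (PDF files begin with `%PDF-`, and legacy Office files share the OLE compound file signature and store stream names such as "Workbook" or "PowerPoint Document" as UTF-16LE text) and uses a naive byte search rather than parsing the container properly. The function and table names are hypothetical.

```python
# Signature shared by legacy .DOC, .XLS, and .PPT files (OLE compound file).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Stream names that hint at the creating application (checked in order).
OLE_STREAMS = [
    ("PowerPoint Document", "MS PowerPoint"),
    ("WordDocument", "MS Word"),
    ("Workbook", "MS Excel"),
]

def sniff_type(data):
    """Guess the creating application from file content, ignoring the extension."""
    if data.startswith(b"%PDF-"):
        return "PDF"
    if data.startswith(OLE_MAGIC):
        # Heuristic: look for a known stream name anywhere in the bytes.
        for stream_name, label in OLE_STREAMS:
            if stream_name.encode("utf-16-le") in data:
                return label
        return "OLE compound file (unknown application)"
    return "unknown"
```

With a check like this, a file named report.xls that actually contains a "PowerPoint Document" stream is reported as MS PowerPoint regardless of its extension. A byte search can be fooled by documents that embed other Office files, so a real implementation would read the compound file's directory instead.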

Posted by Andrew Wilson on July 30, 2005

Excel Formatting is a Beautiful Thing

Not many people in the world of eDiscovery would suggest that Excel formatting is a beautiful thing, but we do, and for good reason. We developed a method to format Excel documents in such a way that they don’t look like complete junk when converted to images. How did we do this? Ask Sheng; he spent a few months refining a very complex algorithm that…

  • Detects how many characters are in each cell and adjusts the width and height accordingly (this means we don’t have to EXPAND the column or row fully, which results in a prettier picture)
  • Finds all charts on every sheet and dynamically creates a new sheet directly behind the sheet the chart came from. The chart is then placed on that sheet at its original coordinates. This process, called chart extraction, eliminates the possibility of charts being cut off across pages.
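The first bullet can be illustrated with a toy calculation. This is not Sheng's algorithm; it assumes a sheet represented as a list of rows of cell values, measures widths in characters, and uses hypothetical minimum and maximum widths so that one long cell wraps onto extra lines instead of stretching its column.

```python
import math

def fit_columns(rows, min_width=4, max_width=30):
    """Pick a width (in characters) for each column from its longest cell,
    clamped between min_width and max_width."""
    n_cols = max(len(row) for row in rows)
    widths = []
    for col in range(n_cols):
        longest = max(
            (len(str(row[col])) for row in rows if col < len(row)),
            default=0,
        )
        widths.append(min(max(longest, min_width), max_width))
    return widths

def row_heights(rows, widths):
    """Height of each row in text lines: the most lines any of its cells
    needs once wrapped to its column width (at least one line)."""
    heights = []
    for row in rows:
        lines = [math.ceil(len(str(cell)) / widths[col])
                 for col, cell in enumerate(row)]
        heights.append(max(lines + [1]))
    return heights

sheet = [["ID", "Notes"], ["1", "x" * 70]]
widths = fit_columns(sheet)
heights = row_heights(sheet, widths)
```

Here the 70-character note is capped at a 30-character column and given three lines of height, rather than forcing the column to expand to its full length.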

Of course, we also apply all the standard formatting options, like unhiding rows/columns/sheets, converting text to black and white, providing headers and footers, fully extending the print area, etc.

Posted by Andrew Wilson on July 30, 2005

Native File Page Count Detection

We get asked all the time, “How many pages will this hard drive produce when converted to TIFF?” Most people in our industry use a set of industry averages to answer this question. These industry averages are all over the map and are generally very wrong.

So we decided to do something about it. Early on in the development of GridLogik we decided it would be very helpful to our customers if they actually knew with almost 100% accuracy how many pages/images a set of data will produce WITHOUT actually printing or converting it.

We developed a very sophisticated page extraction tool within GridLogik that determines how many pages each document, even Excel files (minus blank pages), will produce if we print/convert it. We store this information in our database, and it has proven very valuable for some of the larger cases, where the number of pages/images to review was critical to finishing the review on time.

Posted by Andrew Wilson on July 1, 2005