Tip: Do Vertical De-duplication Correctly
Vertical de-duplication (identifying duplicate documents within a single custodian’s data) is not as efficient as horizontal de-duplication (identifying duplicate documents across multiple custodians’ data), but it is sometimes required by various agencies or even by clients themselves.
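To make the difference concrete, here is a minimal sketch rather than any particular product's method. It assumes duplicates are detected by hashing file content, which the post does not specify, and the `Doc` record, the `dedupe` function and the custodian names are all hypothetical.

```python
import hashlib
from collections import namedtuple

# Hypothetical record: a custodian label, a file name, and the raw document bytes.
Doc = namedtuple("Doc", ["custodian", "name", "content"])


def digest(content: bytes) -> str:
    """Content fingerprint (assumption: duplicates are exact byte-for-byte copies)."""
    return hashlib.sha256(content).hexdigest()


def dedupe(docs, vertical=True):
    """Return the documents kept after de-duplication, preserving input order.

    vertical=True  -> a document is dropped only if the SAME custodian
                      already holds identical content.
    vertical=False -> horizontal: dropped if ANY custodian already holds it.
    """
    seen = set()
    kept = []
    for doc in docs:
        h = digest(doc.content)
        key = (doc.custodian, h) if vertical else h
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


docs = [
    Doc("Doe, John A", "memo.doc", b"quarterly memo"),
    Doc("Doe, John A", "memo-copy.doc", b"quarterly memo"),
    Doc("Smith, Jane B", "memo.doc", b"quarterly memo"),
]

vertical_kept = dedupe(docs, vertical=True)
horizontal_kept = dedupe(docs, vertical=False)
print(len(vertical_kept), len(horizontal_kept))
```

Note that the custodian label is part of the comparison key in the vertical case, which is why the exact spelling of the name matters so much in the steps that follow.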
To do vertical de-duplication correctly, you first need an organized data structure. Start by using the exact spelling of the custodian’s name as the top-level directory where you will place the custodian’s data set(s). The preferred format is last name, first name and, if you can get it, middle initial.
Any data collected for the custodian should be placed within the corresponding custodian directory, in sub-directories that are easily identified by data source (such as File Server, Exchange Email, Blackberry, etc.). The goal of the sub-directories is to keep track of where the data originally came from. You could do this the other way around, with the data sources as the top-level structure and the custodian names as the sub-directories, but that becomes confusing and complicated to manage. Using the custodian name as the starting point will make things easier in the collection and processing stages. If done correctly, your structure should look something like this (the custodian name is a made-up example):

Doe, John A
    Blackberry
    Exchange Email
    File Server
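The layout described above can also be created by script so that every custodian directory is spelled and structured the same way. The following is only a sketch: the function names, the custodian names and the use of a temporary directory as the collection root are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def make_custodian_dirs(root, custodian, sources):
    """Create <root>/<custodian>/<source> for each data source."""
    custodian_dir = Path(root) / custodian
    for source in sources:
        # parents=True also creates the custodian directory itself;
        # exist_ok=True makes the call safe to repeat for later batches.
        (custodian_dir / source).mkdir(parents=True, exist_ok=True)
    return custodian_dir


def show_tree(root):
    """Return the two-level layout as indented lines (custodian, then sources)."""
    lines = []
    for custodian_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        lines.append(custodian_dir.name)
        for source_dir in sorted(p for p in custodian_dir.iterdir() if p.is_dir()):
            lines.append("    " + source_dir.name)
    return lines


# Demo in a throwaway directory (hypothetical custodians and sources).
with tempfile.TemporaryDirectory() as root:
    make_custodian_dirs(root, "Doe, John A", ["File Server", "Exchange Email", "Blackberry"])
    make_custodian_dirs(root, "Smith, Jane B", ["Exchange Email"])
    tree = show_tree(root)

print("\n".join(tree))
```

The trailing period after the middle initial is left off in this sketch because some file systems do not preserve trailing dots in directory names.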

Keeping a log of all the custodian names, along with the data sources (e.g. Exchange Email, File Server) collected for each custodian, is a crucial part of performing accurate vertical de-duplication. This is important because data is generally collected in batches during discovery. If a custodian has new data to be collected and processed after the majority of that custodian’s data has already been collected and processed, spelling the person’s name exactly as it was spelled before makes it very easy for de-duplication software to compare the new data against the old data. If the name is missing or spelled differently, most de-duplication software cannot easily tell whose data it is and may treat the new data as belonging to a completely new custodian, so none of the new data will be de-duplicated against the old data.
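One way such a log might be kept is sketched below. The `CustodianLog` class, its method names and the 0.8 similarity cutoff are all assumptions for illustration, not a feature of any specific de-duplication tool. The idea is to check each name on a new batch against the log before processing, so that a near-miss spelling is caught instead of silently becoming a new custodian.

```python
import difflib


class CustodianLog:
    """Tracks exact custodian spellings and the data sources logged for each."""

    def __init__(self):
        self._sources = {}  # custodian name -> set of data source labels

    def register(self, custodian, source):
        """Record that data from `source` was collected for `custodian`."""
        self._sources.setdefault(custodian, set()).add(source)

    def sources(self, custodian):
        """Data sources logged so far for this exact spelling."""
        return sorted(self._sources.get(custodian, ()))

    def check(self, custodian):
        """Classify a name from a new batch.

        Returns ("known", [])          if the spelling matches the log exactly,
                ("suspect", [names])   if it closely resembles logged names,
                ("new", [])            otherwise.
        """
        if custodian in self._sources:
            return "known", []
        # The cutoff is an arbitrary choice for this sketch; tune it for real data.
        near = difflib.get_close_matches(custodian, list(self._sources), n=3, cutoff=0.8)
        return ("suspect", near) if near else ("new", [])


log = CustodianLog()
log.register("Doe, John A", "Exchange Email")
log.register("Doe, John A", "File Server")
log.register("Smith, Jane B", "File Server")

exact = log.check("Doe, John A")
typo = log.check("Doe, Jon A")
fresh = log.check("Nguyen, Linh T")
print(exact, typo, fresh)
```

A "suspect" result is a prompt for a person to confirm the spelling before the batch is processed; the sketch deliberately does not auto-correct names.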
Following these simple steps will not only help you organize your data better, but will also make the overall eDiscovery process a little smoother for you and your preferred eDiscovery provider.
Posted by Andrew Wilson on January 1, 2008
