Discovery Software: Cloud Data Services Part 2
Posted by Steve Akers on Fri, Feb 11, 2011 @ 01:59 PM
As I mentioned in previous posts, a user can end up with massive amounts of data “in the cloud” and then have to search it to discover whether that data has attributes relevant to a legal or regulatory matter. If the data size approaches the petabyte range, and the user has no way of searching the data at the service location, or has no local “manifest” or index of what they “put away” in the cloud, then they have a huge problem on their hands. Pulling all that data down merely to review or index it for search would take petabytes of storage and months of effort.
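To put a rough number on “months of effort,” here is a back-of-the-envelope sketch of the transfer time alone. The 1 Gbit/s link speed and the `efficiency` factor are illustrative assumptions, not figures from any particular service, and the function name is made up for this example.

```python
def transfer_days(data_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link at the given speed.

    efficiency is the assumed fraction of the nominal link speed that is
    actually achieved (1.0 means a perfectly saturated link).
    """
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15   # bytes (decimal petabyte)
GIGABIT = 10**9     # bits per second

# One petabyte over a fully saturated 1 Gbit/s link.
print(f"{transfer_days(PETABYTE, GIGABIT):.1f} days")  # prints "92.6 days"
```

Even with a dedicated, fully used gigabit link, the download alone runs to about three months, before any indexing or review begins.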
Our company was recently asked to participate in a very large discovery project of this scale. That project involved a very large multinational company that could marshal substantial resources to address its discovery issue, and the data did not have to be retrieved from a Cloud service. Most companies using Cloud services today could not undertake such a project. For a mid-sized or small business it would not be feasible at all to retrieve everything from The Cloud, provision enough storage to hold the data and a large index, and then find what is relevant to a given matter. To handle the discovery issues that will inevitably come at “the day of reckoning” with Cloud Services, another component of the service is required.
The question then is: on which side of the data “de-mark” does the search and discovery process have to exist, and what must the service offering include to be useful to customers?
In The Cloud (Total Service) Approach
One way around this dilemma is to buy a Cloud storage solution that includes a very scalable search and indexing capability. A number of the email archiving companies provide this as part of their service to customers. It is generally limited to email and does not handle backup or storage data (SAN/NAS), but it does provide keyword-based eDiscovery capability to help find at least the first stage of data for a regulatory or legal event. These services lack capabilities like de-duplication, chain of custody mapping and legal production (one still has to provide those independently of the service), but at least the amount of data that has to be copied back to a legal hold or other storage location for review can be minimized. See Figure One below for a depiction of this total service approach.
Again, a full-service deployment of this kind from a Cloud provider would employ a total governance solution. Digital Reef would provide other services on top of simple keyword search, including:
- Near-duplicate analysis (which documents are similar in semantic meaning to the documents found by keyword; this helps recover information left behind by keyword-only searching)
- Conversational analysis (email and instant messaging analysis; “who spoke to whom”)
- Semantic clustering (organizes documents that “belong together” to give a reviewer context)
- Full metadata support for legal export; EDRM support for export to legal review software
- Exemplar content matching – a capability for grouping content toward example messages or files
- Conversion services – using OCR to make TIFF, JPEG and PDF documents searchable as a matter of course (a regular part of the service; this is missing in other services)
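To give a feel for what near-duplicate analysis involves, here is a minimal sketch using word shingles and Jaccard similarity. This is a simple lexical stand-in chosen for illustration; it is not Digital Reef's method, which is described above as working on semantic meaning. The function names, the shingle size of 3 and the 0.5 threshold are all assumptions.

```python
from __future__ import annotations

import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs: dict[str, str],
                    threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Compare every pair of documents and keep pairs at or above threshold."""
    sets = {name: shingles(text) for name, text in docs.items()}
    names = sorted(sets)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(sets[first], sets[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical sample documents for illustration only.
sample = {
    "a": "The quarterly report is attached for your review",
    "b": "The quarterly report is attached for your approval",
    "c": "Lunch on Friday at noon",
}
for first, second, score in near_duplicates(sample):
    print(first, second, round(score, 2))  # prints "a b 0.71"
```

Comparing every pair works for a handful of documents; at the data sizes discussed in this post a real service would need an indexing scheme rather than pairwise comparison.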
On Premise (Semi-Self Service) Approach
Some corporations prefer to retain an index and a “manifest” of everything that they put into The Cloud. The local on-premise solution shown in Figure Two (below) meets that requirement. Some major US corporations and some smaller companies with strict regulatory requirements have to keep a local index (within their custody and control) of any data that is sent off-premise for archiving and retention. This is accomplished with something like the Digital Reef portable product shown in Figure Two (below).
This solution also lets a customer intelligently “sweep” data from local file stores and SharePoint servers into The Cloud. It also supplies the services discussed in previous blog posts around de-duplication, tagging of important content (adding metadata that is meaningful to reviewers) and all the other services required of a regulatory governance solution.
Please see Figure Three (below) for a depiction of the virtual repository component that “maps” the many sources of data (local to an enterprise) that exist and exposes locations in which relevant data for legal matters may exist. SharePoint servers (sometimes hosted at multiple enterprise locations) and email servers and file storage systems (often at many different locations) may all contain relevant data for a legal or regulatory matter. Knowing where these are and what they contain is a large undertaking.
Having a business record that explains where the items that were archived in The Cloud were found originally is a requirement of data governance. The local index manifest of data is a business record that explains where data items were found before they were placed into The Cloud.
This is an important regulatory requirement that is missing in many Cloud Storage solutions today.
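As a rough illustration of what a local manifest entry might hold, here is a minimal sketch. The field names, the `make_entry` helper and the example paths are hypothetical and are not taken from the Digital Reef product; the idea is only that each archived item keeps a record of where it was found, where it went, and a hash to show it was not altered.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ManifestEntry:
    original_location: str   # where the item was found before archiving
    cloud_object_key: str    # where the item now lives in the cloud store
    sha256: str              # content hash, to show the item is unchanged
    size_bytes: int
    archived_at: str         # ISO 8601 timestamp, UTC


def make_entry(original_location: str, cloud_object_key: str,
               content: bytes, archived_at: datetime = None) -> ManifestEntry:
    """Build a manifest record for one item about to be sent to the cloud."""
    if archived_at is None:
        archived_at = datetime.now(timezone.utc)
    return ManifestEntry(
        original_location=original_location,
        cloud_object_key=cloud_object_key,
        sha256=hashlib.sha256(content).hexdigest(),
        size_bytes=len(content),
        archived_at=archived_at.isoformat(),
    )


def find_by_original_prefix(manifest: list, prefix: str) -> list:
    """Answer 'what did we archive from this server or folder?' locally."""
    return [e for e in manifest if e.original_location.startswith(prefix)]


# Hypothetical example item.
entry = make_entry("//fileserver01/legal/contract.docx",
                   "archive/2011/0001", b"example file contents")
print(json.dumps(asdict(entry), indent=2))
```

Because the manifest stays on premise, a question such as “what came from the legal share?” can be answered with `find_by_original_prefix` without retrieving anything from the cloud.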
Summary
In any Cloud Storage deployment one needs to account for the inevitable “day of reckoning” when the data that goes into the Cloud must come out of the Cloud. If the consumer of the Cloud Data Service is wise and considers what will be required when a discovery event occurs, they can deploy either an “in the cloud” or an “on-premise” solution that makes sense for their business. Cost, deployment difficulty and regulatory requirements will drive the decision for each type and size of business. Some things in the following table are important to consider in these situations.