eDiscovery and Litigation Support

Why is unstructured data such a difficult technical challenge?

As mentioned in earlier blog posts, while uncovering the need to identify, classify, and manage unstructured data, my team and I realized that the problem was ubiquitous among enterprises. The symptoms were most obvious when organizations tackled legal discovery, storage management, compliance auditing, or risk analysis and management. People would tell us, "I have an enormous amount of data, a large number of locations where the data is stored (servers and collaboration platforms like wikis and SharePoint farms), and the data is in formats that I no longer have the applications for (WordPerfect or MultiMate)." IT professionals raised the same issues over and over again:

  1. I don't know what I have, and it will cost too much and take too long to figure it out.
  2. I don't know where specific information is stored.
  3. I don't know how to identify important or valuable content, and I cannot tell which data relates to my business and which is irrelevant.

Based on this input, it became clear that any effective technology for managing unstructured data would need to overcome three very difficult technical challenges:

  • Volume complexity: the architecture must scale to handle petabytes of data.
  • Location complexity: the approach and accompanying architecture must handle data in disparate locations--without the need to move the data to analyze or manage it.
  • Format complexity: the technology must recognize and process data in many formats and use language-neutral algorithms for analysis (in a global business world, it makes no sense to handle only Western character sets or to rely on analysis approaches that are valid only for English and a few Romance languages).

When we set out to build our unstructured data management platform, the first thing we did was analyze existing approaches--what worked and what didn't. We concluded that none of the solutions we found addressed these three areas of complexity at the architectural level, from the outset. Some were built around a search engine or a natural language processing technique that had emerged from a project where an algorithm was selected and proven on a single server with a small amount of data. They could not address "volume complexity" because it was not a core consideration when the underlying architecture was designed and built. This is why many of the search engines available for enterprise use have stability issues above ten terabytes of data. We concluded that we needed a different approach, and we started thinking about new ways to create indices that could scale efficiently to petabyte-sized data collections.

We also concluded that, given the need to identify and organize large, disparate collections of data (the contents of which were mostly unknown to the enterprise), a solution relying strictly on search could not provide optimum results. This is where the idea of letting data "speak for itself" started to take shape. We chose to represent the data mathematically because that would allow us to determine each item's natural relationships to other data items--without the distractions of language-based approaches. This is not to say that search or keywords don't matter. Keywords chosen by users are still an important element when managing unstructured data. But we view the problem in a different light. We believe that if users can see what categories of data they have and understand how search results relate to other data objects, they will be able to locate the right data more effectively and much more quickly. That is why the solution we designed has a keyword search facility that shows results and also shows which other data objects relate to the documents found with conventional search techniques. It takes both types of functionality to address true volume complexity. When you are dealing with massive data sets, the data must "speak for itself" and relate to other items in the collection. Digital Reef has a "similarity engine" at its core. One of its key uses is to point out the relationships between like items in large data stores.
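
To make the idea of representing data "in a mathematical way" concrete, here is a minimal sketch in Python. It is not the Digital Reef similarity engine, whose internals are not described in this post. It assumes one common language-neutral technique: each document becomes a vector of character trigram counts, and two documents are compared by the cosine of the angle between their vectors. The function names (ngram_vector, cosine_similarity, most_similar) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def ngram_vector(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams; no tokenizer or dictionary is needed."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank documents by similarity to the query text, best match first."""
    query_vec = ngram_vector(query)
    scored = [
        (name, cosine_similarity(query_vec, ngram_vector(body)))
        for name, body in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the vectors are built from raw character sequences rather than a vocabulary, the same code works on text in any language. A production system would need a far more compact representation to reach the data volumes discussed above; this sketch only shows how "related documents" can fall out of arithmetic rather than keywords.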

When you are dealing with massive collections of data, it is imperative to analyze the data in place. Our idea was to use mathematical "handles" to the data to arrange "views" into massive file stores and point out relationships among data objects. There is no need to manage anything but data descriptions until a reviewer, analyst, or administrator wishes to move the data from one storage location to another or convert it from one format to another. Using our approach, the user can see every server in the enterprise (including SharePoint servers) through a "single pane of glass"; identify what data is relevant to a legal matter, compliance audit, or FDA submission; and then collect that data with the click of a mouse. This approach solves the location-complexity problem for legal holds, FDA submission projects, compliance audits, and similar work.
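
The notion of managing only "data descriptions" can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the platform's actual design: the DataHandle record, its fields, and the scan_in_place and group_identical functions are all hypothetical. The point is that a scan can read files where they live, keep only a small record per file, and still surface one simple relationship (byte-identical copies stored in different places).

```python
import hashlib
import os
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class DataHandle:
    """A small description of a file; the file itself is never copied or moved."""
    path: str
    size_bytes: int
    extension: str
    sha256: str


def describe_file(path: str, chunk_size: int = 1 << 20) -> DataHandle:
    """Stream one file through SHA-256 in chunks and return its handle."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return DataHandle(
        path=path,
        size_bytes=os.path.getsize(path),
        extension=os.path.splitext(path)[1].lower(),
        sha256=digest.hexdigest(),
    )


def scan_in_place(roots: Iterable[str]) -> Iterator[DataHandle]:
    """Walk each root directory and yield one handle per file found."""
    for root in roots:
        for folder, _subdirs, files in os.walk(root):
            for name in sorted(files):
                yield describe_file(os.path.join(folder, name))


def group_identical(handles: Iterable[DataHandle]) -> Dict[str, List[str]]:
    """Map each content hash to the paths that share it, keeping only duplicates."""
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for handle in handles:
        by_hash[handle.sha256].append(handle.path)
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

Calling group_identical(scan_in_place(["/mnt/server_a", "/mnt/server_b"])) (the mount points are placeholders) would list files that exist in more than one location without copying any of them. Collection or conversion would happen later, and only for the handles a reviewer selects.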

To overcome the third major challenge, format complexity, the user must be able to convert data into useful formats from a single console. If a user has relevant legal documents in Lotus Notes, or in outdated formats like WordPerfect, the solution must enable them to convert these files to plain text, PDF, or HTML. The Digital Reef platform allows data to be easily converted into useful formats so that it can be managed.

All of the functionality discussed above represents what we believe to be the "must haves" in a platform for managing unstructured data. Unless every data object can be identified by its relationship to other data objects (as we do with our similarity engine), too much human review time is required to make sense of large piles of data. Our research tells us that most enterprises don't feel that it is possible to manage unstructured data today, because it is a very difficult problem to solve and they have never seen a single platform capable of performing the functions described above. As the descriptions of the Digital Reef platform suggest, we feel that we have accomplished just that.

In short, we make data actionable, which other platforms do not accomplish. I will be talking more about this in future blog posts. This post has gotten a bit lengthy, but it's a complicated subject area. Hopefully, I've met my goal of providing an in-depth look at the challenges of unstructured data management as well as the thinking behind the Digital Reef approach.