A blog focusing on unstructured data - topics that address the challenges, best practices, and technologies

Navigating Unstructured Data

Links to blogs and articles worth taking a look at

I recently came across some blog posts that you may find interesting, helpful, or both. Each references real-world problems in compliance or unstructured data management. They don't focus on the same topic, but they do point to the same problem--why managing unstructured data assets is difficult or important or both. They also all relate, in one way or another, to the concepts of data volume complexity, location complexity, and format complexity mentioned in my last blog post.

The first is an excellent blog entry that explains some issues around SharePoint servers. These powerful and useful collaboration servers are causing a number of problems within enterprise environments. They present a great way of aggregating content, but as they grow in popularity, the amount of content spread across a large number of servers creates litigation discovery issues and compliance risk. As employees move from project to project, the amount of data they place in SharePoint environments grows. This ever-expanding content presents a risk to litigation discovery professionals, who need to find all copies of existing content, and to compliance officers, who need to ensure that all of these large repositories comply with content directives (confidential content is secured, certain topics are not present in community archives, and proper retention policies are in place for content that must be retained).

Both the volume and location complexity axes of the problem exist with SharePoint. There is a lot of data in multiple locations, and it is tedious, if not impractical or impossible, to search. The only way to know what content exists in the SharePoint "universe" is to use an intelligent classification device that can organize and present it to reviewers in a useful way. (The Digital Reef "single pane of glass" approach provides a single view of all data stored within a SharePoint environment.)

This Byte and Switch blog, detailing Mimosa functionality, is interesting from my perspective because Mimosa and Digital Reef used together could provide the capability to implement "content specific archives" so that content of certain types could be archived intelligently.

This next blog entry from IT Toolbox is focused on information management. This post has been around for a while, but it is still valid because the classification and categorization problems standing in the way of information lifecycle management (ILM) are just being solved today. We (Digital Reef) feel that the missing link in the ILM strategies of most organizations is that they don't know what data and content exist in their environments. Meaningful ILM requires intelligence and scale. We are working with some customers to solve these problems now. More on that topic in future blog posts.

The last blog entry, Solving the Enterprise Search Dilemma, is from Tony Asaro. It points out how the old technology of keyword search cannot be used to implement the vision of efficient SharePoint management or anything as aggressive as ILM.

Why We (All) Need to Pay Attention to Unstructured Data

Unstructured data is growing at a greater rate than any other form of enterprise data (see figure below: consumption of enterprise storage). And volume is not the only issue. There is also risk--litigation risk, compliance risk, and security risk--associated with the unstructured data stored on servers scattered all over corporations and government organizations. When I talk to enterprises, one of the most prevalent statements I hear is, "We don't know what we have, because we don't have any reasonable way to determine what is in our unstructured data".

When it comes to structured data (data in databases), enterprises know what they have. But unstructured data is exactly what its name implies, "unstructured", which makes it very difficult to get a handle on. In an attempt to control it, users rely on contrived structures known as taxonomies. To create taxonomies, they use products that are inexact (error-prone) and do not account for the dynamic nature of unstructured content. To add to the challenge, unstructured content is constantly changing. Users download content from the Internet (both appropriate and inappropriate content) and save it on hard drives. These users create documents and emails by the thousands and send them to other users (literally) at near light speed. Every day the unstructured data component of the risk equation grows, morphing and taking the shape of whatever is happening at that moment in the business.

Unstructured data is fraught with risk and it is changing constantly--a data management nightmare. And, as I mentioned above, traditional tools use a pre-defined taxonomy that requires very specific expertise to create. Unfortunately for users, this approach isn't practical in a world where the content in the data changes hourly.

So, on one side of the coin is the risk associated with unstructured data. On the other side of the coin there is value. Until an enterprise understands what intellectual property and other valuable unstructured data assets it has on its servers, in its SharePoint environments, and in its storage infrastructure, it cannot leverage the expertise of its own people that is locked away, hidden somewhere within the company. The need to realize the maximum value of data makes unstructured data management tools that can handle and present enormous volumes of data in the context of a user's interests a must-have in today's data environment.

I also want to touch on traditional keyword search, because the current "state of the art" in keyword search exposes the inadequacy of data analysis as most people know it today. What is required is a more comprehensive and useful view into the unstructured data--one that can grow as the enterprise data pile grows and can provide a foundation for search that makes it more effective and accurate.

Next time, I'll take a look at some technology issues that make this a difficult problem to solve. I'll also touch on the Digital Reef approach, including a look at our similarity engine and why it is designed the way it is. I'll talk about why unstructured data should "speak for itself" if an enterprise really wants to gain control over it.

I'll be interested to hear your thoughts.

Inaugural Blog

I am the founder and CEO of Digital Reef, an enterprise software company that has recently emerged from stealth mode after two years of intensive design and development work on our unstructured data management platform. I started the company for two reasons. First, I discovered that the management of unstructured data was an unsolved problem at every large company I came in contact with. Second, I realized that it was unsolved because it was a technically difficult problem. I am a business person and a technologist. I don't believe in technology for technology's sake. My career has been devoted to solving large-scale business problems with technology. Presented with a problem like the growth and complexity of understanding and working with unstructured data, I was hard pressed to let it go.

When I first encountered this problem, it was presented to me as a large and growing issue that emerged from the Sarbanes-Oxley mandates of the early part of this decade. As a result of this legislation, and other regulations springing from issues with organizations caught up in the excesses of the dot-com "crash", companies had been forced to save vast stores of electronic business records. The obvious consequence of complying with these regulations was that the existing tools proved woefully underpowered for evaluating the content within these vast stores of information.

This content evaluation task was strangely similar to others I had encountered while working with scientists at Bell Laboratories, back when I was CTO of Lucent Technologies' Wireline Business Unit. There, we learned that evaluating large amounts of content to identify insidious threats to network and server infrastructure required new approaches and functionality that did not exist in current network security solutions. The same concepts can be applied to understanding large content stores. That is what we are doing at Digital Reef: making content easier to understand and manage.

I am embarking on this blog in the hopes of sparking discussion around technology solutions to difficult business problems. My plan is to blog about business challenges created by unstructured data--topics including eDiscovery, data storage, knowledge reuse, data security, compliance and data governance, to name a few. I also plan to provide my assessment of some of the technology solutions out there today and my thoughts about what is coming in the future.

I look forward to getting the discussion started.
