The Great Divide

Databases Are So 20th Century

by Dave Kellogg
02-Jan-2006

There's a big divide that exists in information technology. One that we're not supposed to talk about. One that we pretend doesn't exist, but in fact gets wider with each passing year -- it's the divide between data and content.

Data is a first-class citizen in the IT world. Data has a nice home. It lives in databases that offer control, consistency, security, backup/recovery, indexing, and a query mechanism.

Most content, on the other hand, is homeless, relegated to the file system. If you're lucky you're using a search engine to index content, so you can find files containing certain words or phrases. Or, with some additional setup, you can index a few XML tags, if present, and run fielded searches against them. But for most content, the full benefits of living in a database -- such as powerful, fine-grained queries, transaction consistency, and immediate availability -- just aren't available.

As a result, our understanding of content has become comparatively impoverished, our expectations for what is possible with content are reduced, and a myth gets perpetuated that content can and should be managed with the same tools and approaches as data.

Isn't ECM About Content?

Just as poor countries have rich residents, there is an upper class of content that gets to live in databases (e.g., corporate web content, aircraft repair manuals, new drug applications). Typically, this is accomplished through enterprise content management (ECM) systems that both break the content into bite-sized morsels that fit into relational "square tables" and track metadata about it (e.g., author, version, check-in status, required approvals).

So while upper-class content enjoys life in a database, the great irony is that ECM typically treats the content itself as opaque -- because of the limitations in the underlying database system. That is, while ECM tracks and manages a lot of information about the content, it actually does relatively little to help get inside content. Despite its middle name, ECM today isn't really about content. It's about metadata.

By analogy, an ECM system is a bit like a database system that provides reports, but not queries. You can see this report or that report. This report was made by Bob, that one by Sally. This one is version 2, the other is version 3. This one needs to be approved by Joe before official publication. But you'd have no ability to get in at a finer level, run your own queries, or change the reports to roll up different data in different ways. That's because in an ECM repository, the content is typically opaque and the system is simply tracking information about it.

Sure, some systems let you automatically break up or "chunk" your content so you can have more granular, yet nonetheless still opaque, pieces. But these approaches require content that is absolutely regular in schema (which is almost never the case) and create a "humpty dumpty" problem. Once you break your content up, can you re-assemble it? Most often, the answer is actually no, and there is information loss in the "round tripping" of content between its native and shredded form. In addition, even for those who choose chunking, the performance implications of working with chunked content drive most users to pick highly coarse-grained chunks anyway. It's simply too slow and cumbersome in most systems to try and work with content at a fine-grained level.
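The "humpty dumpty" problem can be made concrete with a small sketch. It assumes a deliberately naive chunker that stores one (tag, text) row per element; the sample sentence and the function names `shred` and `reassemble` are hypothetical, and real ECM chunkers are more sophisticated. The point is only that text interleaved between child elements has no obvious home in such rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with mixed content: text interleaved with child elements.
SOURCE = "<para>Apply <b>ice</b> to the joint, then <i>rest</i> it.</para>"

def shred(xml_text):
    """Flatten a document into (tag, text) rows, as a naive chunker might."""
    root = ET.fromstring(xml_text)
    # Only each element's leading text is kept; text that follows a child
    # element (its "tail") has no column to live in and is dropped.
    return [(el.tag, (el.text or "").strip()) for el in root.iter()]

def reassemble(rows):
    """Rebuild a document from the rows alone; the first row is the root."""
    root_tag, root_text = rows[0]
    root = ET.Element(root_tag)
    root.text = root_text
    for tag, text in rows[1:]:
        child = ET.SubElement(root, tag)
        child.text = text
    return ET.tostring(root, encoding="unicode")

rows = shred(SOURCE)
rebuilt = reassemble(rows)
print(rows)
print(rebuilt)
print("round trip preserved:", rebuilt == SOURCE)
```

In this sketch the words "to the joint, then" and "it." never make it into a row, so the rebuilt paragraph differs from the original. A richer row layout could preserve them, but only at the cost of more rows and more joins, which is the performance trade-off described above.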

Return of the BLOB

That 80% of the world's information is unstructured and the vast majority not stored in databases is not lost on the relational database mafia. They've been trying for decades to solve what they consider "the unstructured data problem." First, they offered completely opaque binary large objects, or "BLOBs" (which were actually pointers back to...guess where: the file system). Then, they offered character large objects ("CLOBs") that had some basic text search capabilities.

I was in Radio City Music Hall in New York City at the Oracle8 launch in 1997 when Larry Ellison declared the death of files. But files lived stubbornly on.

I read much about WinFS, Microsoft's attempt at solving the problem. Owning both a DBMS and an operating system, Microsoft approached the problem differently. Instead of trying to make a better DBMS, why not make a better, XML-based file system? Originally slated for Cairo and now for Vista, waiting for WinFS has been like waiting for Godot. Why? Part of the delay is undoubtedly Microsoft's famously clotted development process, but part must come from the management constraint that WinFS be implemented on top of SQL Server. Relational databases are terrible at modeling hierarchy. Modeling a hierarchical file system in which each file itself has a rich, internal hierarchy of nodes is simply a lousy application for a relational database.

The latest attempt at eliminating the great divide is the addition of XML types to relational databases. This approach dates back to abstract datatypes, originally implemented in Postgres, first commercially available in Ingres, later re-branded as datablades by Illustra, and then sold off to Informix to launch the "universal" database hype of the mid 1990s.

The idea was simple. If certain types of data didn't fit into databases -- and if you assumed the problem couldn't be the relational model itself -- then the problem had to be an implementation limitation, specifically the absence of datatypes. That is, tables and columns were perfect for modeling anything, just as long as you had an infinite supply of types available for columns. If you had messages, create a message type. If you had spatial coordinates, create a spatial coordinate type.

While few customers actually used abstract datatypes, that didn't stop the RDBMS vendors from touting them as the latest solution to the unstructured data problem. All of the major RDBMS vendors are in the midst of implementing XML as a type, so you can create a table called DOCUMENTS with an INTEGER column called DOC-ID and an XML column called DOCUMENT.
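As a rough illustration of that table, here is a sketch using Python's built-in sqlite3 module. Several things are assumptions: SQLite has no XML type, so a TEXT column stands in for it; the column is named DOC_ID because a hyphen is not valid in an unquoted SQL identifier; and the two sample documents are invented. Databases with a native XML type typically add functions for querying inside the column, which this stand-in does not show.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Assumption: TEXT stands in for an XML column type, which SQLite lacks.
conn.execute(
    "CREATE TABLE DOCUMENTS (DOC_ID INTEGER PRIMARY KEY, DOCUMENT TEXT)"
)
conn.executemany(
    "INSERT INTO DOCUMENTS (DOC_ID, DOCUMENT) VALUES (?, ?)",
    [
        (1, "<doc><author>Bob</author><para>Tendonitis notes.</para></doc>"),
        (2, "<doc><author>Sally</author><para>Fracture notes.</para></doc>"),
    ],
)

# The relational side handles the integer key natively.
(xml_text,) = conn.execute(
    "SELECT DOCUMENT FROM DOCUMENTS WHERE DOC_ID = ?", (1,)
).fetchone()

# In this sketch, looking inside the document happens outside SQL:
# the stored value is just a string until application code parses it.
author = ET.fromstring(xml_text).findtext("author")
print(author)
```

The document travels through the table as a single value in one column: the table knows its key, while its internal hierarchy only becomes visible once something parses it.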

Content before data

I don't believe this latest attempt will eliminate the great divide because the RDBMS vendors are again starting with the wrong question. They are (for the fourth time) starting with the question, "how can I fit unstructured content into my existing databases," when they should be starting with the question, "how is content different from data and how would I make a data store where content is a first-class citizen?"

I'll answer the question myself:

  • First, you need to abandon the notion that content is a special case of data. Indeed, it's the other way around; data is a special case of content that happens to be highly regular in structure.
  • Second, you need to recognize that content, and particularly XML content, is strictly hierarchical in structure. So the enabling technologies must model hierarchy well.
  • Third, content needs to be taken "as is" or you will never be able to load it. This is the great lesson of search engines. While they have many limitations when seen through a database lens, their one great strength is that they take content "as is." Because content comes from so many sources, because there is so much of it, you simply cannot require preparation or transformation as a precursor to loading content.
  • Fourth, you'll need to build a full-fledged search engine into the DBMS because users will want to run queries that combine structured and unstructured fields (e.g., return the first paragraph of all documents authored by Bob and containing the word tendonitis.) Traditional b-tree key indexing is not enough, nor are independent text and XML indexing efforts.
  • Fifth, you'll need to build a system that can handle the many ambiguities inherent in content that don't exist in the simpler world of data, such as stemming (Dave vs. David), synonyms (fracture vs. break), taxonomy (fruit vs. apple), and source language (apple vs. pomme).

Only when we start treating content as a first-class citizen in designing the next generation of database systems will we be able to eliminate the great divide, enable the coming generation of content applications, and reap the value from all our information, not just the 20% that can be represented as structured data. Just as our kids listen to iTunes today and ask us what "records" were, so one day they will query contentbases and ask us what databases were.

"Dad, you mean that you had contentbases that could only handle numerical and short-text fields? Lame." Databases were so 20th century.

About the Author

Dave Kellogg

Dave Kellogg is President and CEO of MarkLogic, developer of MarkLogic Server, an XML content server. Kellogg is a 3-decade software and applications industry veteran, having headed up marketing at Business Objects and at Versant Object Technology (a provider of object database management systems), after holding technical and marketing positions with RDBMS vendor Ingres Corporation.