The Great Divide
Databases Are So 20th Century
by Dave Kellogg
02-Jan-2006

There's a big divide that exists in information technology. One that we're not supposed to talk about. One that we pretend doesn't exist, but in fact gets wider with each passing year -- it's the divide between data and content.
Data is a first-class citizen in the IT world. Data has a nice home. It lives in databases that offer control, consistency, security, backup/recovery, indexing, and a query mechanism.
Most content, on the other hand, is homeless, relegated to the file system. If you're lucky you're using a search engine to index content, so you can find files containing certain words or phrases. Or, with some additional setup, you can index a few XML tags, if present, and run fielded searches against them. But for most content, the full benefits of living in a database -- such as powerful, fine-grained queries, transaction consistency, and immediate availability -- just aren't available.
As a result, our understanding of content has become comparatively impoverished, our expectations for what is possible with content are reduced, and a myth gets perpetuated that content can and should be managed with the same tools and approaches as data.
Isn't ECM About Content?
Just as poor countries have rich residents, there is an upper class of content that gets to live in databases (e.g., corporate web content, aircraft repair manuals, new drug applications). Typically, this is accomplished through enterprise content management (ECM) systems that both break the content into bite-sized morsels that fit into relational "square tables" and track metadata about it (e.g., author, version, check-in status, required approvals).
So while upper-class content enjoys life in a database, the great irony is that ECM typically treats the content itself as opaque -- because of the limitations in the underlying database system. That is, while ECM tracks and manages a lot of information about the content, it actually does relatively little to help get inside content. Despite its middle name, ECM today isn't really about content. It's about metadata.
By analogy, an ECM system is a bit like a database system that provides reports, but not queries. You can see this report or that report. This report was made by Bob, that one by Sally. This one is version 2, the other is version 3. This one needs to be approved by Joe before official publication. But you'd have no ability to get in at a finer level, run your own queries, or change the reports to roll up different data in different ways. That's because in an ECM repository, the content is typically opaque and the system is simply tracking information about it.
Sure, some systems let you automatically break up or "chunk" your content so you can have more granular, yet nonetheless still opaque, pieces. But these approaches require content that is absolutely regular in schema (which is almost never the case) and create a "humpty dumpty" problem. Once you break your content up, can you re-assemble it? Most often, the answer is actually no, and there is information loss in the "round tripping" of content between its native and shredded form. In addition, even for those who choose chunking, the performance implications of working with chunked content drive most users to pick highly coarse-grained chunks anyway. It's simply too slow and cumbersome in most systems to try to work with content at a fine-grained level.
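To make the "humpty dumpty" problem concrete, here is a minimal sketch in Python. It is not how any particular ECM product chunks content; the `shred` and `reassemble` helpers and the sample sentence are hypothetical, chosen only to show how a naive element-per-row shredding drops the text that follows an inline tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a paragraph with mixed content (text around an inline tag).
SOURCE = "<p>Take <b>two</b> tablets daily.</p>"


def shred(xml_text):
    """Naively shred a document into (tag, text) rows, one row per element.

    Only each element's leading text is kept; text that trails a child
    element (its "tail") has no column to live in and is dropped.
    """
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text or "") for el in root.iter()]


def reassemble(rows):
    """Rebuild a document by nesting every later row under the first row."""
    root_tag, root_text = rows[0]
    root = ET.Element(root_tag)
    root.text = root_text
    for tag, text in rows[1:]:
        child = ET.SubElement(root, tag)
        child.text = text
    return ET.tostring(root, encoding="unicode")


rows = shred(SOURCE)
rebuilt = reassemble(rows)

print(rows)               # the shredded rows
print(rebuilt)            # the reassembled document
print(rebuilt == SOURCE)  # False: " tablets daily." did not survive the round trip
```

A more careful shredder could record tails and ordering, but each such fix adds rows and joins, which is the performance pressure toward coarse chunks described above.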
Return of the BLOB
That 80% of the world's information is unstructured and the vast majority not stored in databases is not lost on the relational database mafia. They've been trying for decades to solve what they consider "the unstructured data problem." First, they offered completely opaque binary large objects, or "BLOBs" (which were actually pointers back to...guess where: the file system). Then, they offered character large objects ("CLOBs") that had some basic text search capabilities.
I was in Radio City Music Hall in New York City at the Oracle8 launch in 1997 when Larry Ellison declared the death of files. But files lived stubbornly on.
I read much about WinFS, Microsoft's attempt at solving the problem. Owning both a DBMS and an operating system, Microsoft approached the problem differently. Instead of trying to make a better DBMS, why not make a better, XML-based file system? Originally slated for Cairo and now for Vista, waiting for WinFS has been like waiting for Godot. Why? Part of the delay is undoubtedly Microsoft's famously clotted development process, but part must come from the management constraint that WinFS be implemented on top of SQL Server. Relational databases are terrible at modeling hierarchy. Modeling a hierarchical file system in which each file itself has a rich, internal hierarchy of nodes is simply a lousy application for a relational database.
The latest attempt at eliminating the great divide is the addition of XML types to relational databases. This approach dates back to abstract datatypes, originally implemented in Postgres, first commercially available in Ingres, later re-branded as datablades by Illustra, and then sold off to Informix to launch the "universal" database hype of the mid 1990s.
The idea was simple. If certain types of data didn't fit into databases -- and if you assumed the problem couldn't be the relational model itself -- then the problem had to be an implementation limitation, specifically the absence of datatypes. That is, tables and columns were perfect for modeling anything, just as long as you had an infinite supply of types available for columns. If you had messages, create a message type. If you had spatial coordinates, create a spatial coordinate type.
While few customers actually used abstract datatypes, that didn't stop the RDBMS vendors from touting them as the latest solution to the unstructured data problem. All of the major RDBMS vendors are in the midst of implementing XML as a type, so you can create a table called DOCUMENTS with an INTEGER column called DOC-ID and an XML column called DOCUMENT.
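A rough sketch of that table shape, using Python's built-in `sqlite3` module. Several things here are assumptions for illustration: SQLite has no XML type, so a TEXT column stands in for it; `doc_id` replaces DOC-ID because a hyphen is not valid in an unquoted SQL identifier; and the sample documents and the `docs_by_author` helper are hypothetical. Vendor XML types come with their own query functions, which this sketch does not attempt to represent.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")

# TEXT stands in for an XML column type, which SQLite does not have.
conn.execute(
    "CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, document TEXT)"
)
conn.executemany(
    "INSERT INTO documents (doc_id, document) VALUES (?, ?)",
    [
        (1, "<doc><author>Bob</author><p>Rest the elbow.</p></doc>"),
        (2, "<doc><author>Sally</author><p>Ice the knee.</p></doc>"),
    ],
)


def docs_by_author(connection, author):
    """Return doc_ids whose <author> element equals `author`.

    The database only hands back whole documents; looking inside them
    happens here, in application code, one row at a time.
    """
    hits = []
    cursor = connection.execute(
        "SELECT doc_id, document FROM documents ORDER BY doc_id"
    )
    for doc_id, xml_text in cursor:
        if ET.fromstring(xml_text).findtext("author") == author:
            hits.append(doc_id)
    return hits


bob_docs = docs_by_author(conn, "Bob")
print(bob_docs)
```

In this stand-in, the row is the unit the database understands and the document inside it is a single column value, so anything finer than "give me the document" is left to the caller.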
Content Before Data
I don't believe this latest attempt will eliminate the great divide because the RDBMS vendors are again starting with the wrong question. They are (for the fourth time) starting with the question, "how can I fit unstructured content into my existing databases?" when they should be starting with the question, "how is content different from data and how would I make a data store where content is a first-class citizen?"
I'll answer the question myself:
- First, you need to abandon the notion that content is a special case of data. Indeed, it's the other way around; data is a special case of content that happens to be highly regular in structure.
- Second, you need to recognize that content, and particularly XML content, is strictly hierarchical in structure. So the enabling technologies must model hierarchy well.
- Third, content needs to be taken "as is" or you will never be able to load it. This is the great lesson of search engines. While they have many limitations when seen through a database lens, their one great strength is that they take content "as is." Because content comes from so many sources, because there is so much of it, you simply cannot require preparation or transformation as a precursor to loading content.
- Fourth, you'll need to build a full-fledged search engine into the DBMS because users will want to run queries that combine structured and unstructured fields (e.g., return the first paragraph of all documents authored by Bob and containing the word tendonitis). Traditional b-tree key indexing is not enough, nor are independent text and XML indexing efforts.
- Fifth, you'll need to build a system that can handle the many ambiguities inherent in content that don't exist in the simpler world of data, such as stemming (Dave vs. David), synonyms (fracture vs. break), taxonomy (fruit vs. apple), and source language (apple vs. pomme).
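The example query from the fourth point ("the first paragraph of all documents authored by Bob and containing the word tendonitis") can be sketched in a few lines of Python. This is only an illustration of what the query means, not of how a content-aware DBMS would execute it: it scans every document linearly with no index at all, and the sample documents, the element names (`author`, `p`), and the `first_paragraphs` function are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical documents, loaded "as is" with no preparation step.
DOCS = [
    "<doc><author>Bob</author><p>Tendonitis often follows overuse.</p>"
    "<p>Rest helps.</p></doc>",
    "<doc><author>Bob</author><p>Quarterly sales rose.</p></doc>",
    "<doc><author>Sally</author><p>Treating tendonitis early matters.</p></doc>",
]


def first_paragraphs(docs, author, word):
    """Combine a structured test (author) with an unstructured one (word)."""
    needle = word.lower()
    results = []
    for xml_text in docs:
        root = ET.fromstring(xml_text)
        # Structured condition: an exact match on a known element.
        if root.findtext("author") != author:
            continue
        # Unstructured condition: the word appears anywhere in the paragraphs.
        body = " ".join(p.text or "" for p in root.iter("p"))
        if needle not in re.findall(r"\w+", body.lower()):
            continue
        # Fine-grained result: one paragraph, not the whole file.
        results.append(root.findtext("p"))
    return results


bob_hits = first_paragraphs(DOCS, "Bob", "tendonitis")
print(bob_hits)
```

The word test here is an exact token match, so it would miss "tendinitis" or a synonym, which is the kind of ambiguity the fifth point says the system itself must handle.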
Only when we start treating content as a first-class citizen in designing the next generation of database systems will we be able to eliminate the great divide, enable the coming generation of content applications, and reap the value from all our information, not just the 20% that can be represented as structured data. Just as our kids listen to iTunes today and ask us what "records" were, so one day they will query contentbases and ask us what databases were.
"Dad, you mean that you had contentbases that could only handle numerical and short-text fields? Lame." Databases were so 20th century.


