TrendWatch Blog

Content cleanup in the former East Germany

26-Dec-2007

There's no time like the holidays for catching up on back issues of The Economist (don't worry, we're baking cookies, too), and this morning I found myself engrossed by a tale of pattern matching. No, not pattern matching of snowflakes or Christmas knits, but of a set of documents ripped into 600 million pieces by East Germany's State Security Service (better known as the Stasi), back when the Berlin Wall was being torn down and the mob was at the gates. The Stasi were afraid of documents falling into the wrong hands, so when the shredders failed, they frantically resorted to tearing up documents piece by piece. And you thought getting your enterprise search engine to pull off late-binding security was tough?

In a project currently underway at Berlin's Fraunhofer Institute for Production Systems and Design Technology, software is being used to find patterns in these millions of Stasi-created fragments of paper and re-assemble them, jigsaw-puzzle style. In going through the fragments, the software is grouping the scanned shreds of paper together by identifying patterns in handwriting, color, paper texture, even ink color. Then, once a group of related shreds is found, the software puzzles the papers together. In their haste, the Stasi actually helped this process quite a bit -- most of the fragments of the same document were found in the same bag. Or bucket. Category. Taxonomy facet, if you will.
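For the technically curious, the two-step idea described above (first bucket the shreds by shared traits, then try to puzzle together only the shreds within a bucket) can be sketched roughly as follows. This is a toy illustration under my own assumptions, not the Fraunhofer software: the `Fragment` fields, the numeric "tear profile" for each edge, and the `tolerance` parameter are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import permutations
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """A scanned shred. All fields are hypothetical, for illustration only."""
    frag_id: str
    bag: int                            # which bag the shred was found in
    paper_color: str
    ink_color: str
    left_edge: Optional[Tuple[int, ...]]   # tear profile; None = page margin
    right_edge: Optional[Tuple[int, ...]]


def group_fragments(fragments):
    """Step 1: bucket shreds that share a bag, paper color and ink color."""
    groups = defaultdict(list)
    for frag in fragments:
        groups[(frag.bag, frag.paper_color, frag.ink_color)].append(frag)
    return dict(groups)


def edges_fit(a, b, tolerance=1):
    """True if a's right tear interlocks with b's left tear.

    Two profiles interlock when each bump on one side meets a matching
    dent on the other, i.e. the paired offsets roughly cancel out.
    """
    if a.right_edge is None or b.left_edge is None:
        return False
    if len(a.right_edge) != len(b.left_edge):
        return False
    return all(abs(x + y) <= tolerance
               for x, y in zip(a.right_edge, b.left_edge))


def candidate_joins(group):
    """Step 2: within one bucket, list ordered pairs whose edges fit."""
    return [(a.frag_id, b.frag_id)
            for a, b in permutations(group, 2)
            if edges_fit(a, b)]
```

The point of the grouping step is the same as a taxonomy facet in search: the expensive pairwise comparison only runs inside a small bucket rather than across all 600 million pieces.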

Like enterprise search tools that perform some sort of text mining and subsequent clustering -- such as Autonomy, FAST or Endeca -- this software has the capacity to learn and refine what it puts together, identifying new content as more or less like the original items in the set. When it gets confused (such as when a document has distorted or torn edges), it refers the act of judgement to a human being. But what's especially interesting about this software is that it actually spawns slightly altered versions of itself that compete for computer time on the basis of success at finding matches. Now that's something I'd love to see from my local enterprise search vendor.
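To make the "competing versions" idea a little more concrete, here is a minimal sketch of how such a scheme might work in principle. It is my own assumption about the general shape of the approach, not a description of the actual project: each variant is reduced to a single hypothetical `tolerance` setting, and the function names are invented for illustration.

```python
import random


def allocate_slots(success, total_slots):
    """Split compute slots among variants in proportion to matches found.

    `success` maps a variant name to its (integer) match count.
    Slots lost to rounding down go to the most successful variant.
    """
    total_success = sum(success.values())
    if total_success == 0:
        share = {name: total_slots // len(success) for name in success}
    else:
        share = {name: total_slots * count // total_success
                 for name, count in success.items()}
    best = max(success, key=success.get)
    share[best] += total_slots - sum(share.values())
    return share


def spawn_variant(tolerance, rng, step=0.5):
    """Return a slightly altered copy of a variant's tolerance setting."""
    return max(0.0, tolerance + rng.uniform(-step, step))


def next_generation(population, success, rng):
    """Replace the least successful variant with a mutated copy of the best.

    `population` maps a variant name to its tolerance setting.
    """
    best = max(success, key=success.get)
    worst = min(success, key=success.get)
    updated = dict(population)
    if worst != best:
        updated[worst] = spawn_variant(population[best], rng)
    return updated
```

A run would alternate the two steps: measure each variant's matches, hand out the next round of compute with `allocate_slots`, then call `next_generation` so that weak variants are replaced by small mutations of strong ones.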

There are a few lessons to be learned here. First, this is a multi-year project with dedicated resources, which is more than most companies are willing to commit to their own document scanning and indexing efforts. Second, while pattern matching may seem like an exact way to search for things, there are always factors in play that require judgement and refinement -- be it subtle linguistic differences, synonyms, or even how someone happened to tear something up.

And finally -- although history will surely welcome the Stasi's carelessness -- you should never take content security and storage lightly. You may think content is "secure enough," until you realize just how good your new enterprise search tool is at indexing all your content, but how bad it is at tying into your ACLs and showing the right results only to those who should see them.
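As a rough illustration of what "showing the right results only to those who should see them" means in practice, here is a minimal sketch of query-time (late-binding) security trimming. It assumes a simplified model in which each document carries a set of allowed groups; the `Document` class, the substring `search` function and the group names are hypothetical and do not reflect any particular vendor's API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Document:
    """An indexed item plus a simplified ACL (hypothetical model)."""
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]


def search(index, query):
    """Naive full-text match: the engine happily finds everything."""
    needle = query.lower()
    return [doc for doc in index if needle in doc.text.lower()]


def trim_results(results, user_groups):
    """Keep only documents whose ACL shares a group with the user."""
    groups = frozenset(user_groups)
    return [doc for doc in results if doc.allowed_groups & groups]


def secure_search(index, query, user_groups):
    """Check ACLs at query time, after matching but before display."""
    return trim_results(search(index, query), user_groups)
```

The gap the post warns about is the difference between `search` and `secure_search`: an engine that indexes everything but skips the trimming step will surface a salary memo to anyone who types the right keyword.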

Now, why can't I get my snowflake cookies to all look exactly alike?

- Submitted by: Theresa Regli, Analyst
