Team Blog

Content cleanup in the former East Germany

Added by Theresa Regli on 26-Dec-2007 | Twitter: @TheresaRegli

There's no time like the holidays for catching up on back issues of The Economist (don't worry, we're baking cookies, too), and this morning I found myself engrossed by a tale of pattern matching. No, not pattern matching of snowflakes or Christmas knits, but of a set of documents ripped into 600 million pieces by East Germany's State Security Service (better known as the Stasi), back when the Berlin Wall was being torn down and the mob was at the gates. The Stasi were afraid of the documents falling into the wrong hands, so when the shredders failed, they frantically resorted to tearing the documents up piece by piece. And you thought getting your enterprise search engine to pull off late-binding security was tough?

In a project currently underway at Berlin's Fraunhofer Institute for Production Systems and Design Technology, software is being used to find patterns in these millions of Stasi-created fragments of paper and re-assemble them, jigsaw-puzzle style. As it works through the fragments, the software groups the scanned shreds of paper by identifying patterns in handwriting, paper color, paper texture, even ink color. Then, once a group of related shreds is found, the software puzzles the pieces together. In their haste, the Stasi actually helped this process quite a bit -- most of the fragments of the same document were found in the same bag. Or bucket. Category. Taxonomy facet, if you will.
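
To make the "bag as taxonomy facet" idea concrete, here is a minimal Python sketch of that first grouping pass. It is not the Fraunhofer software: the `Fragment` fields, the feature values, and the bag names are all made up for illustration, and a real system would compare features approximately rather than by exact equality.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    # Every field here is a hypothetical feature a scanner might extract.
    fragment_id: str
    bag: str            # the bag the shred was found in
    paper_color: str
    ink_color: str
    handwritten: bool


def group_fragments(fragments):
    """Bucket fragments that share the same coarse features."""
    groups = defaultdict(list)
    for frag in fragments:
        # The key acts like a combination of facets: bag, paper, ink, script.
        key = (frag.bag, frag.paper_color, frag.ink_color, frag.handwritten)
        groups[key].append(frag.fragment_id)
    return dict(groups)


fragments = [
    Fragment("f1", "bag-07", "white", "blue", True),
    Fragment("f2", "bag-07", "white", "blue", True),
    Fragment("f3", "bag-07", "yellow", "black", False),
    Fragment("f4", "bag-12", "white", "blue", True),
]

groups = group_fragments(fragments)
for key, ids in groups.items():
    print(key, ids)
```

Only fragments that land in the same bucket would then be passed to the far more expensive edge-matching step, which is why the Stasi's habit of filling one bag at a time is such a gift.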

Like enterprise search tools that perform some sort of text mining and subsequent clustering -- such as Autonomy, FAST or Endeca -- this software has the capacity to learn and refine what it puts together, identifying new content as more or less like the original items in the set. When it gets confused (such as when a document has distorted or torn edges), it refers the act of judgement to a human being. But what's especially interesting about this software is that it actually spawns slightly altered versions of itself that compete for computer time on the basis of success at finding matches. Now that's something I'd love to see from my local enterprise search vendor.
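
The article doesn't spell out how that competition works, so treat the following Python as a guess at the general shape rather than a description of the real thing. It assumes each variant is just a matcher with a slightly different similarity threshold, that compute time is handed out in proportion to past matches, and that scores in a middle band go to a person. The function names, thresholds, and success counts are all invented for the example.

```python
import random


def allocate_time(success_counts, total_slots):
    """Split compute slots among variants in proportion to past matches.

    Every variant keeps at least one slot so it can still prove itself,
    which means the shares may add up to slightly more than total_slots.
    """
    total = sum(success_counts.values())
    if total == 0:
        share = max(1, total_slots // len(success_counts))
        return {name: share for name in success_counts}
    return {
        name: max(1, round(total_slots * wins / total))
        for name, wins in success_counts.items()
    }


def spawn_variant(threshold, rng, step=0.05):
    """Return a slightly altered copy of a matcher's similarity threshold."""
    mutated = threshold + rng.uniform(-step, step)
    return min(1.0, max(0.0, mutated))


def decide(score, accept_at=0.9, reject_at=0.5):
    """Accept, reject, or hand an uncertain match to a human reviewer."""
    if score >= accept_at:
        return "accept"
    if score < reject_at:
        return "reject"
    return "human_review"


# Hypothetical match counts for three variants after one round.
success = {"variant_a": 30, "variant_b": 10, "variant_c": 0}
slots = allocate_time(success, total_slots=40)
child_threshold = spawn_variant(0.8, random.Random(0))

print(slots)
print(decide(0.95), decide(0.7), decide(0.2))
```

The point of the middle band in `decide` is the same as the human referral described above: the software handles the easy calls on its own and only escalates the torn-edge cases it can't score with confidence.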

There are a few lessons to be learned here. First, this is a multi-year project with dedicated resources, which is more than most companies are willing to commit to their own document scanning and indexing efforts. Second, while pattern matching may seem like an exact way to search for things, there are always factors in play that require judgement and refinement -- be it subtle linguistic differences, synonyms, or even how someone happened to tear something up.

And finally -- although history will surely welcome the Stasi's carelessness -- you should never take content security and storage lightly. You may think content is "secure enough," until you realize just how good your new enterprise search tool is at indexing all your content, and how bad it is at tying into your ACLs and showing the right results only to those who should see them.
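
For anyone who hasn't run into the problem, here is a toy Python illustration of what checking permissions at query time looks like. It isn't any vendor's API: the `Document` class, the group names, and the sample documents are invented, and the `allowed_groups` set stands in for what would really be a live lookup against the source repository's ACLs.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Stand-in for an ACL lookup against the source system (an assumption).
    allowed_groups: set = field(default_factory=set)


def search(index, query, user_groups):
    """Return IDs of matching documents the user is allowed to see."""
    hits = [doc for doc in index if query.lower() in doc.text.lower()]
    # Late binding: permissions are checked when the query runs,
    # not baked in when the content was indexed.
    return [doc.doc_id for doc in hits if doc.allowed_groups & set(user_groups)]


index = [
    Document("d1", "Quarterly salary review", {"hr"}),
    Document("d2", "Salary bands FAQ", {"hr", "staff"}),
    Document("d3", "Holiday cookie recipes", {"staff"}),
]

print(search(index, "salary", {"staff"}))
print(search(index, "salary", {"hr"}))
```

Drop the last line of `search` and every employee who types "salary" sees the HR review, which is exactly the "secure enough" surprise described above.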

Now, why can't I get my snowflake cookies to all look exactly alike?

Categories: Theresa Regli, Component Content Management, Search and Information Access, Implementation, Information Architecture, Endeca Information Access Platform, FAST ESP, IDOL Server

Copyright, 2001 - 2010, Real Story Group. All rights reserved.
