
Search the X-Files: unknown entities

Added by Adriaan Bloem on 21-Sep-2007 | Twitter: @adriaanbloem

If you're in the market for search technology, you probably hear a lot about faceted browsing, guided navigation, refining, clustering, categorization, and so on. Many of today's search engines attempt to present more than just keyword search. That's fine if your content has high-quality structured metadata, but what if you throw in thousands of Word documents where the "author" is defined as John Doe? The truth may be out there -- but the answer is buried deep in the text.

Distilling things like people, email addresses, and company names from source content is known as "entity extraction." Vendors may tell you that yes, their search interface pivots off that kind of data (e.g., for guided navigation), but don't worry: they can extract the unknown entities even if you throw in large files nobody ever bothered to tag properly. The promise is that enterprise search will create and then reveal structure where once there was chaos.

Of course, this is not at all the black magic it is made out to be. Finding relevant entities is usually accomplished through a combination of pattern-matching and dictionaries. An email address will contain the "@" symbol, and it's pretty safe to say that if it's followed by a dotted domain name, you've got your address. If "John" is in your dictionary of first names, the next capitalized word will probably be the surname. This also means that entity extraction is language- and even country-specific. A representative of Fast Search & Transfer's professional services told me about the challenges the company faced finding a fail-safe way of distilling German street addresses, which have a very different and much less formal structure than those in, say, North America.
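To make the pattern-plus-dictionary idea concrete, here is a minimal Python sketch. It is an illustration only, not how ISYS, FAST, or any other product does it: the tiny `FIRST_NAMES` dictionary, both regular expressions, and the function names are assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny dictionary of first names (an assumption).
FIRST_NAMES = {"John", "Tony", "Janus"}

# Pattern rule: something, then "@", then a dotted domain name.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# A capitalized word followed by another capitalized word. The lookahead
# keeps matches from consuming the second word, so overlapping pairs
# such as "Contact Tony" and "Tony Byrne" are both considered.
CAPITALIZED_PAIR_RE = re.compile(r"\b([A-Z][a-z]+)(?=\s+([A-Z][a-z]+)\b)")


def extract_emails(text):
    """Return everything that looks like an email address."""
    return EMAIL_RE.findall(text)


def extract_people(text):
    """Return 'First Last' pairs whose first word is in the dictionary."""
    people = []
    for match in CAPITALIZED_PAIR_RE.finditer(text):
        first, last = match.group(1), match.group(2)
        if first in FIRST_NAMES:
            people.append(f"{first} {last}")
    return people


sample = "Contact Tony Byrne at tony@example.com. John Doe wrote it."
print(extract_people(sample))
print(extract_emails(sample))
```

Even this toy version shows the weak spots: a first name missing from the dictionary is silently skipped, and the capitalization rule bakes in English-style naming conventions.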

Many vendors, of course, would rather you not be distracted by the details of their "automagical" ways of achieving this. Their method may be English- and US-specific, but hey, so what? If your company is based in the US and your content comes in English, you're fine. In reality, though, things are never that easy.

I was running a test of ISYS:web against the CMS Watch website and was pleasantly surprised to see that the out-of-the-box installation correctly identified several countries and had no problem working out that Tony Byrne is an actual person. It even managed to extract Janus Boye's somewhat more exotic Danish name. Unsurprisingly, Apoorv Durga was a bit too outlandish, and my ego wasn't hurt when Adriaan Bloem wasn't ranked among the people. But you really don't want to hand Theresa Regli ammunition by ignoring her (which it did). On the other hand, I can't recall ever having met "Read More," international man of mystery, now a full-fledged person in my search engine.

This is not to say you should bash ISYS for this: the company is the first to admit its methods aren't infallible, and many vendors at a much higher price point don't even offer similar technology, relying instead on third-party tools. What it does mean, however, is that you shouldn't take claims that "it's all taken care of" at face value. Investigate whether the languages and countries relevant to you are supported and, better still, test against your own content. Then assume you are committing yourself to near-constant system training and tweaking.
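One way to make "test against your own content" measurable is to hand-label a small sample and compare it with what the engine extracts. The Python sketch below assumes you can export the extracted entities as a plain list; the function names are hypothetical, and the two name lists simply restate the anecdote above rather than a real measurement.

```python
def precision_recall(extracted, expected):
    """Compare extracted entities with a hand-labeled list.

    precision: share of extracted entities that are correct.
    recall:    share of expected entities that were found.
    """
    extracted, expected = set(extracted), set(expected)
    hits = extracted & expected
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


def error_report(extracted, expected):
    """List what to tune next: false positives and missed entities."""
    extracted, expected = set(extracted), set(expected)
    return {
        "false_positives": sorted(extracted - expected),
        "missed": sorted(expected - extracted),
    }


# Only the names mentioned in this post, for illustration.
found = ["Tony Byrne", "Janus Boye", "Read More"]
labeled = ["Tony Byrne", "Janus Boye", "Apoorv Durga",
           "Adriaan Bloem", "Theresa Regli"]

print(precision_recall(found, labeled))
print(error_report(found, labeled))
```

Rerunning a check like this after every dictionary or pattern change gives a rough sense of whether the tweaking is paying off.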

Failing that, some search products will allow you to specify additional criteria (with ISYS, for instance, "Theresa" was easily added with a [pre] construct in a text-based configuration file). Others enable you to define completely new entities and patterns from scratch (such as FAST's processing in Python, or Endeca's XSL and Perl). Be very aware, though, that sending in a Mulder agent to investigate your X-Files might be a costly, ongoing adventure, lasting nine seasons of suspense.
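Defining a new entity type from scratch usually boils down to registering another named pattern. The Python sketch below is generic and does not reflect the actual ISYS configuration syntax or the FAST and Endeca processing APIs; the `EntityRegistry` class, its methods, and the `us_zip` pattern are all hypothetical.

```python
import re


class EntityRegistry:
    """Hypothetical registry mapping entity types to regex patterns."""

    def __init__(self):
        self._patterns = {}

    def register(self, entity_type, pattern):
        """Add or replace the pattern for an entity type."""
        self._patterns[entity_type] = re.compile(pattern)

    def extract(self, text):
        """Return {entity_type: [matched strings]} for every registered type."""
        return {
            entity_type: [m.group(0) for m in regex.finditer(text)]
            for entity_type, regex in self._patterns.items()
        }


registry = EntityRegistry()
registry.register("email", r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# A country-specific entity: US ZIP codes, with optional ZIP+4 suffix.
registry.register("us_zip", r"\b\d{5}(?:-\d{4})?\b")

print(registry.extract("Send it to info@example.com, Boston, MA 02110-1234."))
```

Every pattern added this way is one more thing to maintain and retest, which is where the ongoing cost comes from.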

Categories: Adriaan Bloem, Search and Information Access, Industry Standards, Information Architecture, Selecting Technology, Endeca Information Access Platform, FAST ESP, ISYS Search Suite


Copyright, 2001 - 2010, Real Story Group. All rights reserved.
