
Search the X-Files: unknown entities

21-Sep-2007

If you're in the market for search technology, you probably hear a lot about faceted browsing, guided navigation, refining, clustering, categorization, and so on. Many of today's search engines attempt to present more than just keyword search. That's fine if your content has high-quality structured metadata, but what if you throw in thousands of Word documents where the "author" is defined as John Doe? The truth may be out there -- but the answer is buried deep in the text.

Distilling things like people, email addresses, and company names from source content is known as "entity extraction." Vendors may tell you that yes, their search interface pivots off that kind of data (e.g., for guided navigation), but don't worry: they can extract the unknown entities even if you throw in large files nobody ever bothered to tag properly. Enterprise search will create and then reveal structure where once there was chaos.

Of course, this is not at all the black magic it is made out to be. Finding relevant entities is usually accomplished through a combination of pattern-matching and dictionaries. An email address will contain the "@" symbol, and it's pretty safe to say that if it's followed by a dotted domain name, you've got your address. If "John" is in your dictionary of first names, the next capitalized word will probably be the surname. This also means that entity extraction is language- and even country-specific. A representative of Fast Search & Transfer's professional services told me about the challenges the company faced finding a fail-safe way of distilling German street addresses, which have a very different and much less formal structure than those in, say, North America.
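To make the pattern-plus-dictionary idea concrete, here is a minimal sketch in Python. It is not how ISYS, FAST, or any other product does it. The names FIRST_NAMES, EMAIL_RE, NAME_RE, extract_emails, and extract_people are invented for illustration, the dictionary is deliberately tiny, and the email pattern is far looser than a real address validator.

```python
import re

# Hypothetical, deliberately tiny dictionary of known first names.
FIRST_NAMES = {"John", "Tony", "Janus"}

# Simplified email pattern: a local part, "@", then a dotted domain name.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# A capitalized word followed by another capitalized word. The lookahead
# lets matches overlap, so "Contact Tony Byrne" yields both
# ("Contact", "Tony") and ("Tony", "Byrne") as candidate pairs.
NAME_RE = re.compile(r"\b([A-Z][a-z]+)(?= ([A-Z][a-z]+)\b)")


def extract_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


def extract_people(text, first_names=FIRST_NAMES):
    """Return 'First Last' pairs whose first word is in the dictionary."""
    return [
        f"{first} {last}"
        for first, last in NAME_RE.findall(text)
        if first in first_names
    ]
```

Run over a sentence mentioning Tony Byrne and Theresa Regli, extract_people would find only the former, because "Theresa" is not in the dictionary. The capitalization rule is also English-centric: it already breaks on surnames such as "van der Berg" or "McDonald."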

Many vendors, of course, would rather you weren't distracted by the details of their "automagical" ways of achieving this. Their method may be English- and US-specific, but hey, so what -- if your company is based in the US and your content comes in English, you're fine. In reality, though, things are never that easy.

I was running a test of ISYS:web against the CMS Watch website, and was pleasantly surprised to see that the out-of-the-box installation correctly identified several countries and had no problem working out that Tony Byrne is an actual person. It even managed to extract Janus Boye's somewhat more exotic Danish name. Unsurprisingly, Apoorv Durga was a bit too outlandish, and my ego wasn't hurt when Adriaan Bloem wasn't ranked among the people. But you really don't want to provide Theresa Regli with cannon fodder by ignoring her (which it did), while on the other hand, I can't recall ever having met "Read More," international man of mystery, now a full-fledged person in my search engine.

This is not to say you should bash ISYS for this -- the company is the first to admit its methods aren't infallible, and many vendors at a much higher price point don't even offer similar technology, instead relying on third-party tools. What it does mean, however, is that you shouldn't take claims that "it's all taken care of" at face value. Investigate whether the languages and countries relevant to you are supported, and better still, test against your own content. Then assume you are committing yourself to near-constant system training and tweaking.
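One way to make "test against your own content" measurable is to hand-label a small sample of documents and compare the engine's extracted entities with that list. The sketch below assumes you can export the extracted names as plain strings. The function name score_extraction is made up for this example, and the names in the usage lines are taken from the anecdote above purely as an illustration, not as a measured result.

```python
def score_extraction(extracted, expected):
    """Compare extracted entities with a hand-labeled list.

    Returns (precision, recall, false_positives, missed).
    """
    extracted, expected = set(extracted), set(expected)
    hits = extracted & expected
    # Precision: what share of the extracted entities are real.
    precision = len(hits) / len(extracted) if extracted else 0.0
    # Recall: what share of the real entities were found.
    recall = len(hits) / len(expected) if expected else 0.0
    false_positives = sorted(extracted - expected)
    missed = sorted(expected - extracted)
    return precision, recall, false_positives, missed


if __name__ == "__main__":
    found = ["Tony Byrne", "Janus Boye", "Read More"]
    labeled = ["Tony Byrne", "Janus Boye", "Apoorv Durga",
               "Adriaan Bloem", "Theresa Regli"]
    print(score_extraction(found, labeled))
```

Rerunning a check like this after every dictionary or pattern change gives a rough sense of whether the tweaking is paying off.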

Failing that, some search products will allow you to specify additional criteria (with ISYS, for instance, "Theresa" was easily added with a [pre] construct in a text-based configuration file). Others enable you to define completely new entities and patterns from scratch (such as FAST's processing in Python, or Endeca's XSL and Perl). Be very aware, though, that sending in a Mulder agent to investigate your X-Files might be a costly, ongoing adventure, lasting nine seasons of suspense.
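The kinds of customization described above (adding a name to a dictionary, defining a new entity pattern, suppressing a known false positive) can be sketched in generic Python. This is a toy and does not reproduce ISYS's configuration syntax, FAST's Python processing, or Endeca's tooling. The class EntityExtractor and its methods are hypothetical names, and the "product" pattern is an assumption made up for the example.

```python
import re
from collections import defaultdict


class EntityExtractor:
    """Toy rule-based extractor; not any vendor's API."""

    def __init__(self):
        self._patterns = defaultdict(list)  # entity type -> compiled patterns
        self._stop_phrases = set()          # matches to discard

    def add_pattern(self, entity_type, regex):
        """Register a new pattern, possibly for a brand-new entity type."""
        self._patterns[entity_type].append(re.compile(regex))

    def add_first_name(self, name):
        """Treat the given first name plus one capitalized word as a person."""
        self.add_pattern("person", rf"\b{re.escape(name)} [A-Z][a-z]+\b")

    def add_stop_phrase(self, phrase):
        """Never report this exact phrase as an entity."""
        self._stop_phrases.add(phrase)

    def extract(self, text):
        """Return a dict of entity type -> list of unique matches."""
        found = defaultdict(list)
        for entity_type, patterns in self._patterns.items():
            for pattern in patterns:
                for match in pattern.finditer(text):
                    value = match.group(0)
                    if (value not in self._stop_phrases
                            and value not in found[entity_type]):
                        found[entity_type].append(value)
        return dict(found)


if __name__ == "__main__":
    extractor = EntityExtractor()
    extractor.add_first_name("Theresa")                   # dictionary addition
    extractor.add_pattern("product", r"\bISYS:\w+")       # new entity type
    extractor.add_pattern("person", r"\bRead [A-Z][a-z]+\b")  # an overeager rule
    extractor.add_stop_phrase("Read More")                # known false positive
    print(extractor.extract("Theresa Regli tested ISYS:web. Read More"))
```

Every rule added this way is one more thing to maintain, which is where the ongoing cost comes from.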

- Submitted by: Adriaan Bloem, Contributing Analyst
