TrendWatch Blog

Read me that file so I can index it, please

08-Apr-2009

One of those easy-to-overlook but important details of a search engine: will it actually read your files? You may be interested in Lucene, but you'll have to find a way to feed it Office documents and PDFs.

Search engines don't index the Word document or PDF directly; they index text. This is where document filters come into play. These do their best to get the text out of the file (and usually some metadata, such as an "author" field). If you've ever tried to open some exotic document format in a plain text editor (e.g., Notepad or vi), you'll understand this can be far from trivial: many of these formats aren't very straightforward.
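To make the idea concrete, here is a minimal sketch of what a filter does for one comparatively friendly format, Office 2007's .docx (a zip package of XML parts). The function name `extract_docx` is hypothetical and this is not any vendor's filter API; it assumes a well-formed file and ignores headers, footnotes, text boxes and tables, which is exactly where real filters earn their keep.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside .docx packages.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_docx(source):
    """Return (text, metadata) for a .docx given a path or file-like object.

    Hypothetical helper for illustration only; not a production filter.
    """
    with zipfile.ZipFile(source) as package:
        body = ET.fromstring(package.read("word/document.xml"))
        # Each <w:p> is a paragraph; its visible text lives in <w:t> runs.
        paragraphs = [
            "".join(run.text or "" for run in para.iter(f"{{{W_NS}}}t"))
            for para in body.iter(f"{{{W_NS}}}p")
        ]
        metadata = {}
        # Core properties are optional, so tolerate a missing part.
        if "docProps/core.xml" in package.namelist():
            core = ET.fromstring(package.read("docProps/core.xml"))
            creator = core.find(f"{{{DC_NS}}}creator")
            if creator is not None and creator.text:
                metadata["author"] = creator.text
    return "\n".join(paragraphs), metadata
```

Calling `extract_docx("report.docx")` (the file name is just an example) would hand back the plain text plus an "author" entry when the document records one. The older binary Office formats and PDF offer no such shortcut.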

The problem isn't just finding the text; there are quite a few complications: reading across two- or three-column layouts, what to do with footnotes, or what to index, period. Spreadsheets are troublesome, but what do you make of images, audio, video? And for many scenarios (like indexing a file share) there will be exotic file types to deal with. (I recall the comments at a municipality once: "But we don't have any exotic file types." Three months later, a full crawl unearthed a stack of CAD/CAM files that were vital for planning.) To make matters worse, file formats change with each new software version (will the converter read Office 2007 or just Office 95?).
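One way to avoid the municipality's surprise is to take stock of a file share before committing to a set of filters. The sketch below is only an illustration: the `inventory` function and the `SUPPORTED` set are assumptions, to be replaced with the formats your chosen filters actually claim to read. It also trusts file extensions, which a real crawl should not.

```python
from collections import Counter
from pathlib import Path

# Assumption: extensions your filters handle. Replace with the real list.
SUPPORTED = {".doc", ".docx", ".pdf", ".txt", ".html"}


def inventory(root, supported=SUPPORTED):
    """Count files per extension under root, split into covered and uncovered."""
    counts = Counter(
        path.suffix.lower() or "(none)"
        for path in Path(root).rglob("*")
        if path.is_file()
    )
    covered = {ext: n for ext, n in counts.items() if ext in supported}
    uncovered = {ext: n for ext, n in counts.items() if ext not in supported}
    return covered, uncovered
```

Anything that turns up in the second dictionary, such as a stack of CAD drawings, is content the index would silently miss.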

Since it's complicated to build and maintain good filters, most vendors buy them off the shelf. As I've talked about before, the market has been cornered by Oracle (with the INSO filters) and Autonomy (with the KeyView filters). Almost all the search engines out there use either Oracle's or Autonomy's converters. A notable exception is Microsoft, which has its own standard for this, IFilters. But IFilters vary in quality and don't always work with every Microsoft product, and you may very well have to build a custom filter yourself for some ancient or rare software.

And there's ISYS -- probably the only vendor we cover in our Search & Information Access Report that has developed converters for over 200 document types entirely by itself. (Even Oracle and Autonomy didn't really build their filters themselves -- they bought the companies that produced them.)

It makes sense, then, that ISYS now tries to bank on that hidden capital. The vendor announced last week it's releasing its File Readers as a separately available product. It'll be interesting to see these show up in Lucene implementations (and in content management systems embedding search). More options means more choice. Black may be the fastest drying paint, but maybe you can now have that Model T in purple again.

- Submitted by: Adriaan Bloem, Analyst - Twitter: adriaanbloem

CMS Watch