Team Blog

Read me that file so I can index it, please

Added by Adriaan Bloem on 8-Apr-2009 | Twitter: @adriaanbloem

One of those easy-to-overlook but important details of a search engine: will it actually read your files? You may be interested in Lucene, but you'll have to find a way to feed it Office documents and PDFs.

Search engines don't directly index the Word document or PDF itself; they index text. This is where document filters come into play. These do their best to get the text out of the file (and usually some metadata, such as an "author" field). If you've ever tried to open some exotic document format in a plain text editor (e.g., Notepad or vi), you'll understand this can be far from trivial: many of these formats aren't very straightforward.
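To make the idea of a filter concrete, here is a minimal sketch in Python. It uses plain HTML as a stand-in for a richer format, because the standard library can parse it; real Word or PDF filters are far more involved. The names `HtmlFilter`, `extract`, and `index_text` are made up for illustration and don't belong to any vendor's product.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HtmlFilter(HTMLParser):
    """Hypothetical document filter: pulls visible text and an author field out of HTML."""

    def __init__(self):
        super().__init__()
        self.chunks = []    # visible text fragments, in document order
        self.metadata = {}  # e.g. {"author": "..."}
        self._skip = 0      # depth inside <script>/<style>, which hold no readable text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") == "author" and attrs.get("content"):
                self.metadata["author"] = attrs["content"]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def extract(raw_html):
    """Run the filter and return (text, metadata): the only things the engine sees."""
    f = HtmlFilter()
    f.feed(raw_html)
    f.close()
    return " ".join(f.chunks), f.metadata


def index_text(index, doc_id, text):
    """Toy inverted index: map each lowercased word to the documents containing it."""
    for word in text.lower().split():
        token = word.strip(".,;:!?\"'()")
        if token:
            index[token].add(doc_id)
```

The point of the split is that `index_text` never touches the original file. Whatever the filter fails to extract simply doesn't exist as far as search is concerned.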

The problem isn't just finding the text. There are quite a few complications: reading across two- or three-column layouts, what to do with footnotes, or what to index, period. Spreadsheets are troublesome, but what do you make of images, audio, video? And for many scenarios (like indexing a file share) there will be exotic file types to deal with. (I recall the comments at a municipality once: "But we don't have any exotic file types." Three months later, a full crawl unearthed a stack of CAD/CAM files that were vital for planning.) To make matters worse, file formats change with each new software version (will the converter read Office 2007 or just Office 95?).
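One way to avoid the municipality's surprise is to take stock of the file share before the first crawl. The sketch below is just that, a sketch: the `SUPPORTED` set is an assumption you would replace with whatever your filter pack actually documents, and `inventory` and `unsupported` are illustrative names.

```python
from collections import Counter
from pathlib import Path

# Assumption: extensions the chosen filter pack is known to handle.
SUPPORTED = {".doc", ".docx", ".pdf", ".txt", ".xls", ".ppt"}


def inventory(root):
    """Count files under root by lowercased extension ("" means no extension)."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())


def unsupported(counts, supported=SUPPORTED):
    """Return (extension, count) pairs that have no filter, most common first."""
    return [(ext, n) for ext, n in counts.most_common() if ext not in supported]
```

Running `unsupported(inventory("/path/to/share"))` gives a quick list of the types nobody thought to mention. An extension only hints at the format, of course; it says nothing about which version of the software wrote the file.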

Since it's complicated to build and maintain good filters, most vendors buy them off-the-shelf. As I've talked about before, the market has been cornered by Oracle (with the INSO filters) and Autonomy (with the KeyView filters). Almost all the search engines out there use either Oracle's or Autonomy's converters. A notable exception is Microsoft, which has its own standard for this, IFilters. But IFilters are of varying quality, they don't always work with every Microsoft software product, and you may very well have to build a custom filter yourself for some ancient or rare software.

And there's ISYS -- probably the only vendor we cover in our Search & Information Access Report that has developed converters for over 200 document types entirely by itself. (Even Oracle and Autonomy didn't really build filters themselves -- they bought the companies that produced them.)

It makes sense, then, that ISYS now tries to bank on that hidden capital. The vendor announced last week it's releasing its File Readers as a separately available product. It'll be interesting to see these show up in Lucene implementations (and in content management systems embedding search). More options means more choice. Black may be the fastest drying paint, but maybe you can now have that Model T in purple again.

Categories: Adriaan Bloem, Search and Information Access, Industry Standards, Marketplace at Large, IDOL Server, ISYS Search Suite, Lucene, Secure Enterprise Search 10g

Copyright, 2001 - 2010, Real Story Group. All rights reserved.
