

TrendWatch Blog

Enterprise Search Scalability: A Big Issue

09-Jul-2008

I was talking to a search vendor the other day who said something that really got my attention. He remarked that a customer recently came to him and asked what it would take, in terms of software, hardware, and time, to index 30 billion documents. Mind you, this was not some hypothetical exercise. The question came from somebody whose company actually has 30 billion documents under management.

Consider the dimensions of the problem. Assuming (for purposes of argument) you could index a thousand documents per second on one machine, it would take nearly a full year (about 347 days) just to build the index for 30 billion docs. If the solution scales linearly, building (or rebuilding) the index would keep a 100-machine server farm busy for about three and a half days.
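The back-of-envelope arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes perfectly linear scaling across machines, which real systems rarely achieve.

```python
SECONDS_PER_DAY = 86_400


def build_time_days(num_docs, docs_per_second, machines=1):
    """Estimate days needed for a full index build.

    Assumes every machine sustains the same rate and that adding
    machines scales throughput linearly (an idealization).
    """
    if num_docs < 0 or docs_per_second <= 0 or machines < 1:
        raise ValueError("need num_docs >= 0, docs_per_second > 0, machines >= 1")
    total_rate = docs_per_second * machines
    return num_docs / total_rate / SECONDS_PER_DAY


if __name__ == "__main__":
    docs = 30_000_000_000
    print(f"1 machine:    {build_time_days(docs, 1000):.1f} days")
    print(f"100 machines: {build_time_days(docs, 1000, 100):.2f} days")
```

Plugging in your own document counts and measured (not advertised) indexing rates gives a quick sanity check on any vendor's sizing estimate.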

That's considering the scenario in a static context only. In the real world, of course, documents are revised (some frequently, others never, most somewhere in between). New docs enter the system. Old ones are dropped. Otherwise-unchanged docs are moved to new locations. Unless you can update your index(es) incrementally, in real time, as docs are added, deleted, modified, or moved, you have an index shelf-life problem.

The traditional answer to the shelf-life problem is to rebuild the index every few days (or every night, if resources allow). At the level of ten or twenty thousand docs, a total rebuild of the index every few days isn't a huge issue. But when you get beyond something like a few million docs, performing total rebuilds on a frequent basis quickly becomes a worst practice (if indeed it's practicable at all). At some point, you need the ability to do incremental indexing.
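To make the idea of incremental indexing concrete, here is a toy sketch in Python. It is not how any particular search product works: the class, its methods, and the whitespace tokenizer are all invented for illustration. It keeps a content fingerprint per document and touches the index only for docs that were added, changed, or removed since the last pass.

```python
import hashlib
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that re-indexes only what changed (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.doc_terms = {}               # doc id -> terms indexed for that doc
        self.fingerprints = {}            # doc id -> hash of last indexed content

    def _remove(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        self.fingerprints.pop(doc_id, None)

    def _add(self, doc_id, text, fingerprint):
        terms = set(text.lower().split())  # naive tokenizer, an assumption
        for term in terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = terms
        self.fingerprints[doc_id] = fingerprint

    def sync(self, snapshot):
        """Bring the index in line with snapshot, a dict of doc id -> text."""
        stats = {"added": 0, "updated": 0, "deleted": 0, "unchanged": 0}
        for doc_id in list(self.fingerprints):
            if doc_id not in snapshot:
                self._remove(doc_id)
                stats["deleted"] += 1
        for doc_id, text in snapshot.items():
            fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
            previous = self.fingerprints.get(doc_id)
            if previous == fingerprint:
                stats["unchanged"] += 1  # skip: no indexing work needed
                continue
            if previous is not None:
                self._remove(doc_id)
                stats["updated"] += 1
            else:
                stats["added"] += 1
            self._add(doc_id, text, fingerprint)
        return stats

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))
```

Note what the sketch leaves out. It still has to read every document to compute the fingerprint, and a moved document (same content, new id) is treated as a delete plus an add. Those are exactly the kinds of details worth asking a vendor about.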

But, someone will ask, can't you just throw more machine resources (threads, memory, cycles) at the problem? Yes and no. If you're spidering files over the wire, bandwidth exhaustion becomes an issue. If you're indexing files locally, there's an OS-imposed limit on how many files you can have open at once. There's also the question of how much file data you can hold in memory. The reason this is important is that some search systems (quite a few, actually) need to load an entire document into memory before the doc can be indexed. If you're indexing 10-megabyte PDFs, it might not matter how many threads you have available. (Note, incidentally, that most docs occupy a lot more space in memory than on disk.) And anyway, the CPU can execute only so many instructions per second, no matter how many docs you can load at once.
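The difference between loading a whole document and stream-processing it can be sketched as follows. This is a simplified illustration for plain text only (formats such as PDF need a real parser), and the function names and chunk size are assumptions. The point is that memory use is bounded by the chunk size rather than by the document size.

```python
import io
from collections import Counter


def stream_tokens(stream, chunk_chars=64 * 1024):
    """Yield whitespace-separated tokens while reading in fixed-size chunks.

    Only one chunk, plus any token cut off at the chunk boundary, is held
    in memory at a time.
    """
    carry = ""  # partial token left over from the previous chunk
    while True:
        chunk = stream.read(chunk_chars)
        if not chunk:
            break
        buffer = carry + chunk
        parts = buffer.split()
        if parts and not buffer[-1].isspace():
            # The last token may continue in the next chunk; hold it back.
            carry = parts.pop()
        else:
            carry = ""
        yield from parts
    if carry:
        yield carry


def term_counts(stream, chunk_chars=64 * 1024):
    """Count lower-cased terms without loading the whole stream."""
    return Counter(token.lower() for token in stream_tokens(stream, chunk_chars))


if __name__ == "__main__":
    sample = io.StringIO("alpha beta gamma alpha")
    print(term_counts(sample, chunk_chars=3))
```

One caveat: a single very long run of non-whitespace characters would still accumulate in memory, so a production tokenizer would also cap token length.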

I bring all this up for a couple of reasons. First, if you're shopping for a search solution, you need to regard the various vendors' performance claims with more than a modicum of caution. No two search scenarios are the same, obviously. But more than that, the parameters that affect scalability and performance are numerous and non-obvious (and their interactions subtle), tending to moot most performance claims straight out of the gate.

Takeaway No. 1: If you care about performance (and you should), do your own testing. Insist on it as part of any product evaluation.

Takeaway No. 2: Get your programmers involved in the evaluation process early. Some of these issues require computer-science expertise to evaluate properly.

Also (very important), when shopping for a search solution, don't buy for your present needs. Shop for your future needs. Your company probably has ten times more content under management today than it had just five years ago. Five years from now, it could have ten times more than today. Will your search solution scale appropriately? More particularly, how will it scale? Will it scale linearly? Will it hit a brick wall?

If I were searching for a search solution, I'd ask every vendor a few simple questions:
  • How big is your biggest customer installation and what did it take to build it?
  • Can your system do incremental indexing? How often is a full rebuild required?
  • Does your indexer need to read a document into memory (whole) before indexing it, or can files be stream-processed?
  • What's the largest document your system can index without either choking or stopping after a particular number of characters?
  • How does indexing performance change as the index gets bigger? (Not just "does it slow down?" but how does it slow? Linearly? Exponentially? If it's the latter, you're going to hit a brick wall.)
  • And: Do you support 64-bit architectures?
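For the question about how indexing performance changes as the index grows, your own testing can be as simple as timing each batch of documents and watching the trend. The harness below is a minimal sketch: `index_batch` stands in for whatever call feeds documents to the product under evaluation (a hypothetical placeholder, not a real API), and the clock is injectable so the harness itself can be checked deterministically.

```python
import time


def throughput_by_batch(index_batch, batches, clock=time.perf_counter):
    """Time each batch and return (cumulative_docs, docs_per_second) pairs.

    index_batch: callable that indexes one batch (placeholder for the
                 system under test).
    batches:     iterable of document batches (each supports len()).
    """
    results = []
    total_docs = 0
    for batch in batches:
        start = clock()
        index_batch(batch)
        elapsed = clock() - start
        total_docs += len(batch)
        rate = len(batch) / elapsed if elapsed > 0 else float("inf")
        results.append((total_docs, rate))
    return results


if __name__ == "__main__":
    indexed = []
    batches = [[f"doc-{i}-{j}" for j in range(1000)] for i in range(5)]
    for total, rate in throughput_by_batch(indexed.extend, batches):
        print(f"{total:>6} docs indexed, {rate:,.0f} docs/sec in last batch")
```

If the per-batch rate stays roughly flat as the cumulative count climbs, indexing cost is scaling close to linearly. A rate that keeps falling batch after batch is the early sign of the brick wall.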

Those are just a few conversation-starters. For more (lots more), be sure to consult our Enterprise Search Report 2008, a free sample of which is available online. And if you end up evaluating one or more search offerings in depth, please drop us a line and let us know what you learned. We're always interested in your feedback.

- Submitted by: Kas Thomas, Analyst
