Taxonomies
Value of Organized Knowledge
by Jack Bryar
21-Jan-2002

An Old Problem Gets Worse
In recent years, the volume of news and information resources available to the typical corporate employee has grown exponentially. Corporate Web traffic has jumped by over 600% annually. Web-available content exceeds several billion items. Executives frequently receive more than 200 e-mails a day. The amount of corporate data generated per employee doubles every 18 months.
- "If you printed the information available through our Intranet, it would stretch from the earth to the sun."
-- Marc Auckland, World-Wide Chief Knowledge Manager, British Telecom
Corporate managers are worried. Sixty percent say that info-glut is having a negative effect on productivity. IDC estimated that in 1999, US Fortune 500 companies lost $12 billion due to an inability to locate knowledge resources amidst all the clutter. Eighty percent of executives believe the problem will get worse before it gets better.
Adding to the strain is the fact that this content is so difficult to access. Corporate information exists in many forms. Each form (electronic news, email, databases, Web pages, archived documents, etc.) resides in its own format, accessible only through its own index system. Much of it is also scattered across the enterprise.
Yet access is critical. New applications have sprung up requiring access to information across the enterprise, sometimes across multiple businesses. These include next-generation customer care, competitive research, and B2B transactions. In order to get at the information needed to run these applications, the information itself needs to be restructured and reorganized -- and so does the method of getting at that information.
Info-Illiteracy: A Barrier to Finding Information
The problem of business info-glut is worse than it appears. Many employees lack the skills needed to find the information they require.
For years, putting tools in the hands of the users was considered the best way for companies and their knowledge workers to get their hands on the information they needed. In most cases, that has meant providing users with a search engine similar to systems found on public Websites. Today, many knowledge workers have to navigate as many as six different search engines and database indexes each day.
New research casts doubt on how well search engines work for most users. A study of AltaVista users revealed a surprising amount of info-illiteracy. According to that study:
- 80% could not or would not build a working Boolean search
- 87% used fewer than three words

A big part of the problem is that the same term can have different meanings to different people. Not knowing which terms will uncover sought-after information is a significant barrier for many knowledge workers. Any successful strategy for managing information has to overcome this problem.
- In 1814, Thomas Jefferson was so dissatisfied with the ruined and disorganized state of the documents at the Library of Congress that he donated his collection and then personally reclassified all the books there.
-- Source: Systems of Knowledge Organization for Digital Libraries, Gail Hodge
XML to the Rescue?
The Internet has been described as the world's largest library, with the books thrown all over the floor. Many corporate information systems look just as disorganized. Information managers are convinced that the best solution to this clutter involves wrapping all electronic document forms inside a common format, so that the content inside can be more easily found and used by different applications.
The wrapper being used by most organizations today is XML. XML allows the tagging of a document with a description of what the document is about, and where it came from. Searching on XML meta-tags can certainly simplify the search process.
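As a rough illustration of the idea, the sketch below wraps plain text in a small XML envelope that records what the document is about and where it came from, then searches on the subject tag. The element names (`document`, `meta`, `subject`, `source`, `body`) and the helper functions are hypothetical, chosen for this example only; they do not come from any particular XML standard or product.

```python
import xml.etree.ElementTree as ET


def wrap_document(body_text, subject, source):
    """Wrap plain text in a small XML envelope with descriptive metadata.

    The tag names used here are illustrative assumptions, not a standard.
    """
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "meta")
    ET.SubElement(meta, "subject").text = subject
    ET.SubElement(meta, "source").text = source
    ET.SubElement(doc, "body").text = body_text
    return doc


def find_by_subject(docs, subject):
    """Return the documents whose <subject> tag matches exactly."""
    return [d for d in docs if d.findtext("meta/subject") == subject]


docs = [
    wrap_document("Carrier rolls out faster broadband.",
                  "Digital Subscriber Line", "newswire"),
    wrap_document("Quarterly sales summary.", "Finance", "intranet"),
]

matches = find_by_subject(docs, "Digital Subscriber Line")
xml_text = ET.tostring(matches[0], encoding="unicode")
print(xml_text)
```

Note that the search only succeeds if the person tagging and the person searching use exactly the same subject string.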
Unfortunately, XML does not solve the problem of finding information. It only standardizes the problem. It requires that any XML tagging system clearly understand what the document is about, and it needs to anticipate the search process someone might try to use to find it. This takes time, a great deal of sophistication, or both. Otherwise, the process results in hiding essential documents behind generic, idiosyncratic or meaningless tags, making the information management and retrieval problem even worse.
In order for XML tagging to be meaningful for search and retrieval, the terms used to tag content have to be intuitive enough to encourage their use by information-seekers. They should be structured in a standardized way: less like a set of variable keywords and more like a set of subject categories. These subject categories should be set up in a hierarchical fashion, with logical subtopics and overviews. This, in short, is a taxonomy.
The ability to search or manipulate content "by category" is an essential benefit of a successful XML tagging process.
Taxonomies Defined
Taxonomies are sometimes called "classification schemes" or "categorization schemes." Each refers to grouping together similar items into broad "buckets" or "topics" which themselves can be grouped together in ever-broader "hierarchies." Examples of taxonomies include systems as diverse as the Dewey Decimal system found in small libraries, Yahoo's Subject Index, and the massive taxonomic system proposed by Linnaeus and used by generations of biology students. Wherever they are used, they have the same goal -- to organize knowledge about a given subject.
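One simple way to picture such a hierarchy is as nested categories, each containing narrower ones. The sketch below is a minimal illustration under that assumption; the category names are invented for the example and are not drawn from any real taxonomy, and the helper functions are hypothetical.

```python
# Hypothetical categories, for illustration only.
TAXONOMY = {
    "Technology": {
        "Telecommunications": {
            "Broadband": {
                "Digital Subscriber Line": {},
                "Cable Modem": {},
            },
        },
    },
    "Finance": {
        "Mergers and Acquisitions": {},
    },
}


def path_to(category, tree=TAXONOMY, trail=()):
    """Return the chain of categories from the root down to `category`, or None."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == category:
            return here
        found = path_to(category, children, here)
        if found:
            return found
    return None


def _flatten(tree):
    """List every category name in `tree`, parents before their children."""
    names = []
    for name, children in tree.items():
        names.append(name)
        names.extend(_flatten(children))
    return names


def descendants(category, tree=TAXONOMY):
    """Return every category nested under `category`, or None if it is unknown."""
    for name, children in tree.items():
        if name == category:
            return _flatten(children)
        found = descendants(category, children)
        if found is not None:
            return found
    return None


print(" > ".join(path_to("Digital Subscriber Line")))
print(descendants("Telecommunications"))
```

Walking up the tree gives the broader "buckets" a topic belongs to; walking down gives the narrower subtopics gathered under it.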
A sample taxonomy from NewsEdge:

Taxonomies and The Search Process
Perhaps the greatest benefit of taxonomies is improved searching.
Properly constructed taxonomies simplify the process of gathering "the right" information for daily business use by simplifying the vocabulary used in the search process. Tagging systems using raw keywords or similar strategies are likely to generate search error rates approaching those of straight text searches. For example, while a search on the word "DSL" will find stories on a particular type of broadband technology, it will miss others that never use the abbreviation, and may accidentally find content referring to Dutch sign language or data sublanguages.
A better approach would be to define these documents as belonging to the subject category "Digital Subscriber Line." If the searcher can focus on a proven set of categories rather than guess at keywords, the chances of finding the right content are far greater, and the process will be faster and more reliable.
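The difference can be sketched with a toy collection. The three documents, their category assignments, and the two search functions below are all made up for illustration; the point is only to show how a raw keyword match and a category lookup can disagree.

```python
import re

# Hypothetical documents, each already assigned to subject categories.
DOCS = [
    {"id": 1,
     "text": "New DSL service brings broadband to rural homes.",
     "categories": {"Digital Subscriber Line"}},
    {"id": 2,
     "text": "Phone companies expand digital subscriber line coverage.",
     "categories": {"Digital Subscriber Line"}},
    {"id": 3,
     "text": "Course teaches DSL, Dutch sign language, to beginners.",
     "categories": {"Sign Languages"}},
]


def keyword_search(term):
    """Return ids of documents whose text contains `term` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [d["id"] for d in DOCS if pattern.search(d["text"])]


def category_search(category):
    """Return ids of documents tagged with `category`."""
    return [d["id"] for d in DOCS if category in d["categories"]]


print(keyword_search("DSL"))                       # misses doc 2, picks up doc 3
print(category_search("Digital Subscriber Line"))  # finds docs 1 and 2
```

In this toy data the keyword search returns one relevant and one irrelevant document and misses a relevant one, while the category search returns exactly the two broadband stories. The category approach is only as good as the tagging behind it.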
The most important contribution of taxonomies to the search process is that they work.
Even using a relatively primitive taxonomic system, Microsoft reported a 40% improvement in hit rates. Satisfaction metrics doubled. In addition, the time spent trying to find a given document was significantly reduced. The success rate of taxonomy-based searching reduces the strain on systems and on the people who use them.
Business is Complicated
Naturally, one of the most important criteria for a taxonomy is that it should be easy to navigate. But building solid taxonomies is much easier said than done. Consider, for example, a taxonomy of business subjects.
Businesses vary in size and have multiple points of focus. Business activities involve an array of subjects that do not always fall into logical groupings. Subject boundaries are often fuzzy.
Subject hierarchies can feel artificial, as content, particularly business-critical content, may fall into multiple categories. Indeed, most executive-level business documents involve several categories. Traditional indexing schemes dissolve in complexity as the number of unique concepts grows.
So, while some subjects are relatively easy to categorize, most business functions are not. (I should know: NewsEdge has spent several years developing a proprietary business taxonomy.) Nevertheless, you should seriously consider developing a taxonomy for the content management system residing underneath your e-business efforts in general, and your Intranet in particular. Your content contributors and end-users alike will be grateful.
