Document Analysis

A Lexicon for Document Analysis

by Tony Byrne
12-Aug-2007

One of the challenges of any content technology project is standardizing on a particular set of terminology. Without that, you risk confusion among the business users, developers, managers, vendor staff, and consultants who may all participate in your project.

One area where I see especially varied terminology is content analysis. Content analysis is becoming increasingly important, particularly on the web, where more enterprises want to take advantage of the latent structure of some of their online information to better re-use information across locales and channels.

It's particularly important to agree on what you will call document types and the elements that make up structured document types. At CMS Watch we use the term "type" for a particular document model or structure, and "element" for the constituent pieces that make up a content type. Of course, some or even most content types are not structured and therefore have no elements (or contain, essentially, a single "body" element).
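As a minimal sketch of this vocabulary, the distinction can be written down as a small data model. The class names (`ContentType`, `Element`) and the sample types (`press_release`, `memo`) are assumptions made up for illustration and do not come from any particular CMS.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Element:
    """One constituent piece of a content type (hypothetical model)."""
    name: str
    required: bool = True


@dataclass(frozen=True)
class ContentType:
    """A document model: a named structure made up of elements."""
    name: str
    # Unstructured types fall back to a single "body" element.
    elements: Tuple[Element, ...] = (Element("body"),)

    @property
    def is_structured(self) -> bool:
        # A type holding only "body" is treated as unstructured.
        return [e.name for e in self.elements] != ["body"]


# Hypothetical structured type with four elements.
press_release = ContentType(
    "press_release",
    (
        Element("headline"),
        Element("dateline"),
        Element("body"),
        Element("contact", required=False),
    ),
)

# Hypothetical unstructured type: no elements are declared.
memo = ContentType("memo")
```

Here `press_release.is_structured` is true and `memo.is_structured` is false, which mirrors the point that many types carry little more than a body.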

A Simple Thesaurus

Since the electronic publishing industry is still young, myriad analysts, consultants, and vendors use their own terminology, which can make things very confusing. Here's a chart of some commonly used synonyms or approximations for type and element.

"Types"                "Elements"
"Classes"              "Snippets"
"Objects"*             "Objects"*
"Documents / Pages"    "Nodes / Fields"
"Templates"            "Pagelets / Portlets"
"Chunks"*              "Chunks"*
"Archetypes"           "Styles"
"Models"               "Bricks / Parts / Fragments"
                       "Placeholders"
                       "Components"

* To make things even more confusing, sometimes different analysts will use the same word to mean different things. Indeed some will argue that types and elements are really just part of a theoretically limitless continuum of structure in your overall information architecture, and thus it's reasonable to employ the same term for both concepts.
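One way a project team might put the chart above to work is a small lookup that translates a vendor's term into the project's agreed vocabulary. The sketch below is only an illustration: the function name `classify_term` and the four return labels are assumptions, and the two sets simply transcribe the chart.

```python
# Terms transcribed from the chart above, lowercased (slashes split apart).
TYPE_SYNONYMS = {
    "types", "classes", "objects", "documents", "pages",
    "templates", "chunks", "archetypes", "models",
}
ELEMENT_SYNONYMS = {
    "elements", "snippets", "objects", "nodes", "fields", "pagelets",
    "portlets", "chunks", "styles", "bricks", "parts", "fragments",
    "placeholders", "components",
}


def classify_term(term: str) -> str:
    """Return 'type', 'element', 'ambiguous', or 'unknown' for a term."""
    key = term.strip().lower()
    in_types = key in TYPE_SYNONYMS
    in_elements = key in ELEMENT_SYNONYMS
    if in_types and in_elements:
        # Starred entries in the chart: the same word is used for both.
        return "ambiguous"
    if in_types:
        return "type"
    if in_elements:
        return "element"
    return "unknown"
```

With these sets, "Archetypes" resolves to a type and "Portlets" to an element, while "Objects" and "Chunks" come back as ambiguous, which is the cue to ask the speaker which meaning is intended.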

Synonyms for Content Types

Defining effective Content or Document Types lies at the core of all Web Content / Document / Records Management. Many subsystems within a management application will pivot off Content Types, including workflow, access control, templating, and classification.
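To make "pivot off Content Types" concrete, here is a minimal sketch in which workflow, access control, templating, and classification are all looked up by type name. Everything in it (`TypePolicy`, `POLICIES`, the role and template names) is hypothetical and not drawn from any specific product.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class TypePolicy:
    """Per-type settings that other subsystems consult (values hypothetical)."""
    workflow: Tuple[str, ...]   # ordered workflow steps
    editors: FrozenSet[str]     # roles allowed to edit items of this type
    template: str               # presentation template name
    category: str               # default classification


POLICIES = {
    "press_release": TypePolicy(
        ("draft", "legal_review", "published"),
        frozenset({"pr_team"}),
        "news.html",
        "News",
    ),
    "memo": TypePolicy(
        ("draft", "published"),
        frozenset({"staff"}),
        "plain.html",
        "Internal",
    ),
}


def next_step(type_name: str, current: str) -> Optional[str]:
    """Return the workflow step after `current`, or None at the last step."""
    steps = POLICIES[type_name].workflow
    position = steps.index(current)
    return steps[position + 1] if position + 1 < len(steps) else None


def can_edit(type_name: str, role: str) -> bool:
    """Access control keyed on the content type, not the individual item."""
    return role in POLICIES[type_name].editors
```

The design point is that none of these functions needs to know anything about a specific item beyond its type, which is why a poorly chosen set of types tends to ripple through every subsystem.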

But what should you call content types? In addition to types, you'll come across many other synonymous or loosely related terms.

"Classes" -- helpfully implies that you have classified your types, but carries significant baggage as a specific software term.

"Objects" -- a nice catch-all term that nonetheless can confuse developers if the type does not enable true object-oriented attributes like inheritance.

"Documents / Pages" -- typically understood as a particular instance of a type, and does not imply a standard model for a uniform collection of content items.

"Templates" -- a useful term, but one that most people associate with presentation models rather than structural types. Many vendors still use the term templates to imply "logical" templates, which is fine in traditional software development, but gets too confusing in the Web world, where templates are associated with layout or at least look-and-feel.

"Chunks" -- more commonly used to describe elements (see below).

"Archetypes" -- a more arcane alternative to types, but the reference to a standardized model perhaps makes it clearer.

"Models" -- an accurate term, inasmuch as the analytical work of defining a content type is often referred to as "content modeling." However, I find the term is sometimes too abstract and otherworldly for end-users. It's also confusing because in the information architecture world, "model" usually refers to a higher-level representation of a collection of content or a broader information set, rather than just an individual document. For example, a "repository model" could include the folder hierarchy and probably the classification scheme as well.

Synonyms for Content Elements

Content elements represent the constituent pieces of a structured content type. By breaking a document into its standard elements, you can do useful things with those elements, such as re-use them elsewhere, or define different formats for them on a web page. Many different consultants and vendors will employ handy -- but differing -- terms for elements.
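A short sketch of why separate elements are useful: once an item is stored element by element, the same content can be formatted one way on its own page and re-used in shorter form elsewhere. The item contents and the function names `render_page` and `render_teaser` are invented for illustration.

```python
from html import escape

# A hypothetical item, stored as element name -> text.
item = {
    "headline": "Acme & Co. Opens New Office",
    "summary": "A second location opens this fall.",
    "body": "Full text of the announcement.",
}


def render_page(item: dict) -> str:
    """Full web page view: each element gets its own markup."""
    return (
        f"<h1>{escape(item['headline'])}</h1>\n"
        f"<p class='summary'>{escape(item['summary'])}</p>\n"
        f"<div>{escape(item['body'])}</div>"
    )


def render_teaser(item: dict) -> str:
    """Re-use only the headline and summary, e.g. on a listing page."""
    return f"{item['headline']}: {item['summary']}"
```

If the item were stored as a single undifferentiated body, the teaser would have to be written by hand or scraped out of the page markup.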

"Snippets" -- often used to refer to discrete blocks of HTML (or code of any kind), the term is common lingo for some Web CMS developers and vendors.

"Objects" -- see above. Used by many consultants and content management vendors.

"Nodes / Fields" -- techie terms that come out of, respectively, the XML and relational database worlds. Nodes and fields may be technically accurate for how the data is persisted, but they just don't feel like content. Moreover, in practice a content element might span multiple nodes and fields in a data store. To be fair, element has a specific meaning in the XML world and in fact is often confused with node among XML practitioners.

"Pagelets / Portlets" -- Pagelets are a nice extension of the Web metaphor, but inaccurate if you are publishing to non-web devices; nevertheless, the term is used by some Web CMS vendors. Portlets represent elements within the world of (primarily Java-based) Portal software, but also refer to a particular specification around accessing remote content or services from within a portal framework.

"Styles" -- a potential misnomer from desktop and web publishing where "styles" are used to standardize the appearance of certain elements. For example, some content management or transformation utilities can map Word or Quark styles to specific repository elements in content conversion, essentially overloading a layout convention with structural meaning. The CMS underneath this website (Midgard) uses the term styles for elements.

"Chunks" -- a favored term of information architects and end-users alike, useful because it implies elements of varying size and purpose. Also conveniently morphs into the verb, "chunking."

"Bricks / Parts / Fragments" -- harkens back to the notion of page "assembly": manufacturing vernacular comes to e-publishing. Microsoft even calls its re-usable elements "Web Parts," although here they are more of an analogue to "portlets." "Bricks" are similar to "chunks," but imply you want to stack them, which is not always the case. "Fragments," like snippets, tend to imply a truncated subset of code.

"Placeholders" -- a nice, user-friendly term implying a "fillable" element, sometimes used by Microsoft and other vendors.

"Components" -- a popular and useful term, suggestive of document "decomposition" and "recomposition." The problem here is that components and "composability" have specific meaning in the world of Service-Oriented Architecture (SOA), where they tend to imply a unit of functionality, rather than a unit of information or simply a container.
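The "Styles" entry above mentions conversion utilities that map Word or Quark styles to repository elements. A minimal sketch of that idea follows. It assumes the document has already been parsed into (style name, text) pairs by some other tool, and the style names, element names, and `convert` function are all hypothetical.

```python
# Hypothetical mapping from authoring-tool style names to repository elements.
STYLE_TO_ELEMENT = {
    "Title": "headline",
    "Subtitle": "summary",
    "Body Text": "body",
}


def convert(paragraphs):
    """Group (style_name, text) pairs into elements.

    Returns a tuple: (element name -> joined text, list of unmapped styles).
    """
    collected = {}
    unmapped = []
    for style, text in paragraphs:
        element = STYLE_TO_ELEMENT.get(style)
        if element is None:
            # A style with no structural meaning assigned; report it.
            unmapped.append(style)
            continue
        collected.setdefault(element, []).append(text)
    elements = {name: "\n".join(parts) for name, parts in collected.items()}
    return elements, unmapped
```

Reporting unmapped styles, rather than silently dropping them, reflects the risk in overloading a layout convention with structural meaning: authors will use styles the mapping never anticipated.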

Extending Types and Elements

If you're focusing on the web, you should recognize that similar confusion can arise over structured display templates as well. As websites get larger and more complex, and enterprises want to manage multiple different sites from within a single application, the need arises for more object-oriented or "nested" template structures. So what do you call the pieces of templates that you might assemble into a full template? We'd say call them template elements. Some vendors will call them code snippets, or template fragments, or just templates that you assemble into larger templates. I find that all of those terms quickly become confusing.
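To illustrate nested template structures, the sketch below assembles a full template from shared template elements using Python's standard `string.Template`. The template element names (`header`, `footer`), the markup, and the `render` function are assumptions chosen for the example.

```python
from string import Template

# Hypothetical template elements shared across several sites.
TEMPLATE_ELEMENTS = {
    "header": Template("<header>$site_name</header>"),
    "footer": Template("<footer>(c) $site_name</footer>"),
}

# A full template that nests the template elements above by name.
PAGE = Template("$header\n<main>$content</main>\n$footer")


def render(site_name: str, content: str) -> str:
    """Render each template element, then slot the results into the page."""
    parts = {
        name: template.substitute(site_name=site_name)
        for name, template in TEMPLATE_ELEMENTS.items()
    }
    return PAGE.substitute(content=content, **parts)
```

Calling `render` with a different `site_name` re-uses the same template elements for another site, which is the kind of multi-site re-use that motivates nested templates in the first place.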

Also, remember never to confuse Document Types or Elements with the instances of those containers, which are individual content items. Different systems will refer to those items as documents, pages, or records. I find "item" is the simplest and most technology- and format-neutral term.
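The container-versus-instance distinction can be sketched as a type definition plus a check that a given item fits it. The `ARTICLE_TYPE` definition and the `validate` function below are hypothetical and kept deliberately small.

```python
# The type is the container definition (hypothetical element names).
ARTICLE_TYPE = {
    "required": {"headline", "body"},
    "optional": {"byline"},
}


def validate(item: dict, content_type: dict) -> list:
    """Return a list of problems; an empty list means the item fits the type."""
    present = set(item)
    allowed = content_type["required"] | content_type["optional"]
    problems = [
        f"missing element: {name}"
        for name in sorted(content_type["required"] - present)
    ]
    problems += [
        f"unknown element: {name}"
        for name in sorted(present - allowed)
    ]
    return problems
```

Each dictionary passed as `item` is one content item, and many items can share the single `ARTICLE_TYPE` definition.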

Standardize, Regardless

In sum, we like to use types, elements, and items in part because they are the least overloaded terms and hence least likely to cause confusion. Don't underestimate the value of that: there is already much else to confuse you in the world of content technologies.

Ultimately, though, it doesn't really matter which terms your enterprise settles on -- just standardize on something so you can communicate effectively with internal and external stakeholders.

Tip of the hat to Bob Boiko, author of the Content Management Bible, for first describing the value of Types and Elements for me.





About the Author

Tony Byrne

Tony is Founder of CMS Watch, a vendor-neutral analyst firm that evaluates content technologies and publishes reports comparing different solutions head-to-head. Tony serves as executive editor of all CMS Watch evaluation reports, each available for sale on this site.





