A Lexicon for Document Analysis
by Tony Byrne
12-Aug-2007

One of the challenges of any content technology project is standardizing on a particular set of terminology. Without that, you risk confusion among the business users, developers, managers, vendor staff, and consultants who may all participate in your project.
And one area where I see a lot of different terminology is in the domain of content analysis. Content analysis is becoming increasingly important, particularly on the web, where more enterprises want to take advantage of the latent structure of some of their online information to better re-use information across locales and channels.
It's particularly important to agree on what you will call document types and the elements that make up structured document types. At CMS Watch we use the term "type" to define a particular document model or structure, and "element" to describe the constituent pieces that make up a content type. Of course some or even most content types are not structured and therefore have no elements (or contain, essentially, a single "body" element).
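To make the distinction concrete, here is a minimal Python sketch of a type and its elements. The class names (`ContentType`, `Element`) and the sample "press release" and "memo" types are hypothetical illustrations, not drawn from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One constituent piece of a structured content type (hypothetical)."""
    name: str
    required: bool = True


@dataclass(frozen=True)
class ContentType:
    """A document model: a named, ordered set of elements (hypothetical)."""
    name: str
    # Assumption: an unstructured type is modeled as a single "body" element.
    elements: tuple = (Element("body"),)

    def is_structured(self) -> bool:
        # A type holding only the default "body" element is effectively unstructured.
        return [e.name for e in self.elements] != ["body"]


press_release = ContentType(
    "press_release",
    (
        Element("headline"),
        Element("dateline"),
        Element("body"),
        Element("boilerplate", required=False),
    ),
)
memo = ContentType("memo")  # no declared elements, so just one "body"
```

In this sketch `press_release.is_structured()` is true and `memo.is_structured()` is false, which mirrors the structured versus unstructured distinction described above.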
A Simple Thesaurus
Since the electronic publishing industry is still young, myriad analysts, consultants, and vendors use their own terminology, which can make things very confusing. Here's a chart of some commonly used synonyms or approximations for type and element.
| "Types" | "Elements" |
| "Classes" | "Snippets" |
| "Objects"* | "Objects"* |
| "Documents / Pages" | "Nodes / Fields" |
| "Templates" | "Pagelets / Portlets" |
| "Chunks"* | "Chunks"* |
| "Archetypes" | "Styles" |
| "Models" | "Bricks / Parts / Fragments" |
| | "Placeholders" |
| | "Components" |
* To make things even more confusing, sometimes different analysts will use the same word to mean different things. Indeed some will argue that types and elements are really just part of a theoretically limitless continuum of structure in your overall information architecture, and thus it's reasonable to employ the same term for both concepts.
Synonyms for Content Types
Defining effective Content or Document Types lies at the core of all Web Content / Document / Records Management. Many subsystems within a management application will pivot off Content Types, including workflow, access control, templating, and classification.
But, what to call them? In addition to types, you'll come across many other synonymous or loosely-related terms.
"Classes" -- helpfully implies that you have classified your types, but carries significant baggage as a specific software term.
"Objects" -- a nice catch-all term that nonetheless can confuse developers if the type does not enable true object-oriented attributes like inheritance.
"Documents / Pages" -- typically understood as a particular instance of a type, and does not imply a standard model for a uniform collection of content items.
"Templates" -- a useful term that, however, most people associate with presentation models, rather than structural types. Many vendors will still use the term templates to imply "logical" templates, which is fine in traditional software development, but gets too confusing in the Web world, where templates are associated with layout or at least look-and-feel.
"Chunks" -- more commonly used to describe elements (see below).
"Archetypes" -- a more arcane alternative to types, but the reference to a standardized model perhaps makes it clearer.
"Models" -- an accurate term, inasmuch as the analytical work of defining a content type is often referred to as "content modeling." However, I find the term is sometimes too abstract and otherworldly for end-users. It's also confusing because in the information architecture world, "model" usually refers to a higher-level representation of a collection of content or a broader information set, not just an individual document. For example, a "repository model" could include the folder directory and probably the classification scheme as well.
Synonyms for Content Elements
Content elements represent the constituent pieces of a structured content type. By breaking a document into its standard elements, you can do useful things with those elements, such as re-use them elsewhere, or define different formats for them on a web page. Many different consultants and vendors will employ handy -- but differing -- terms for elements.
"Snippets" -- often used to refer to discrete blocks of HTML (or code of any kind), the term is common lingo for some Web CMS developers and vendors.
"Objects" -- see above. Used by many consultants and content management vendors.
"Nodes / Fields" -- techie terms that come out of, respectively, the XML and relational database worlds. Nodes and fields may be technically accurate for how the data is persisted, but they just don't feel like content. Moreover, in practice a content element might span multiple nodes and fields in a data store. To be fair, element has specific meaning in the XML world and in fact is often confused with node among XML practitioners.
"Pagelets / Portlets" -- Pagelets are a nice extension of the Web metaphor, but inaccurate if you are publishing to non-web devices; nevertheless, the term is used by some Web CMS vendors. Portlets represent elements within the world of (primarily Java-based) Portal software, but also refer to a particular specification around accessing remote content or services from within a portal framework.
"Styles" -- a potential misnomer from desktop and web publishing where "styles" are used to standardize the appearance of certain elements. For example, some content management or transformation utilities can map Word or Quark styles to specific repository elements in content conversion, essentially overloading a layout convention with structural meaning. The CMS underneath this website (Midgard) uses the term styles for elements.
"Chunks" -- a favored term of information architects and end-users alike, useful because it implies elements of varying size and purpose. Also conveniently morphs into the verb, "chunking."
"Bricks / Parts / Fragments" -- harkens back to the notion of page "assembly": manufacturing vernacular comes to e-publishing. Microsoft even calls its re-usable elements "Web Parts," although here they are more of an analogue to "portlets." "Bricks" are similar to "chunks," but imply you want to stack them, which is not always the case. "Fragments," like snippets, tend to imply a truncated subset of code.
"Placeholders" -- a nice, user-friendly term implying a "fillable" element, sometimes used by Microsoft and other vendors.
"Components" -- a popular and useful term, suggestive of document "decomposition" and "recomposition." The problem here is that components and "composability" have specific meaning in the world of Service-Oriented Architecture (SOA), where they tend to imply a unit of functionality, rather than a unit of information or simply a container.
Extending Types and Elements
If you're focusing on the web, you should recognize that similar confusion can arise over structured display templates as well. As websites get larger and more complex, and enterprises want to manage multiple different sites from within a single application, the need arises for more object-oriented or "nested" template structures. So what do you call the pieces of templates that you might assemble into a full template? We'd say call them template elements. Some vendors will call them code snippets, or template fragments, or just templates that you assemble into larger templates. I find that all of those terms quickly become confusing.
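As a rough illustration of assembling template elements into a full template, here is a small Python sketch built on the standard library's `string.Template`. The element names, markup, and the `assemble` helper are all hypothetical; real Web CMS templating systems differ considerably.

```python
from string import Template

# Hypothetical template elements: small, reusable pieces of presentation.
TEMPLATE_ELEMENTS = {
    "header": Template("<header>$site_name</header>"),
    "article": Template("<article><h1>$headline</h1>$body</article>"),
    "footer": Template("<footer>$copyright</footer>"),
}


def assemble(element_names, values):
    """Build a full page by rendering each named template element in order."""
    return "\n".join(
        TEMPLATE_ELEMENTS[name].substitute(values) for name in element_names
    )


# A "full template" here is just an ordered list of template elements,
# so two sites could share the same footer while using different headers.
page = assemble(
    ["header", "article", "footer"],
    {
        "site_name": "Example Site",
        "headline": "Hello",
        "body": "<p>Text</p>",
        "copyright": "2007 Example Corp.",
    },
)
```

Keeping the pieces in one named registry is what makes the vocabulary matter: everyone on the project needs to agree on what a "header" template element is before it can be shared.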
Also, remember never to confuse Document Types or Elements with the instances of those containers, which are individual content items. Different systems will refer to those items as documents, pages, or records. I find item is the simplest and most technology- and format-neutral term.
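The difference between a type and its items can be sketched in a few lines of Python. The names here (`ContentType`, `Item`, `missing_elements`, and the "news_article" example) are hypothetical and exist only to illustrate the vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentType:
    """The container definition: a name plus the elements it expects."""
    name: str
    element_names: tuple


@dataclass
class Item:
    """One individual piece of content, an instance of a type."""
    content_type: ContentType
    values: dict

    def missing_elements(self):
        # Elements the type defines but this item has not filled in.
        return [n for n in self.content_type.element_names if n not in self.values]


news = ContentType("news_article", ("headline", "byline", "body"))

# Two items share one type; the type itself holds no content.
item_a = Item(news, {"headline": "Launch", "byline": "Staff", "body": "..."})
item_b = Item(news, {"headline": "Update", "body": "..."})
```

Here `news` is the type, "headline", "byline", and "body" are its elements, and `item_a` and `item_b` are items. Calling `item_b.missing_elements()` reports the one element that item has not yet supplied.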
Standardize, Regardless
In sum, we like to use types, elements, and items in part because they are the least overloaded terms and hence least likely to cause confusion. Don't underestimate the value of that: there is already much else to confuse you in the world of content technologies.
Ultimately, though, it doesn't really matter which terms your enterprise settles on -- just standardize on something so you can communicate effectively with internal and external stakeholders.
Tip of the hat to Bob Boiko, author of The CMS Bible, for first describing the value of Types and Elements for me.


