
Web Content Migration

Migration Tools and Techniques

by Reza Haniph
14-Apr-2004



The emergence of web content migration as a specialty discipline may come as a surprise to many readers. It's rarely mentioned by CMS software vendors, and typically shows up as one or two sub-bullets on an RFP. After all, how hard can it be to move your legacy content into your new CMS?

How Hard Can it Be?

When it comes to content, appearances can be deceiving. Most content is riddled with errors, especially in markup, but modern web browsers are very forgiving. Web content leads a double life. From the outside, the look and feel is crisp and polished. But under the surface, we find code mingling with carefully crafted prose, navigational structures intertwined with script logic fed to the browser, and spacer elements introduced as visual artifices. All of this can become a jumbled mess.

Let's take a simple example, a page from the NASA site: Mars rover news.

  • Deciding what's "presentation" and what's "content" is difficult. Do you leave or extract the "NASA News" logo? What about the legal disclaimer? Or the link to the text-only version of the content?
  • There are internal and external links on the page. Are some of the internal links changing? Should they be changed when migrated? Do the external links still work?
  • Do you keep the contact information? Or will that change dynamically as roles are updated, people are promoted or departments are restructured?
  • What are the content elements you need to capture for your CMS? Is the release information needed or is the title sufficient? Do you capture the content as one giant block or will you break it into paragraphs?
  • Do you keep the in-line bold formatting? Or will your CMS add the formatting?
  • What metadata will you capture for this piece of content? Can you be sure your content contributors and managers will tag your content consistently with metadata when migrating?
  • Is the content well-formed (no open tags or incorrect nesting of tags)?
  • Although not shown in this simple example, images present an additional layer of complexity. Sometimes they come with titles, but this may vary from page to page. Check out this example, also from NASA.

Companies that have gone through a complex CMS implementation will tell you: migration is hard. In this article, we've blended the observations of a migration solutions developer and a taxonomy strategist, because it takes a broad skill set to deliver a migration solution. Our viewpoints are based on what we've learned from multiple companies undergoing migrations in the past two years.

We'll explore two areas in this article that help web content migrations flow better and faster: taxonomies and technology. Taxonomies are very much in vogue now because of a general realization that they can attach badly needed importance and relevance to your data. They can also help you structure your migration by facilitating the mapping of information from your old information structures to your new site. Technology can help speed up the migration process. Our emphasis will be on the practical: small things that can help your migration in a big way.

A Practical Taxonomy

Taxonomies are inherently critical to successful content management. They define your content and how users will interact with that information. Taxonomies also provide the "hooks" that allow you to easily retrieve the content that will be stored in your CMS.

Taxonomies also become important in defining how you structure your migration. Pages that share the same facet of a taxonomy (e.g. press releases, product information pages) frequently share the same structure. You can take advantage of this structure when you migrate: everything from designing extraction rules to tagging content becomes easier.

So it's not surprising that we'd like to see the taxonomies addressed much earlier in the migration process. We also stress a viewpoint that's "built to last", with a particular focus on how customers will use the information in your CMS. Without early planning, teams frequently emerge from a content migration with an informal or implicit taxonomy that doesn't match the needs of the content they started with. Without a formal target structure, migration staffers often fall back to de facto taxonomies that become a "bucket per document," where each object that is migrated becomes its own unique category. These static, flat, and monolithic taxonomies are complex and cumbersome to manage, and typically don't work well for content consumers, either.

Keeping it Small and Flexible

At the same time, the launch of a new CMS should not impel you to create the perfect taxonomy. What's needed is a smaller, flexible data model. Not only will this aid your site visitors, but it allows you to migrate your content consistently.

The way to do this is to tease out the dimensions, or "facets" that need to be examined, and represent them as a single data model. Examples of common facets include:

  • market segments,
  • products,
  • organizations,
  • locations,
  • target markets.

With careful analysis, you can cover multiple permutations without resorting to a single and potentially very cumbersome master taxonomy. Take the wine.com example used by Peter Morville in his article on "The Speed of Information Architecture". Using five stable facets to describe wine (type, region, winery, price and rating) leads to almost 20 million possible combinations.
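To see why a handful of facets covers so much ground, here is a minimal Python sketch. The facet names and values are invented for illustration and are not wine.com's actual lists; the point is only that the number of possible tag combinations is the product of the facet sizes, so it grows very quickly as facets are added.

```python
from itertools import product
from math import prod

# Hypothetical facet values, for illustration only.
facets = {
    "type": ["red", "white", "sparkling"],
    "region": ["Napa", "Bordeaux", "Rioja", "Barossa"],
    "price": ["under $20", "$20-$50", "over $50"],
}

def combination_count(facets):
    """Number of distinct tag combinations: the product of the facet sizes."""
    return prod(len(values) for values in facets.values())

def all_combinations(facets):
    """Yield every combination as a dict mapping facet name to one value."""
    names = list(facets)
    for values in product(*(facets[name] for name in names)):
        yield dict(zip(names, values))

print(combination_count(facets))  # 3 * 4 * 3 = 36
```

Each facet stays short enough to maintain by hand, while the combinations never have to be enumerated or managed as a single master list.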

You can build flexibility by finding facets that are more stable. States and countries are stable. Sales regions are not. Industries are stable. The decision to market high-tech, financial services and telecoms this month is not. Come up with the stable component pieces, then use these to determine the more dynamic lists.

So how many facets do we normally see in organizations? Most organizations need between 4 and 10 facets. The four we see a lot are products and services, locations, organizations (customers, competitors, etc.) and content-types. These four facets frequently correspond to groups of content on your legacy site, making the tagging process easier during a migration. For example, your legacy content may already be grouped on your site under press releases in a specific country.
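When legacy content is already grouped this way, some facet values can often be read straight from the old URL. The sketch below assumes a hypothetical legacy layout (a two-letter country folder followed by "press" or "products"); the patterns, facet names and paths are invented for illustration and would need to be replaced with rules drawn from your own site.

```python
import re

# Hypothetical rules: each pairs a legacy URL pattern with fixed facet values.
# Named groups in the pattern become additional facet values.
RULES = [
    (re.compile(r"^/(?P<country>[a-z]{2})/press/"),
     {"content_type": "press release"}),
    (re.compile(r"^/(?P<country>[a-z]{2})/products/(?P<product>[^/]+)/"),
     {"content_type": "product information"}),
]

def tag_from_path(path):
    """Return facet tags inferred from a legacy URL path, or {} if no rule matches."""
    for pattern, fixed_tags in RULES:
        match = pattern.match(path)
        if match:
            tags = dict(fixed_tags)
            tags.update(match.groupdict())
            return tags
    return {}

print(tag_from_path("/uk/press/2004/launch.html"))
```

Pages that come back with an empty result are the ones that need a human decision, which keeps manual tagging focused on the exceptions.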

Understand How Users Search

Remember that people generally don't search on the web via topic/subject, because they've been trained to understand that a general search yields poor results that are too abstract and contain too many hits. The more general ("I want press releases about personal computers") is less important than the specific ("I want press releases on the Pentium 4").

Working with a company's call center, we found that most of their operators were searching for content using two facets (product and content-type) to answer specific customer questions. They were either searching for specific product information or looking for best practices, technical definitions, forms, manuals and FAQs.

This caused us to home in on the general migration rules around the product and content-type facets. These rules included what content to extract, transformation rules and specifications for the CMS content import process. It also helped with other CMS design issues, including the format of content/design templates and the development of approval/review workflows.

Develop a Taxonomy Change Process

Assume that the taxonomy you put in place during the design phase will need to change as you migrate and test. Users will invariably find documents and content that can't fit into a taxonomy bucket. Having a process in place to manage the additions and changes to the taxonomy as well as a technical approach to making changes is critical during your migration.

Using Practical Technology Solutions

As ever, the technology decisions you make for your migration come down to selecting the right tool for the right job. Migration tools can amplify your processes, and remove critical bottlenecks, but cannot replace the human touch: what constitutes high-value content, how content needs to be decomposed, what users will really extract from a page. However, we've never come across a well-designed migration program that was not assisted by some level of automation.

Speed up Your Migration

The economics of a technology-assist are compelling. A human operator can, at most, extract content from 20 - 40 pages/day. However, even this content may need to be processed further to correct links and clean up the encoding. We've seen extraction technology that achieves speeds of more than 1,000 pages/day, albeit in nearly ideal conditions.
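As a rough illustration of what those rates mean for a schedule, the arithmetic can be sketched in a few lines of Python. The 4,000-page site size is a hypothetical figure, and the calculation assumes a single operator or a single extraction run working at a constant rate.

```python
import math

def days_needed(pages, pages_per_day):
    """Whole working days needed to process `pages` at a constant daily rate."""
    return math.ceil(pages / pages_per_day)

PAGES = 4000  # hypothetical site size

manual_best = days_needed(PAGES, 40)    # one operator at the top of the manual range
manual_worst = days_needed(PAGES, 20)   # one operator at the bottom of the range
automated = days_needed(PAGES, 1000)    # extraction tool in near-ideal conditions

print(manual_best, manual_worst, automated)  # 100 200 4
```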

Don't expect complete automation. Many tasks are still dependent on a manual process. Content analysis, the design of your site navigation, template design and building a review/approval workflow still require a skilled specialist. However, using extraction technology can remove tasks that are both highly error prone and time-consuming (see below).

Automate Consistency

Automation allows you to set rules against many pages at once when conducting your extraction. Some tools are also able to automatically identify the objects that can be classified as "presentation" to aid in the template design process. This level of consistency can save migration teams many weary nights reviewing and cleaning up the results of manual extractions.

As an example, we heard from a migration team monitoring the progress of an off-shore extraction effort. There were many variations in page styles, so rather than write specific rules for each page, the team gave each off-shore resource general rules for the site. Problems arose as these rules were interpreted differently. For example, some people correctly identified and extracted "related link" elements, while others thought they were part of the presentation elements (and didn't extract them as content). An automated process would have grouped content into related "buckets", and allowed the extraction team to specify rules against each "bucket".
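One simple way to approximate this kind of bucketing is to group pages by the set of markup elements they use. The Python sketch below is a crude heuristic built on the standard library, not a description of how any particular migration tool works; the page names and markup are invented for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

class _Skeleton(HTMLParser):
    """Records each opening tag (plus its class attribute, if any), ignoring text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.parts.append(f"{tag}.{css_class}" if css_class else tag)

def signature(html):
    """Distinct tags in order of first appearance, so repeat counts do not matter."""
    parser = _Skeleton()
    parser.feed(html)
    parser.close()
    return tuple(dict.fromkeys(parser.parts))

def bucket_pages(pages):
    """Group page names by signature. `pages` maps a page name to its HTML source."""
    buckets = defaultdict(list)
    for name, html in pages.items():
        buckets[signature(html)].append(name)
    return dict(buckets)

# Hypothetical pages: two press releases and one product page.
pages = {
    "pr1.html": "<html><body><h1 class='pr'>A</h1><p>x</p><p>y</p></body></html>",
    "pr2.html": "<html><body><h1 class='pr'>B</h1><p>z</p></body></html>",
    "prod.html": "<html><body><table><tr><td>spec</td></tr></table></body></html>",
}
for members in bucket_pages(pages).values():
    print(members)
```

The two press releases land in one bucket even though they have different numbers of paragraphs, so a single extraction rule can be written and reviewed once for the whole group.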

Automate Encoding Conversion

Most source content contains a mixture of encodings. An encoding specifies the byte(s) that are used to represent each character. This issue is sometimes referred to as "international support" or "multi-byte characters". Source content frequently needs to be converted from many different encodings to a common encoding. A typical choice for a CMS is UTF-8, based on the Unicode Standard. Again, people working manually may struggle to ensure that character encodings are transformed consistently.
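A minimal Python sketch of automated conversion follows. It assumes the legacy content is either UTF-8 or Windows-1252; that candidate list is an assumption and should be replaced with the encodings actually present on your site. Trial decoding is only a heuristic: a wrong encoding can sometimes decode without error, so results still deserve a spot check.

```python
# Hypothetical candidate list. Order matters: the first successful decode wins,
# so the strictest encoding (UTF-8) is tried first.
CANDIDATES = ["utf-8", "cp1252"]

def to_utf8(raw, candidates=CANDIDATES):
    """Decode bytes with the first candidate that works, then re-encode as UTF-8.

    Returns a tuple of (utf8_bytes, encoding_used).
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8"), encoding
    raise ValueError("none of the candidate encodings could decode the content")

legacy_bytes = "caf\u00e9".encode("cp1252")  # stand-in for a legacy page
converted, detected = to_utf8(legacy_bytes)
print(detected)  # cp1252
```

Recording which encoding was used for each page gives reviewers a short list of files to inspect when characters look wrong after import.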

Automate Link Handling

Content contains many links to other places. These links are an essential part of the migrated content. Managing the interplay between links as pages are migrated becomes excessively complex as the size of a migration increases. The migration team needs to understand how links will be remapped to their new location within the CMS and how existing links on non-migrated content will change (e.g. how to handle page redirects).

This is not a trivial task. When asked about the one thing he wished he could change during his migration of the American Speech-Language-Hearing Association (ASHA) site, David Tauriello quickly responded with "links". David and his team migrated over 4,000 pieces of ASHA content over a 4-month period last year. They were forced to rely on a manual link solution. Nowadays automated solutions have features that allow the easy handling of link transformation as content is migrated.
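The core of link remapping is a lookup table from old locations to new ones, applied to every migrated page. The Python sketch below uses a hypothetical `URL_MAP` and invented paths, and it finds links with a regular expression, which is adequate for illustration but less robust than a real HTML parser. It also reports internal links that have no known new location, since those are the ones that will break or need redirects.

```python
import re

# Hypothetical mapping from legacy URLs to their new CMS locations.
URL_MAP = {
    "/news/2004/rover.html": "/cms/press-releases/rover",
    "/about/contact.htm": "/cms/contact",
}

# Matches href="..." or href='...' and captures the quote style and the URL.
HREF = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)

def remap_links(html, url_map):
    """Rewrite href values found in url_map.

    Returns (new_html, unmapped), where unmapped lists internal links
    (those starting with "/") that have no entry in url_map.
    """
    unmapped = []

    def replace(match):
        prefix, quote, url = match.groups()
        if url in url_map:
            url = url_map[url]
        elif url.startswith("/"):
            unmapped.append(url)
        return f"{prefix}{quote}{url}{quote}"

    return HREF.sub(replace, html), unmapped

page = '<a href="/news/2004/rover.html">Rover</a> <a href="/old/page.html">Old</a>'
new_page, unmapped = remap_links(page, URL_MAP)
print(new_page)
print(unmapped)
```

External links pass through unchanged here; checking whether they still resolve is a separate step.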

Automate Content Cleaning

HTML-formatted content is almost never clean. It's always easy to make mistakes in HTML, and many times these mistakes are amplified as formatting in presentation elements is re-used. HTMLTidy is the utility of choice here. From missing or mismatched tags to the incorrect placement of tags, Tidy quickly corrects markup. This is an essential part of any migration team's toolkit.
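Before running a cleaner, it can help to know which pages are affected and how badly. The following Python sketch is a strict nesting check built on the standard library; it is not Tidy's algorithm and does not repair anything. It is deliberately stricter than HTML itself, so it will also flag end tags that HTML allows you to omit (for example `</p>` or `</li>`).

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class NestingChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatches as they are found."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.stack:
            self.problems.append(f"closing </{tag}> with no matching open tag")
            return
        # Anything opened after the matching tag was left unclosed.
        while self.stack[-1] != tag:
            self.problems.append(f"<{self.stack.pop()}> not closed before </{tag}>")
        self.stack.pop()

def find_problems(html):
    """Return a list of human-readable nesting problems; empty if none were found."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"<{tag}> never closed" for tag in checker.stack]

print(find_problems("<p><b>bold</p>"))
```

Run across a whole site, a report like this shows which templates produce the most broken markup, which is useful when deciding where to fix the source rather than every page.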

Justify your Investment

Technology should always be examined in light of the alternatives. In this case, the alternative is usually manual migration. Costs for a manual migration frequently average $20 - $40 per page, when all expenses are considered. These numbers come from our customers who have completed manual migrations. The figure includes training time, cutting and pasting page information into the correct fields in the CMS database, creating CMS database records, cleaning up broken HTML, remapping links, adding suitable metatags, correcting any encoding problems, correcting any mistakes (this is a complex manual process) and review time. Automated solutions are always cheaper than the manual alternatives. The main benefit is an immediate cost reduction compared to manual methods (on the order of 33% to 50% cheaper).

In addition, you'll get the acceleration of benefits from the CMS. For example, if a technology-assist gets your content migrated six months earlier, that's six months' worth of extra CMS-related benefits you'll reap. In many cases, this benefit may be even larger than the total cost of the migration technology.
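The per-page figures above translate into a simple cost comparison. In this Python sketch the 4,000-page site size is hypothetical; the $20 - $40 per-page range and the 33% to 50% saving are the figures quoted in this section. It does not attempt to model the value of reaching CMS benefits sooner.

```python
def manual_cost(pages, cost_per_page):
    """Total manual migration cost in dollars."""
    return pages * cost_per_page

def automated_cost(manual, saving_percent):
    """Cost after a percentage saving relative to the manual figure."""
    return manual * (100 - saving_percent) / 100

PAGES = 4000  # hypothetical site size

for cost_per_page in (20, 40):
    manual = manual_cost(PAGES, cost_per_page)
    best = automated_cost(manual, 50)   # 50% cheaper
    worst = automated_cost(manual, 33)  # 33% cheaper
    print(f"${cost_per_page}/page: manual ${manual:,}, "
          f"automated ${best:,.0f} to ${worst:,.0f}")
```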

Develop a Menu of Options

The migration team should look at providing a menu of service options and prices, so that business units can understand what they are paying for. Business units can then determine the appropriate level of service at a price they can afford. The key driver may well be your content authors. These are expensive and highly-leveraged resources that also have day-jobs. We've found that some smaller sites can invest the time of their content authors to re-craft their sites within the CMS using existing templates. Other, much larger sites need help migrating their content, and can't divert content authors as readily.

Joseph Busch contributed to this article. He is the founder and principal of Taxonomy Strategies, a firm specializing in the development of metadata frameworks and taxonomy strategies for Global 2000 companies, government agencies and NGOs.





About the Author

Reza Haniph

Reza Haniph is a co-founder of Nahava, a provider of automated content migration solutions. Prior to Nahava, Reza worked with a software firm specializing in content management systems.


