Web Content Migration
Migration Tools and Techniques
by Reza Haniph
14-Apr-2004

The emergence of web content migration as a specialty discipline
may come as a surprise to many readers. It's rarely mentioned by CMS software
vendors, and typically shows up as one or two sub-bullets on an RFP. After
all, how hard can it be to move your legacy content into your new CMS?
How Hard Can it Be?
When it comes to content, appearances can be deceiving. Most content is riddled with errors, especially in markup, but modern web browsers are very forgiving. Web content leads a double life. From the outside, the look and feel is crisp and polished. But under the surface, we find code mingling with carefully crafted prose, navigational structures intertwined with script logic fed to the browser, and spacer elements introduced as visual artifices. All of this can become a jumbled mess.
Let's take a simple example, a page from the NASA site: Mars rover news.
- Deciding what's "presentation" and what's "content" is difficult. Do you leave or extract the "NASA News" logo? What about the legal disclaimer? Or the link to the text-only version of the content? There are internal and external links on the page. Are some of the internal links changing? Should they be changed when migrated? Do the external links still work?
- Do you keep the contact information? Or will that change dynamically as roles are updated, people are promoted or departments are restructured?
- What are the content elements you need to capture for your CMS? Is the release information needed or is the title sufficient? Do you capture the content as one giant block or will you break it into paragraphs?
- Do you keep the in-line bold formatting? Or will your CMS add the formatting?
- What metadata will you capture for this piece of content? Can you be sure your content contributors and managers will tag your content consistently with metadata when migrating?
- Is the content well-formed (no open tags or incorrect nesting of tags)?
- Although not shown in this simple example, images present an additional layer of complexity. Sometimes they come with titles, but this may vary from page to page. Check out this example, also from NASA.

Companies that have gone through a complex CMS implementation will tell you: migration is hard. In this article, we've blended the observations of a migration solutions developer and a taxonomy strategist, because it takes a broad skill set to deliver a migration solution. Our viewpoints are based on what we've learned from many companies undergoing migrations in the past two years.
We'll explore two areas in this article that help web content migrations flow better and faster: taxonomies and technology. Taxonomies are very much in vogue now, because of a general realization that they can attach badly-needed importance and relevance to your data, but taxonomies can also help you structure your migration by facilitating the mapping of information from your old information structures to your new site. Technology can help speed up the migration process. Our emphasis will be on the practical: small things that can help your migration in a big way.
A Practical Taxonomy
Taxonomies are inherently critical to successful content management. They define your content and how users will interact with that information. Taxonomies also provide the "hooks" that allow you to easily retrieve the content that will be stored in your CMS.
Taxonomies also become important in defining how you structure your migration. Pages that share the same facet of a taxonomy (e.g. press releases, product information pages) frequently share the same structure. You can take advantage of this structure when you migrate: everything from designing extraction rules to tagging content becomes easier.
So it's not surprising that we'd like to see the taxonomies addressed much earlier in the migration process. We also stress a viewpoint that's "built to last", with a particular focus on how customers will use the information in your CMS. Without early planning, teams frequently emerge from a content migration with an informal or implicit taxonomy that doesn't match the needs of the content they started with. Without a formal target structure, migration staffers often fall back to de facto taxonomies that become a "bucket per document," where each object that is migrated becomes its own unique category. These static, flat, and monolithic taxonomies are complex and cumbersome to manage, and typically don't work well for content consumers, either.
Keeping it Small and Flexible
At the same time, the launch of a new CMS should not impel you to create the perfect taxonomy. What's needed is a smaller, flexible data model. Not only will this aid your site visitors, but it allows you to migrate your content consistently.
The way to do this is to tease out the dimensions, or "facets" that need to be examined, and represent them as a single data model. Examples of common facets include:
- market segments,
- products,
- organizations,
- locations,
- target markets.
With careful analysis, you can cover multiple permutations without resorting to a single and potentially very cumbersome master taxonomy. Take the wine.com example used by Peter Morville in his article on "The Speed of Information Architecture". Using five stable facets to describe wine (type, region, winery, price and rating) leads to almost 20 million possible combinations.
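To see why a handful of facets goes so far, it helps to run the arithmetic. The sketch below multiplies facet sizes to count the possible combinations and adds them to count the terms someone actually has to maintain. The facet sizes are invented for illustration only; they are not wine.com's real figures, and the function names are our own.

```python
from math import prod

# Hypothetical facet sizes, chosen only to illustrate the arithmetic.
# These are NOT wine.com's actual numbers.
facet_sizes = {
    "type": 10,
    "region": 20,
    "winery": 500,
    "price": 10,
    "rating": 20,
}


def combination_count(sizes):
    """Distinct combinations when one value is picked from each facet."""
    return prod(sizes.values())


def term_count(sizes):
    """Total number of terms that have to be defined and maintained."""
    return sum(sizes.values())


print(combination_count(facet_sizes))  # 20000000
print(term_count(facet_sizes))         # 560
```

With these assumed sizes, a few hundred maintained terms describe millions of combinations, which is the point of keeping facets separate rather than enumerating every category in one master list.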
You can build flexibility by finding facets that are more stable. States and countries are stable. Sales regions are not. Industries are stable. The decision to market high-tech, financial services and telecoms this month is not. Come up with the stable component pieces, then use these to determine the more dynamic lists.
So how many facets do we normally see in organizations? Most organizations need between 4 and 10 facets. The four we see a lot are products and services, locations, organizations (customers, competitors, etc.) and content-types. These four facets frequently correspond to groups of content on your legacy site, making the tagging process easier during a migration. For example, your legacy content may already be grouped on your site under press releases in a specific country.
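When legacy content is already grouped this way, the grouping itself can seed the tags. The following is a minimal sketch, assuming the legacy site encodes country and content-type as directory names in its URLs. The mapping table, the example.com URL and the function name are all hypothetical; a real site would need its own table.

```python
from urllib.parse import urlparse

# Hypothetical mapping from legacy directory names to (facet, value) pairs.
SEGMENT_TO_TAG = {
    "press": ("content-type", "press release"),
    "products": ("content-type", "product information"),
    "uk": ("location", "United Kingdom"),
    "de": ("location", "Germany"),
}


def tags_from_url(url, mapping=SEGMENT_TO_TAG):
    """Derive facet tags from the path segments of a legacy URL."""
    segments = [s.lower() for s in urlparse(url).path.split("/") if s]
    tags = {}
    for segment in segments:
        if segment in mapping:
            facet, value = mapping[segment]
            tags[facet] = value
    return tags


print(tags_from_url("http://www.example.com/uk/press/2004/launch.html"))
# {'location': 'United Kingdom', 'content-type': 'press release'}
```

Tags derived this way are only a starting point. Pages that return an empty or partial result are the ones to route to a person for review.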
Understand How Users Search
Remember that people generally don't search on the web via topic/subject, because they've been trained to understand that a general search yields poor results that are too abstract and contain too many hits. The general query ("I want press releases about personal computers") is less important than the specific one ("I want press releases on the Pentium 4").
Working with a company's call center, we found that most of their operators were searching for content using two facets (product and content-type) to answer specific customer questions. They were either searching for specific product information or looking for best practices, technical definitions, forms, manuals and FAQs.
This caused us to home in on the general migration rules around the product and content-type facets. These rules included what content to extract, transformation rules and specifications for the CMS content import process. It also helped with other CMS design issues, including the format of content/design templates and the development of approval/review workflows.
Develop a Taxonomy Change Process
Assume that the taxonomy you put in place during the design phase will need to change as you migrate and test. Users will invariably find documents and content that can't fit into a taxonomy bucket. Having a process in place to manage the additions and changes to the taxonomy as well as a technical approach to making changes is critical during your migration.
Using Practical Technology Solutions
As ever, the technology decisions you make for your migration come down to selecting the right tool for the right job. Migration tools can amplify your processes, and remove critical bottlenecks, but cannot replace the human touch: what constitutes high-value content, how content needs to be decomposed, what users will really extract from a page. However, we've never come across a well-designed migration program that was not assisted by some level of automation.
Speed up Your Migration
The economics of a technology-assist are compelling. A human operator can, at most, extract content from 20 - 40 pages/day. However, even this content may need to be processed further to correct links and clean up the encoding. We've seen extraction technology that achieves speeds of more than 1,000 pages/day, albeit in nearly ideal conditions.
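A quick calculation shows what those rates mean for a schedule. The sketch below uses the page rates quoted above; the 10,000-page site size is a hypothetical figure, and the result covers extraction only, for a single operator or a single extraction process.

```python
import math


def working_days(pages, pages_per_day):
    """Whole working days needed to extract `pages` at a given daily rate."""
    return math.ceil(pages / pages_per_day)


pages = 10_000  # hypothetical site size

print(working_days(pages, 20))    # 500 (manual, slow end of the range)
print(working_days(pages, 40))    # 250 (manual, fast end of the range)
print(working_days(pages, 1000))  # 10  (automated, near-ideal conditions)
```

Real projects will land somewhere in between, since analysis, rule design and review still take human time.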
Don't expect complete automation. Many tasks are still dependent on a manual process. Content analysis, the design of your site navigation, template design and building a review/approval workflow still require a skilled specialist. However, using extraction technology can remove tasks that are both highly error prone and time-consuming (see below).
Automate Consistency
Automation allows you to set rules against many pages at once when conducting your extraction. Some tools are also able to automatically identify the objects that can be classified as "presentation" to aid in the template design process. This level of consistency can save migration teams many weary nights reviewing and cleaning up the results of manual extractions.
As an example, we heard from a migration team monitoring the progress of an off-shore extraction effort. There were many variations in page styles, so rather than write specific rules for each page, general rules were given for the site to each off-shore resource. Problems arose as these rules were interpreted differently. For example, some people correctly identified and extracted "related link" elements, while others thought they were part of the presentation elements (and didn't extract them as content). An automated process would have grouped content into related "buckets", and allowed the extraction team to specify rules against each "bucket".
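One crude way to form such buckets is to group pages by the shape of their markup. The sketch below is our own heuristic, not how any particular migration product works: it treats the first few opening tags of a page as a "signature" on the assumption that pages built from the same template start the same way. The names `signature`, `bucket_pages` and the `depth` parameter are illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records opening tag names in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def signature(html, depth=25):
    """First `depth` opening tags of a page, used as a template fingerprint."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return tuple(collector.tags[:depth])


def bucket_pages(pages, depth=25):
    """Group a {url: html} mapping into buckets of pages with equal signatures."""
    buckets = defaultdict(list)
    for url, html in pages.items():
        buckets[signature(html, depth)].append(url)
    return dict(buckets)


pages = {
    "a.html": "<html><body><table><tr><td><h1>A</h1><p>x</p></td></tr></table></body></html>",
    "b.html": "<html><body><table><tr><td><h1>B</h1><p>y</p><p>z</p></td></tr></table></body></html>",
    "c.html": "<html><body><div><h2>C</h2><ul><li>q</li></ul></div></body></html>",
}
for urls in bucket_pages(pages, depth=6).values():
    print(urls)
```

Here a.html and b.html fall into one bucket and c.html into another, so one extraction rule can be written per bucket instead of per page. The right `depth` depends on how much of each page is template, and needs tuning against real content.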
Automate Encoding Conversion
Most source content contains a mixture of encodings. An encoding specifies the byte(s) used to represent each character. The issue is sometimes filed under "international support" or "multi-byte characters". Source content frequently needs to be converted from many different encodings to a common one. A typical choice for a CMS is UTF-8, an encoding of the Unicode Standard. Again, people working manually may struggle to convert character encodings consistently.
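The conversion step can be sketched in a few lines. The example below assumes you do not trust (or do not have) a declared charset and simply tries a short list of candidate encodings in order. The candidate list is an assumption to adapt to your own content; strict UTF-8 goes first because legacy single-byte text rarely happens to be valid UTF-8.

```python
# Assumed candidate encodings, tried in order. Adjust for your own content.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def to_utf8(raw, candidates=CANDIDATE_ENCODINGS):
    """Return (utf8_bytes, encoding_used) for a page's raw bytes."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return text.encode("utf-8"), encoding
    # Last resort: latin-1 accepts every byte value, so this never fails,
    # but the resulting text may be wrong and should be flagged for review.
    return raw.decode("latin-1").encode("utf-8"), "latin-1"


legacy_bytes = "caf\u00e9".encode("cp1252")
converted, used = to_utf8(legacy_bytes)
print(used)       # cp1252
print(converted)  # b'caf\xc3\xa9'
```

Recording which encoding was used for each page makes it easy to spot-check the pages that fell through to the last-resort branch.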
Automate Link Handling
Content contains many links to other places. These links are an essential part of the migrated content. Managing the interplay between links as pages are migrated becomes excessively complex as the size of a migration increases. The migration team needs to understand how links will be remapped to their new location within the CMS and how existing links on non-migrated content will change (e.g. how to handle page redirects).
This is not a trivial task. When asked about the one thing he wished he could change during his migration of the American Speech-Language-Hearing Association (ASHA) site, David Tauriello quickly responded with "links". David and his team migrated over 4,000 pieces of ASHA content over a 4-month period last year. They were forced to rely on a manual link solution. Nowadays automated solutions have features that allow the easy handling of link transformation as content is migrated.
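To make the idea concrete, here is a minimal sketch of link remapping. It assumes the team maintains a table from old absolute URLs to new CMS locations as pages are migrated. The table contents, the example.org URLs and the function name are hypothetical, and the regular expression only handles double-quoted href attributes, so production use would call for a real HTML parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical old-URL -> new-CMS-path table, built up as pages are migrated.
URL_MAP = {
    "http://www.example.org/about/staff.htm": "/cms/about/staff",
    "http://www.example.org/news/2003/award.htm": "/cms/news/award",
}

# Matches double-quoted href attributes only (a simplification).
HREF = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def remap_links(html, page_url, url_map=URL_MAP):
    """Rewrite links to migrated pages; return (new_html, untouched_links)."""
    untouched = []

    def replace(match):
        # Resolve relative links against the page's old location first.
        absolute = urljoin(page_url, match.group(2))
        if absolute in url_map:
            return match.group(1) + url_map[absolute] + match.group(3)
        untouched.append(absolute)
        return match.group(0)

    return HREF.sub(replace, html), untouched


html = '<a href="../about/staff.htm">Staff</a> <a href="http://www.nasa.gov/">NASA</a>'
new_html, untouched = remap_links(html, "http://www.example.org/news/index.htm")
print(new_html)
print(untouched)  # ['http://www.nasa.gov/']
```

The list of untouched links is as useful as the rewritten page: it contains external links to verify and internal links whose targets have not been migrated yet. The same table can also drive redirects from old URLs to new ones.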
Automate Content Cleaning
HTML-formatted content is almost never clean. It's easy to make mistakes in HTML, and many times these mistakes are amplified as formatting in presentation elements is re-used. HTML Tidy is the utility of choice here. From missing or mismatched tags to the incorrect placement of tags, Tidy quickly corrects markup. This is an essential part of any migration team's toolkit.
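Before repairing markup, it is often useful to know how bad it is. The sketch below is not Tidy and does not repair anything; it is a small report-only checker, written with Python's standard html.parser module, that flags unclosed and mismatched tags. It is deliberately stricter than HTML itself, which allows some end tags (such as those for paragraphs and list items) to be omitted. The class and function names and the list of void tags are our own.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base", "col"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records nesting problems."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.open_tags:
            self.problems.append(f"</{tag}> has no matching opening tag")
            return
        # Anything opened after `tag` and still open was never closed.
        while self.open_tags[-1] != tag:
            unclosed = self.open_tags.pop()
            self.problems.append(f"<{unclosed}> not closed before </{tag}>")
        self.open_tags.pop()


def find_markup_problems(html):
    """Return a list of human-readable well-formedness problems."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    leftovers = [f"<{tag}> never closed" for tag in checker.open_tags]
    return checker.problems + leftovers


print(find_markup_problems("<p><b>bold</p>"))
# ['<b> not closed before </p>']
```

Running a check like this across the source site gives an early count of pages that need cleaning, which helps when estimating the effort.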
Justify your Investment
Technology should always be examined in light of alternative options. In this case, the alternative is usually manual migration. Costs for a manual migration frequently average $20 - $40 per page, when all expenses are considered. These numbers come from our customers who have completed manual migrations. They include training time, cutting and pasting page information into the correct fields in the CMS database, creating CMS database records, cleaning up broken HTML, remapping links, adding suitable metatags, correcting any encoding problems, correcting any mistakes (this is a complex manual process) and review time. In our experience, automated solutions are cheaper than the manual alternatives. The main benefit is an immediate cost reduction compared to manual methods (on the order of 33% to 50% cheaper).
In addition, you'll get the acceleration of benefits from the CMS. For example, if you get your content migrated six months earlier using a technology-assist, that's six months' worth of extra CMS-related benefits you'll reap. In many cases, this benefit may be even larger than the total cost of the migration technology.
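These figures are easy to turn into a budget range. The sketch below applies the per-page costs and savings percentages quoted above to a hypothetical 5,000-page site. The page count and the function name are assumptions, and the result is only as good as the averages behind it.

```python
def migration_cost_range(pages, per_page=(20, 40), savings=(0.33, 0.50)):
    """Return ((manual_low, manual_high), (automated_low, automated_high)) in dollars."""
    low, high = per_page
    manual = (pages * low, pages * high)
    # Best case: cheapest manual figure with the largest saving.
    # Worst case: most expensive manual figure with the smallest saving.
    automated = (
        round(manual[0] * (1 - savings[1])),
        round(manual[1] * (1 - savings[0])),
    )
    return manual, automated


manual, automated = migration_cost_range(5_000)  # hypothetical page count
print(manual)     # (100000, 200000)
print(automated)  # (50000, 134000)
```

The wide spread in both ranges is a reminder to gather per-page figures from a pilot batch before committing to a budget.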
Develop a Menu of Options
The migration team should look at providing a menu of service options and prices, so that business units can understand what they are paying for. Business units can then determine the appropriate level of service at a price they can afford. The key driver may well be your content authors. These are expensive and highly-leveraged resources that also have day-jobs. We've found that some smaller sites can invest the time of their content authors to re-craft their sites within the CMS using existing templates. Other, much larger sites need help migrating their content, and can't divert content authors as readily.
Joseph Busch contributed to this article. He is the founder and principal of Taxonomy Strategies, a firm specializing in the development of metadata frameworks and taxonomy strategies for Global 2000 companies, government agencies and NGOs.


