Cocoon
XML Web Publishing
by John Callahan
16-Dec-2002

Web developers and sysadmins often concern themselves with handling traffic loads, bottlenecks, and other kinds of performance scalability. But there's more to scalability.
Most web sites aren't content scalable. This is because they cannot easily publish content in different languages (e.g., English, Spanish, etc.) to multi-channel devices (e.g., browsers, PDAs, etc.). Furthermore, the content must often be secured, personalized, and made available in multiple formats (e.g., HTML, spreadsheets) or as a web service. While content management systems (CMS) can organize content (e.g., articles, databases, fragments), a delivery, or "publishing," system must still sit in front of the CMS to render that content in many formats, for many devices, and in many languages.
The authors of the Apache Cocoon project faced similar problems when designing scalable solutions for web-based banking systems. The delivery technologies of the day (e.g., ASP, JSP, CFM) did not, on their own, provide scalable content solutions for complex, multi-channel, interactive web sites. So the Cocoon authors invented an XML publishing framework that provides the scalability and flexibility needed to publish multi-channel content for a variety of devices, languages, and formats.
The Power of Pipeline Processing
Cocoon solves the content scalability problem using the concept of "pipelines." A Cocoon pipeline associates a Uniform Resource Locator (URL) with an XML publishing process that consists of three stages:
- Generation
- Transformation(s)
- Serialization
Let's look at each stage.
First, XML content is generated by a Cocoon "generator" component. A generator is a pipeline component that acts like an adapter: it reads data from a source (e.g., a flat file, a database, or another URL) and converts it into XML if necessary. In a CMS environment, the generator would tap your CMS package repositories. For example, the HTMLGenerator component (a standard Cocoon generator) reads HTML from a file or URL and converts it into XHTML. This conversion is necessary because the rest of the pipeline depends on the generator to produce well-formed XML.
Next, a Cocoon pipeline may optionally contain one or more transformations that consume XML as input and produce XML as output. A Cocoon "transformer" is a pipeline component that performs XML transformations by replacing, adding, or deleting XML content. The XSLTTransformer component is a standard Cocoon component that uses XSLT to accomplish a transformation.
Finally, a Cocoon "serializer" converts the XML content into an output stream. The XML content that leaves the last transformer (if there is one) must be sent over some transmission medium (usually an HTTP response stream) to the client that requested the resource. This means converting the XML content into a byte stream or a document format (e.g., RTF, PDF, MS Excel, etc.). There are many types of Cocoon serializers for converting XML content into PDF, JPG, and many other formats, depending on the needs of the requesting device or user.
Figure 1: A sample pipeline
A Simple Example
Figure 1 depicts an example of a Cocoon pipeline that contains three components: a generator, a transformer, and a serializer. The generator reads a comma-separated value (CSV) file and converts it into XML. The generator converts the entire file by bracketing the contents with a <csv:root> tag, marking each row of the file with a <csv:row> tag, and each comma-separated element with a <csv:datum> tag. After the generator, the transformer will style the generator's XML output into XHTML by replacing the <csv:root> tag with an XHTML <table> tag, the <csv:row> tag with an XHTML <tr> tag, and each <csv:datum> tag with an XHTML <td> tag. Finally, the serializer turns the XHTML into regular HTML with headers, meta, and other tags appropriate for rendering the content in a web browser.
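To make the generator stage concrete, here is a sketch of what its output might look like for a tiny CSV file. The sample data (the "name" and "quantity" columns and their values) is invented for illustration and is not taken from the article's data1.csv; the element names and namespace are the ones used in this example.

```xml
<!--
  Hypothetical input file (invented sample data):

    name,quantity
    apples,3
-->
<csv:root xmlns:csv="http://apache.org/cocoon/csv/1.0">
  <!-- one csv:row per line of the file -->
  <csv:row>
    <!-- one csv:datum per comma-separated element -->
    <csv:datum>name</csv:datum>
    <csv:datum>quantity</csv:datum>
  </csv:row>
  <csv:row>
    <csv:datum>apples</csv:datum>
    <csv:datum>3</csv:datum>
  </csv:row>
</csv:root>
```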
Note that the generator produces XML elements in the http://apache.org/cocoon/csv/1.0 namespace with the "csv" prefix. This is important because transformers are usually designed to be "namespace safe": they operate only on elements within a specific namespace to perform their transform function. All other elements are usually passed through without modification. In this example, the transformer converts only the "csv" prefix elements (those elements in the http://apache.org/cocoon/csv/1.0 namespace) into XHTML.
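A stylesheet that performs this kind of namespace-safe conversion could be sketched as follows. This is an illustrative guess at what a file such as csv2html.xsl might contain, not the stylesheet shipped with the article's example. It maps the three "csv" elements to table markup and copies everything else through unchanged; for brevity, the output elements are left in no namespace.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:csv="http://apache.org/cocoon/csv/1.0"
    exclude-result-prefixes="csv">

  <!-- csv:root becomes the table -->
  <xsl:template match="csv:root">
    <table>
      <xsl:apply-templates select="csv:row"/>
    </table>
  </xsl:template>

  <!-- each csv:row becomes a table row -->
  <xsl:template match="csv:row">
    <tr>
      <xsl:apply-templates select="csv:datum"/>
    </tr>
  </xsl:template>

  <!-- each csv:datum becomes a table cell -->
  <xsl:template match="csv:datum">
    <td><xsl:value-of select="."/></td>
  </xsl:template>

  <!-- identity rule: anything outside the csv namespace is copied as-is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The identity rule at the end is what makes the sketch "namespace safe": the more specific csv templates take precedence over it, so only elements in the csv namespace are rewritten.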
In short, the generator normalizes content using XML. The transformer contextualizes it with XSLT. And finally, the serializer makes the final content consumable in a particular environment. This is exciting, because it means that you don't need to solve the ubiquitous "many-to-many" content problem (having many types of content inputs, needing many types of content outputs) within your CMS environment. Most CMS packages are not particularly adept at those types of conversions and even for those that provide such services, the combinations can quickly get overwhelming. Using Cocoon can enable you to focus your content management efforts on things CMS packages are actually quite good at, like library services and workflow.
The Sitemap: The Heart of Cocoon
Since the options for generating, transforming, and serializing content could be nearly endless, you might be wondering how Cocoon references available services and invokes any specific pipeline from among the permutations available to it. Cocoon addresses these issues through the notion of a "sitemap."
The fundamental purpose of a Cocoon sitemap is to associate URLs with pipelines -- but there is a bit more to it than that. At a minimum, a Cocoon sitemap file consists of two sections:
- A components section
- A pipelines section
The components section is used to declare and parameterize all the generators, transformers, and serializers that will be used in your pipelines. Think of this as a directory of available services from which to build pipelines. Then the pipelines section contains one or more pipeline descriptions that associate particular URLs (or URL patterns) with a specific Cocoon pipeline.
Figure 2 is a complete sitemap for the CSV example above. The map:components section contains declarations for the CSVGenerator, TraxTransformer, and HTMLSerializer. (Note: the CSVGenerator is NOT included in the standard Cocoon distribution. It is available in the example associated with this article at http://www.sphere.com/docs/myapp.zip.) The map:components section in Figure 2 also declares a WildcardURIMatcher component as a Cocoon "matcher" and, within its map:pipelines subsection, at least one pipeline type (the CachingProcessingPipeline).
Each sub-section of the map:components section declares a "default" component, which gets used if a pipeline does not identify a component where required for a particular stage. The "src" attribute of each component specifies the Java class that implements a given component.
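A components section along these lines might look like the sketch below. It is an approximation under stated assumptions, not a copy of the article's sitemap: the component names ("csv", "xslt", "html", "wildcard", "caching") are illustrative, the CSVGenerator class name is hypothetical (the real class ships with the article's example), and element names in this section can differ between Cocoon releases.

```xml
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:components>

    <!-- default="..." names the component used when a pipeline stage gives no type -->
    <map:generators default="csv">
      <!-- hypothetical class name, used here only for illustration -->
      <map:generator name="csv" src="com.example.CSVGenerator"/>
    </map:generators>

    <map:transformers default="xslt">
      <map:transformer name="xslt"
          src="org.apache.cocoon.transformation.TraxTransformer"/>
    </map:transformers>

    <map:serializers default="html">
      <map:serializer name="html" mime-type="text/html"
          src="org.apache.cocoon.serialization.HTMLSerializer"/>
    </map:serializers>

    <map:matchers default="wildcard">
      <map:matcher name="wildcard"
          src="org.apache.cocoon.matching.WildcardURIMatcher"/>
    </map:matchers>

    <!-- pipeline types, declared inside map:components as described above -->
    <map:pipelines default="caching">
      <map:pipeline name="caching"
          src="org.apache.cocoon.components.pipeline.impl.CachingProcessingPipeline"/>
    </map:pipelines>

  </map:components>

  <!-- the map:pipelines section that maps URLs to pipelines goes here -->
</map:sitemap>
```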
After the map:components section of the sitemap file, we can associate specific URLs or URL patterns with Cocoon pipelines in the map:pipelines section. Each pipeline is declared by a map:pipeline sub-section within the map:pipelines section. Within a map:pipeline sub-section, there may be several instances of the components declared in the map:components section.
Figure 2: A complete sample sitemap file
It's In The Pipe
The most common component instance within a map:pipeline is a "matcher" component. A matcher is a Cocoon component that does not process XML, but routes requests for URLs to specific pipeline processes.
A single map:pipeline section may have multiple matchers, each declared using the map:match element. In Figure 2, we declare a matcher (using map:match) for the URL pattern "index.html". This URL is associated with a pipeline that includes a CSVGenerator, a TraxTransformer, and an HTMLSerializer in sequence. If a request for "index.html" is received, the default generator (i.e., CSVGenerator) reads the comma-separated values from the file data1.csv (the "source"), as specified by the "src" attribute on the map:generate directive, and converts them into XML as described above. Next, the default transformer (i.e., TraxTransformer) converts the CSV namespace elements into XHTML. Finally, the default serializer (i.e., HTMLSerializer) converts the XHTML into plain HTML for consumption by the client's browser. Figure 3 shows the pipeline result of the request in a browser.
Figure 3: Pipeline result of request for index.html
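The "index.html" pipeline relies on the default components for every stage. As a sketch, the same pipeline could also name each component explicitly with a "type" attribute. The names "csv", "xslt", and "html" are assumptions here; they must match whatever names the components were given in the map:components section.

```xml
<map:match pattern="index.html">
  <!-- type="..." selects a declared component by name instead of the default -->
  <map:generate type="csv" src="data1.csv"/>
  <map:transform type="xslt" src="csv2html.xsl"/>
  <map:serialize type="html"/>
</map:match>
```

Spelling out the type is mostly useful once a sitemap declares several generators or serializers and a pipeline needs one that is not the default.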
Cocoon in J2EE
A Cocoon application is deployed in a Java 2 Enterprise Edition (J2EE) environment as a web application. If your J2EE web applications directory is the c:/tomcat/webapps/ directory, then the sitemap is found in the c:/tomcat/webapps/myapp/sitemap.xmap file (where "myapp" is your web application sub-directory). Typically, the directory structure for a Cocoon application looks like:
c:/tomcat/webapps/myapp
    csv2html.xsl        <-- an XSLT stylesheet
    data1.csv           <-- a data file
    example.html        <-- a plain HTML file
    logo.jpg            <-- a JPEG image
    sitemap.xmap        <-- the root sitemap file
    WEB-INF             <-- the web application hidden sub-directory
        classes         <-- subdirectory containing custom Java components
        cocoon.xconf    <-- the Cocoon configuration file
        lib             <-- library of JAR files for Cocoon et al.
        logkit.xconf    <-- a Cocoon logger configuration file
        web.xml         <-- the web application deployment descriptor file
This is a typical directory layout for most J2EE web applications. Indeed, the deployment descriptor in the web.xml file is a standard J2EE web.xml file that maps ALL incoming URL requests to the Cocoon servlet. If you wish, you can edit the web.xml deployment descriptor so that other resources, including JavaServer Pages (JSPs), XML files, and images, can be used within the Cocoon application.
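A minimal deployment descriptor of this kind might look like the following sketch. It assumes a Servlet 2.3 container such as Tomcat 4; the servlet name "Cocoon" is arbitrary, and the web.xml shipped with the example application is likely to carry more init parameters than shown here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE web-app PUBLIC
    "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>

  <servlet>
    <servlet-name>Cocoon</servlet-name>
    <servlet-class>org.apache.cocoon.servlet.CocoonServlet</servlet-class>
    <!-- tells the servlet where to find the Cocoon configuration file -->
    <init-param>
      <param-name>configurations</param-name>
      <param-value>/WEB-INF/cocoon.xconf</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
  </servlet>

  <!-- "/" makes Cocoon the default servlet, so it receives every request
       that no other mapping claims -->
  <servlet-mapping>
    <servlet-name>Cocoon</servlet-name>
    <url-pattern>/</url-pattern>
  </servlet-mapping>

</web-app>
```

Adding further servlet-mapping entries (for example, for JSPs) is how other resources can be routed around Cocoon.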
Note that you can edit the sitemap file directly (sitemap.xmap) without stopping and restarting the J2EE server. Cocoon will recognize the change in the sitemap file. This capability adds a powerful level of debugging, experimentation and flexibility to Cocoon.
The sitemap in Figure 2 associates the seemingly rather spare URL "index.html" with the given pipeline, but if the application is deployed in a standard J2EE server within a web application called "myapp" on the local server, then the resource is most likely available from a browser at the following URL:
http://localhost:8080/myapp/index.html
A sample Cocoon application that accompanies this article is available at http://www.sphere.com/docs/myapp.zip for download and installation on Win32 and UN*X platforms. The example is designed to run in Apache Tomcat 4 or above, but should also run in other J2EE containers including Resin, WebSphere, WebLogic, JRun, CFMX, and JBoss.
Mixing in Non-Cocoon Content
What if you just want to serve plain HTML files from the root directory? In that case, you'll need to use a "reader" to serve the file directly to the requesting client. To be precise, a Cocoon pipeline can be composed of one of the following:
- a generator, optionally followed by one or more transforms, ending with a serializer, OR
- a single reader, OR
- a redirect, OR
- an aggregator, optionally followed by one or more transforms, ending with a serializer
To serve HTML files directly from the root directory of the web application, first add a "reader" component in the map:components section of the sitemap as follows:
...
</map:serializers>
<map:readers default="resource">
  <map:reader logger="sitemap.reader.resource" name="resource"
      pool-max="32"
      src="org.apache.cocoon.reading.ResourceReader"/>
</map:readers>
<map:matchers default="wildcard">
...
Next, add a second map:match below the map:match for the "index.html" URL, within the same map:pipeline sub-section of the map:pipelines section, as follows:
<map:pipelines>
  <map:pipeline>
    <map:match pattern="index.html">
      <map:generate src="data1.csv"/>
      <map:transform src="csv2html.xsl"/>
      <map:serialize/>
    </map:match>
    <map:match pattern="*">
      <map:read src="{1}"/>
    </map:match>
  </map:pipeline>
</map:pipelines>
The "*" pattern will catch ALL root-level resource requests. The "{1}" refers to the string that matched the "*" in the map:match pattern. URL patterns specified in map:match elements within a map:pipeline sub-section are matched from the top down. In this case, everything EXCEPT "index.html" is caught by the last map:match of the map:pipeline. Try requesting the "example.html" resource from the example application associated with this article to see the wildcard matcher in action.
May I Re-Direct Your Attention?
What if the URL request does not include the "index.html" resource explicitly? In other words, if the request is simply
http://localhost:8080/myapp/
then there is no associated pipeline and Cocoon returns an error page (more on error handling in Part II of this article). To handle this case, we can use a map:redirect-to element within a map:match that has an empty (null) pattern, as follows:
<map:pipelines>
  <map:pipeline>
    <map:match pattern="">
      <map:redirect-to uri="index.html"/>
    </map:match>
    <map:match pattern="index.html">
      ...
    </map:match>
    ...
  </map:pipeline>
</map:pipelines>
The null pattern will catch the case of an empty URL and redirect the client's browser to the new resource.
Summary
We have only scratched the surface of the flexibility of Apache Cocoon in this article. There are many different types of generators, transformers, serializers, matchers, pipelines, and readers. In Part II of this article, we will explore more advanced concepts of Cocoon including error handling, aggregators, actions, views, and how to write your own Cocoon components (like the CSVGenerator above).
So here's the case for Cocoon. It handles content scalability problems at a web site level rather than within each web page. Furthermore, Cocoon can act as a flexible front-end publishing system for many content management systems (CMS). A CMS can be used to organize the content so that Cocoon pipelines can be configured to publish that information in many formats, for many devices, and in many languages.
Please download the accompanying example to explore the power of Apache Cocoon for yourself on any J2EE platform. The example is available at http://www.sphere.com/docs/myapp.zip for UN*X, Windows NT, Windows 2000, and Windows XP platforms.


