Technical background to the EPU site

Emblem Project Utrecht

1 Introduction

This page briefly describes the technical background of this website. We first introduce the techniques we have been applying and their wider possibilities, and then give some details of the procedures we used.

1 Introduction
2 XML, TEI, XSLT, HTML
3 EPU Use of These Techniques
- 3.1 Rendering XML

2 XML, TEI, XSLT, HTML

2.1 HTML

Web pages are basically built using a markup language called HTML, the Hypertext Markup Language. HTML provides the framework for web pages; other techniques or languages can be used in conjunction with it to enhance its possibilities.

However, HTML has some important limitations:

HTML pages are static; there is no inherent mechanism to interact with the page's viewer except for taking them to another page.
HTML is geared towards the presentation of text, pictures, etc., rather than towards describing what that content is.
For example, there is no way for a browser program (Netscape or Internet Explorer) to know that a few short lines following each other may, taken together, be the lines of a single poem. As far as the browser is concerned, they might be the ingredients for a recipe or a list of foreign cities. As a result, to change the way a website looks after it has been constructed, the author has no choice but to laboriously edit the pages. Also, if the author decides that the poetry website needs a first-line index, it has to be built manually. This involves going into the pages, copying the first lines of the poems, putting them into another page and then creating the links to make the index work.
Much of HTML's development was done by competing software companies. There is a formal specification for HTML which browsers should adhere to; however, a web page viewed with Netscape may look quite different from the same page viewed with Internet Explorer.

Of course, since the inception of the World Wide Web, a lot of work has been done to take away these limitations with the use of other software.

Script languages like JavaScript or Visual Basic can add dynamic features to web pages (however, many script languages remain vendor-specific).
Add-on products such as Flash can be used to add all kinds of modern effects to web pages. However, these products require the user to install extra software, and they do not help in separating presentation from content.
Content-management software may be used to separate form and content. These programs work by storing a web page’s content in a database and adding the presentational logic when a user asks for a certain page. Note that many of these packages store the page’s content in a proprietary format.
Using Java, full-blown applications may be run from inside the browser window.
Finally, many techniques exist for web pages to interact with database systems. The dynamic part of a page’s content can be found in the database, while the static part can come from a content-management system.

2.2 XML

XML, or eXtensible Markup Language, can be seen as the main ingredient in trying to overcome HTML’s limitations. It was designed with the following principles in mind:

It separates form and content: an XML document should not contain information about the document's presentation or rendering. In fact, the same XML document may be rendered in many widely different ways (on the web, as a PDF document suitable for Acrobat Reader, or even on a mobile phone).
This does not preclude an XML document from containing rules for presentation of some kind. But in that case, the XML document should contain nothing but these rules; the content to which these rules are applied will be found in a separate document.
An XML document should describe its contents. This means that, besides the text, it should contain markup, much like HTML. For example, the markup will say that 'this is a recipe', or even within the recipe, 'this is an ingredient' and 'this is the amount of the ingredient you will need'.
Documents which are built this way lend themselves to automatic processing. Taking the recipe example: software can build a list of the ingredients for several recipes, if the self-describing information in the document clearly indicates which parts of the text contain the ingredient information.
It should be based on public standards.
XML files should also be platform-independent.
It should be extensible because it is impossible to define all the kinds of information people may want to store. There should be a mechanism which allows people and organisations to define new vocabularies, as the need arises.
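
The recipe example above can be made concrete with a small sketch in Python. The recipe vocabulary used here (the element names recipes, recipe and ingredient, and the attributes title and amount) is invented for this illustration and is not an existing standard. Because the markup states which parts of the text are ingredients, a few lines of code are enough to collect them from several recipes.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe vocabulary: all element and attribute names are
# invented for this illustration.
RECIPES = """
<recipes>
  <recipe title="Pancakes">
    <ingredient amount="250 g">flour</ingredient>
    <ingredient amount="2">eggs</ingredient>
    <ingredient amount="500 ml">milk</ingredient>
  </recipe>
  <recipe title="Omelette">
    <ingredient amount="3">eggs</ingredient>
    <ingredient amount="1 tbsp">butter</ingredient>
  </recipe>
</recipes>
"""


def shopping_list(xml_text):
    """Return the sorted, de-duplicated ingredient names of all recipes."""
    root = ET.fromstring(xml_text)
    # iter() visits every <ingredient> element, whatever recipe it is in.
    names = {item.text.strip() for item in root.iter("ingredient")}
    return sorted(names)


print(shopping_list(RECIPES))
```

The same list could not be extracted reliably from an HTML page in which the ingredients are merely short lines of text.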

An XML vocabulary designed for a specific domain is formally defined in a Document Type Definition (DTD). DTDs are files with a syntax of their own; they describe the kind of markup allowed within XML files. XML files that follow these rules are said to conform to the DTD.

Recently, there has been a move away from DTDs, which are being replaced by 'schemas'. Schemas can express more constraints than DTDs can, and they have the added advantage of being XML documents themselves.

XML has, in a short time, become very popular as a storage and interchange format for text and data. It is used on the World Wide Web, in conventional programming and in publishing environments. It has been endorsed as a recommendation by the W3C, the body which establishes WWW standards.

XML’s capacities and popularity provide sound reasons to use its encoding techniques in building text corpora for the web.

2.3 Text Encoding Initiative

As the name implies, XML was designed to be extensible. Such an extension has been provided by the Text Encoding Initiative (TEI), which has built an XML vocabulary for text markup in the humanities. The vocabulary focuses on encoding structural, interpretative and grammatical features of widely disparate text types; it is used for dictionaries, novels and poetry, to name a few. Originally developed for SGML, the vocabulary has been used since the beginning of the 1990s in a wide range of European and American universities and libraries.
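
As an illustration of what such encoding makes possible, the sketch below builds a first-line index from a small TEI-style fragment. The element names lg (line group) and l (verse line) are taken from the TEI vocabulary; the surrounding structure, the attribute values and the sample text are made up for this example, and a real TEI document would carry a header and considerably more markup.

```python
import xml.etree.ElementTree as ET

# A tiny TEI-style fragment. <lg> and <l> are TEI element names; the
# attribute values and the text are invented for this illustration.
POEMS = """
<div>
  <lg type="poem" n="1">
    <l>First line of the first poem</l>
    <l>Second line of the first poem</l>
  </lg>
  <lg type="poem" n="2">
    <l>Another poem opens here</l>
    <l>And continues here</l>
  </lg>
</div>
"""


def first_line_index(xml_text):
    """Return (first line, poem number) pairs, sorted by first line."""
    root = ET.fromstring(xml_text)
    entries = []
    for poem in root.iter("lg"):
        first = poem.find("l")  # the first <l> child of this line group
        if first is not None and first.text:
            entries.append((first.text.strip(), poem.get("n")))
    return sorted(entries)


for line, number in first_line_index(POEMS):
    print(f"{line} (poem {number})")
```

Because the index is computed from the markup, it does not have to be rebuilt by hand when poems are added or corrected.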

Though from its inception TEI was primarily oriented towards capturing text features, the Guidelines also fully allow for describing and indexing image material. However, at present there are no schemas available for TEI encoding.

2.4 XSLT

After encoding a text using XML, something more is needed to make the results available to a reader. After all, the raison d’être of XML is the desire to separate content and form. A way to transform the XML document into a web page (or any other kind of document) is clearly needed.

Though there are many ways to accomplish this, the easiest way is to use XSLT, or eXtensible Stylesheet Language Transformations. Using XSLT (itself an XML format), rules may be specified which define transformations to be applied to the contents of an XML document. The result of a transformation is either a new XML document, a plain text file or an HTML document which is suitable for viewing over the web.

The word 'stylesheet' may suggest that XSLT is something like the Cascading Style Sheets (CSS) language used to define styles for web documents. However, XSLT is much more powerful than that. It is a complete programming language which allows for counting, sorting, selecting and changing any part of the XML document.
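
The idea of rule-based transformation can be sketched in Python. Note that this is not XSLT: Python's standard library has no XSLT processor, so the sketch only imitates the principle of template rules, with one output pattern per element name. The poem vocabulary and the RULES table are hypothetical and serve purely as an illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document; the element names are invented.
SOURCE = """
<poem>
  <title>An Example</title>
  <line>First line</line>
  <line>Second line</line>
</poem>
"""

# One output pattern per element name, loosely comparable to the
# template rules of an XSLT stylesheet.
RULES = {
    "poem": '<div class="poem">{body}</div>',
    "title": "<h2>{body}</h2>",
    "line": "<p>{body}</p>",
}


def render(element):
    """Apply the rule for this element, then recurse into its children."""
    text = escape((element.text or "").strip())
    children = "".join(render(child) for child in element)
    # Elements without a rule are replaced by their contents.
    # Text that follows a child element (its "tail") is ignored here.
    rule = RULES.get(element.tag, "{body}")
    return rule.format(body=text + children)


html_page = render(ET.fromstring(SOURCE))
print(html_page)
```

Changing the look of every poem on a site then means changing one rule, not editing every page.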

2.5 HTML Once More

There are several ways to configure this transformation process. The XML document may be sent to the web page viewer; the transformation into HTML can then be done by the browser program, which applies an XSLT stylesheet. Another possible configuration is that the web server loads the XML file and transforms it into HTML when the viewer requests a certain page. In the simplest configuration, the transformations are run only once, resulting in HTML files which may then be stored and handled like any other HTML file.

These configurations have their advantages and disadvantages. Transforming XML in the browser will only work if the user has a browser installed which can handle XML and XSLT stylesheets. At present there are no browsers which handle XSLT correctly out of the box. Dynamic transformation on the server, upon request, presupposes that suitable software is installed on the server. Static, one-time-only transformation gives less flexible results than one might want. In this project, we are working with software (Cocoon) that transforms our XML data 'on the fly', meaning that our server provides dynamically generated HTML files to any visitor to our site.
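
The difference between the static and the dynamic server-side configuration can be shown with a minimal sketch. It does not describe how Cocoon works internally; the page vocabulary, the page name and the functions to_html, serve_static and serve_dynamic are all hypothetical, and a dictionary stands in for the XML files on the server.

```python
import xml.etree.ElementTree as ET
from html import escape


def to_html(xml_text):
    """Turn a hypothetical <page> document into an HTML fragment."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")
    parts = ["<h1>" + escape(title) + "</h1>"]
    for para in root.iter("para"):
        parts.append("<p>" + escape(para.text or "") + "</p>")
    return "".join(parts)


# Stand-in for XML files on disk; the page name is hypothetical.
sources = {"intro": "<page><title>Intro</title><para>Old text</para></page>"}

# Static configuration: transform once and store the resulting HTML.
static_site = {name: to_html(xml) for name, xml in sources.items()}


def serve_static(name):
    """Return the stored HTML, whatever has happened to the XML since."""
    return static_site[name]


def serve_dynamic(name):
    """Transform the current XML again on every request."""
    return to_html(sources[name])


# After the XML is edited, only the dynamic configuration reflects
# the change without a rebuild of the stored pages.
sources["intro"] = "<page><title>Intro</title><para>New text</para></page>"
print(serve_static("intro"))
print(serve_dynamic("intro"))
```

The static variant needs no transformation software at request time, at the price of a rebuild after every change to the source files.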

3 EPU Use of These Techniques

3.1 Rendering XML

At the Emblem Project Utrecht, we have decided to use XML and the TEI vocabulary to encode our editions of the emblem books. The HTML pages are generated using Cocoon; the search option of this site runs on Lucene.

All EPU files can be found elsewhere on this site; see the option 'Project' in the top menu.
