TEI Meeting: Day 1

TEI Meeting: Day 1 Recap
This week I’m attending the Text Encoding Initiative’s annual meeting (TEI@20). The TEI is, at heart, a scholarly effort to develop a tag set for encoding, or marking up, documents in the humanities. Documents in this case are defined quite broadly to include books, manuscripts, music, even physical objects like gravestones. Once encoded, these digital documents can be displayed on the web or analyzed and reused in a variety of ways. The tag set, described in “The Guidelines,” is extensible: it can be extended to describe specific types of documents, for example, letters. It has influenced other standardized tag sets and even the creation and development of XML itself. In fact, some of the TEI creators were heavily involved in the creation of XML.
The TEI is 20 years old this year, which means it predates XML, predates the web, and even predates UVM’s connection to the Internet. Despite its age, the TEI still manages to be on the cutting edge of digitization efforts. This year the TEI debuts “P5,” a reconceptualization of the tag set and guidelines that takes advantage of recent developments in XML, especially schemas. P5 is even more modular, more open, and more flexible than the previous version. It also incorporates more tags and adjusts some tags to align more closely with developing ISO and W3C standards.
The first day was devoted to a workshop introducing P5. While the overview of new tags was welcome, the most interesting part of the day for me was the all-too-brief section on the new modular structure of P5 schemas. How the modules fit together has been a source of confusion. A key principle of the TEI is that you should use only those tags that you need. Some tags would be “core,” i.e., needed by all documents, while others would be optional. The TEI called this the “Pizza” method (all pizzas need a crust but not all pizzas need pepperoni) and provided tools for choosing which “toppings” should be built into a DTD. This DTD would then control which tags could be used to mark up the document. Unfortunately, the DTD itself is not written in XML. Schemas, on the other hand, serve the same function but are written in XML, so they can be modularized and processed like any XML file, a fact that makes for some interesting possibilities.
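To make the pizza idea concrete, here is a small Python sketch of my own (not something shown at the workshop). It treats each module as a set of tag names, always includes a "crust" of required modules, adds whatever "toppings" you ask for, and then reports which tags in a document fall outside the resulting set. The function names (`build_tag_set`, `unknown_tags`) and the choice of which modules count as the crust are my assumptions, and the element lists are heavily abbreviated for illustration. It also only checks tag names; a real DTD or schema additionally controls nesting, attributes, and content.

```python
import xml.etree.ElementTree as ET

# Illustrative, heavily abbreviated element lists (an assumption for this
# sketch); the real TEI modules define many more elements than shown here.
MODULES = {
    "core": {"p", "title", "head", "hi", "note", "list", "item"},
    "header": {"teiHeader", "fileDesc", "titleStmt", "publicationStmt", "sourceDesc"},
    "textstructure": {"TEI", "text", "body", "div"},
    "drama": {"castList", "castItem", "role"},
    "msdescription": {"msDesc", "msIdentifier"},
}

# The "crust": modules this sketch assumes every document needs.
CRUST = ("core", "header", "textstructure")


def build_tag_set(toppings=()):
    """Combine the crust with any optional modules into one set of allowed tags."""
    allowed = set()
    for name in CRUST + tuple(toppings):
        allowed |= MODULES[name]
    return allowed


def unknown_tags(xml_text, allowed):
    """Return the sorted tag names used in the document that are not allowed."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}name"; keep only the name.
    found = {el.tag.split("}")[-1] for el in root.iter()}
    return sorted(found - allowed)


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <castList><castItem><role>Hamlet</role></castItem></castList>
    <p>A paragraph.</p>
  </div></body></text>
</TEI>"""

# Crust only: the drama tags are not part of the tag set.
print(unknown_tags(SAMPLE, build_tag_set()))
# Add the "drama" topping and the same document passes the name check.
print(unknown_tags(SAMPLE, build_tag_set(["drama"])))
```

With only the crust, the three drama tags are flagged; once the drama module is added, nothing is. That is the whole appeal of the approach: the tag set stays as small as the project allows.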
With the move to schemas, however, comes a challenge. For example, the XML editor OxygenXML comes with a library of “frameworks” that includes both TEI P4 and P5 DTDs and schemas. These schemas are expressed as a series of modules. Choosing which modules to use, how to combine them, and especially how to arrange them so that you can use them both locally and from a server is not trivial. Fortunately, the TEI has also created ROMA, a tool that lets you choose which modules you need and then combines them into a single RELAX NG file. Easy. The rest is just getting familiar with the tag set. Of course, the current set numbers over 400 tags, so there’s plenty to learn.
