Schema and customisation: producing valid TEI

The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the example of Arabic newspapers.

Slides were produced using MultiMarkdown, Pandoc, and the Slidy JS code of the W3C.

Customising the TEI

We will cover:

Every use of the TEI involves making use of a customisation of the TEI.

Terminology again

What is a module?

Which modules are available?

Module name     Chapter of the P5
analysis        Simple analytical mechanisms
certainty       Certainty and responsibility
core            Elements available in all TEI documents
corpus          Language corpora
dictionaries    Dictionaries
drama           Performance texts
figures         Tables, formulae, and graphics
gaiji           Representation of non-standard characters and glyphs
header          The TEI header
iso-fs          Feature structures
linking         Linking, segmentation, and alignment
msdescription   Manuscript description
namesdates      Names, dates, people, and places
nets            Graphs, networks, and trees
spoken          Transcription of speech
tagdocs         Documentation elements
tei             The TEI infrastructure
textcrit        Critical apparatus
textstructure   Default text structure
transcr         Representation of primary sources
verse           Verse

How do you choose?

Here comes Roma: a command-line script with a web front end, designed to make this process much easier.

Roma: design a new schema

Screen shot of Roma

Roma: customise

Screen shot of Roma

Roma: schema

Screen shot of Roma

Roma: documentation

Screen shot of Roma

What did we just do?

We processed a pre-existing ODD file which contained (as well as some discursive prose) the following schema specification:

<schemaSpec ident="tei_bare" start="TEI">
    <moduleRef key="core"/>
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="textstructure"/>
    <elementSpec ident="abbr" mode="delete" module="core"/>
    <elementSpec ident="add" mode="delete" module="core"/>
    <!-- ... -->
    <elementSpec ident="trailer" mode="delete" module="textstructure"/>
    <elementSpec ident="title" mode="change" module="core">
        <attList>
            <attDef ident="level" mode="delete"/>
        </attList>
    </elementSpec>
    <!-- ... -->
</schemaSpec>

We selected four modules, deleted loads of elements, and also deleted an attribute.
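
Deleting elements one by one is not the only way to trim a module. As a minimal sketch, assuming the ODD processor in use supports the @include and @except attributes of <moduleRef>, a comparable selection can be stated positively. The schema name and the element lists below are illustrative only and are not taken from the course files.

```xml
<!-- Sketch only: the ident and the element lists are hypothetical -->
<schemaSpec ident="tei_newspapers_sketch" start="TEI"
    xmlns="http://www.tei-c.org/ns/1.0">
    <!-- the tei module supplies classes and datatypes, so it is taken whole -->
    <moduleRef key="tei"/>
    <!-- include: keep only the listed elements from the module -->
    <moduleRef key="core" include="p head title hi list item"/>
    <moduleRef key="header"
        include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
    <!-- except: keep everything in the module apart from the listed elements -->
    <moduleRef key="textstructure" except="trailer"/>
</schemaSpec>
```

Listing what to keep tends to be shorter than a long run of mode="delete" specifications when only a handful of elements are wanted.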

Roma provides an interface to the detail

Roma: select modules

Screen shot of Roma

Roma: edit modules

Screen shot of Roma

What do we need for our newspapers?

A simple selection of elements, but also

Other constraints are possible: we might want to insist that a <div type="bill"> contains only <div type="section"> and <div type="article">, and that the latter be numbered through an @n attribute.

The ODD advantage

We can express these constraints in our ODD meta-schema, and then generate a formal schema to enforce them using whichever schema language we like.

Roma: select attributes

Screen shot of Roma

Roma: constrain attribute values

Screen shot of Roma

What did we just do?

Our ODD now includes something like this:

<elementSpec ident="div" mode="change" module="textstructure">
    <attList>
        <attDef ident="type" mode="change" usage="req">
            <valList mode="replace" type="closed">
                <valItem ident="section"/>
                <valItem ident="article"/>
                <valItem ident="verse"/>
                <valItem ident="masthead"/>
                <valItem ident="bill"/>
                <valItem ident="letter"/>
                <!-- ... -->
            </valList>
        </attDef>
    </attList>
</elementSpec>
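
To give a feel for what this specification turns into, here is an approximate sketch of the RELAX NG that a schema generated from it might contain for the attribute. The real output carries additional annotations and pattern names, so treat this only as the general shape.

```xml
<!-- Approximate shape only; generated schemas add documentation and naming -->
<attribute name="type" xmlns="http://relaxng.org/ns/structure/1.0">
    <!-- a closed value list becomes a choice between fixed values -->
    <choice>
        <value>section</value>
        <value>article</value>
        <value>verse</value>
        <value>masthead</value>
        <value>bill</value>
        <value>letter</value>
        <!-- ... -->
    </choice>
</attribute>
```

Because the attribute is declared with usage="req", it is not wrapped in an <optional> pattern.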

Note that we can also add documentation to the ODD

<valItem ident="verse">
    <gloss>contains (parts of) a poem</gloss>
</valItem>
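
The effect on an encoded newspaper can be sketched as follows. This is an illustrative fragment with placeholder content, assuming that "editorial" is not among the values elided from the list above.

```xml
<!-- Sketch of an instance checked against the customised schema -->
<body xmlns="http://www.tei-c.org/ns/1.0">
    <!-- valid: "bill" and "article" are both in the closed value list -->
    <div type="bill">
        <div type="article" n="1">
            <p>...</p>
        </div>
    </div>
    <!-- invalid: type is required (usage="req"), so a bare div is rejected -->
    <!-- <div><p>...</p></div> -->
    <!-- invalid, assuming "editorial" is not in the closed value list -->
    <!-- <div type="editorial"><p>...</p></div> -->
</body>
```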

Defining a new element

When defining a new element, we need to consider

The TEI class system helps us answer all these questions (except the first).

The TEI class system

TEI attribute classes: att.global, a very important attribute class

All elements are usually members of att.global; this class provides, among others:
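
As a brief illustration, assuming the class meant here is att.global: attributes such as @xml:id, @xml:lang, @n and @rend come from it or from its subclasses, so they can appear on virtually any element. The values below are invented.

```xml
<!-- Sketch: attribute values are invented for illustration -->
<p xmlns="http://www.tei-c.org/ns/1.0"
    xml:id="p_1" xml:lang="ar" n="1" rend="bold">...</p>
```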

Model Classes

Basic model class structure

Simplifying wildly, one may say that the TEI recognises three kinds of element:

There are ‘base model classes’ corresponding with each of these, and also with the following groupings:

And yes, there is a class for elements that can appear anywhere inside a text — at any hierarchic level.


Defining a new element

Roma: Defining a new element

Screen shot of Roma

Defining a content model

Roma: Defining a new element 2

What did we just do?

We added a new element specification to our ODD, like this:

<elementSpec ident="something" mode="add" ns="">
    <desc>contains something division-like.</desc>
    <classes>
        <memberOf key="model.divPart"/>
        <memberOf key="att.typed"/>
    </classes>
    <content>
        <rng:ref name="someThing"/>
        <rng:ref name="model.pLike"/>
    </content>
</elementSpec>

Note that this new element is not in the TEI namespace. It belongs to this specific project only!
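
In an encoded document the new element might then be used as sketched below. The namespace URI and the np prefix are hypothetical stand-ins for whatever the project declares in @ns, and the fragment assumes that the content model permits paragraphs.

```xml
<!-- Sketch: the np namespace URI is a hypothetical project namespace -->
<div xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:np="http://example.org/ns/newspapers"
    type="article" n="1">
    <p>...</p>
    <!-- allowed here through membership of model.divPart; -->
    <!-- the type attribute comes from att.typed -->
    <np:something type="note">
        <p>...</p>
    </np:something>
</div>
```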

Other kinds of constraints

Schematron constraints

An element specification can also contain a <constraintSpec> element, which holds rules about its content expressed as ISO Schematron constraints.

<elementSpec ident="div" mode="change" module="textstructure" xmlns:s="http://purl.oclc.org/dsdl/schematron">
    <constraintSpec ident="div" scheme="isoschematron">
        <constraint><s:assert test="not(@type='bill') or .//tei:div[@type='article']">a bill must include at least one article</s:assert></constraint>
    </constraintSpec>
</elementSpec>

However...

- You can only add such rules by editing your ODD file: Roma doesn't know about them.
- Not all schema languages can implement these constraints.
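
The newspaper constraints mentioned earlier (a bill made up of sections and articles, with numbered articles) could be written along these lines. This is a sketch rather than a tested rule set: the constraint identifier is made up, and the first rule only checks the child divisions of a bill, not its other content.

```xml
<elementSpec ident="div" mode="change" module="textstructure"
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:s="http://purl.oclc.org/dsdl/schematron">
    <!-- the ident "bill-children" is a hypothetical name -->
    <constraintSpec ident="bill-children" scheme="isoschematron">
        <constraint>
            <!-- bind the tei: prefix used in the XPath expressions below -->
            <s:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
            <s:rule context="tei:div[@type='bill']">
                <!-- every child div of a bill must be a section or an article -->
                <s:assert test="not(tei:div[not(@type='section' or @type='article')])">
                    a bill may contain only section and article divisions
                </s:assert>
            </s:rule>
            <s:rule context="tei:div[@type='article']">
                <!-- articles must carry a number in @n -->
                <s:assert test="@n">an article must be numbered with @n</s:assert>
            </s:rule>
        </constraint>
    </constraintSpec>
</elementSpec>
```

Using a rule with an explicit context keeps each assertion from firing on divisions it was never meant to test.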