Tei@DHSI 1 — Introduction to Markup, XML, and the TEI

Till Grallert

2 March 2015

Introduction to Markup, XML, and TEI

The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the example of Arabic newspapers.

Slides were produced using MultiMarkdown, Pandoc, Slidy JS, and the Snippet jQuery syntax highlighter.

Textual Markup

In order to talk about texts, markup and encoding of texts, we need to understand what we mean by these basic concepts.

When we talk about text encoding, what do we mean by a text? What is in a text, and what assumptions do we make in reading it?

What is a text?

Ḳānūn-i Esāsī in al-Bashīr, 3 August 1908


Ḳānūn-i Esāsī in Thamarāt al-Funūn, 27 July 1908


Ḳānūn-i Esāsī in Lisān al-Ḥāl, 27 July 1908


Ḳānūn-i Esāsī in al-Jinān, 15 January 1877


Ḳānūn-i Esāsī, 1876


Ḳānūn-i Esāsī in Sālnāme-yi Vilāyet-i Sūriye #24, 1892


Three Arabic translations of the Ottoman constitution in a digital parallel edition


Original and latinised text of the Ottoman constitution in a digital parallel edition with an Arabic translation


A text is not a document

Where is the text?

TEI’s definition:

Encoding of texts

Only that which is explicit can be reliably found again and displayed

What is the point of markup?

We don’t have to be limited to the view of one editor or consumer

Styles of markup

Some more definitions

Separation of form and content

Markup as scholarly activity

Compare markup

Example 1:

<hi rend="dropcap">H</hi>&WYN;ÆT WE GARDE <lb/>na in gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas <lb/>ellen fremedon. oft scyld scefing sceaþe <add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl <add>a</add>
<lb/>of<damage>
<desc>blot</desc> </damage>teah ...

Example 2:

<lg>
    <l>Hwæt! we Gar-dena in gear-dagum</l>
    <l>þeod-cyninga þrym gefrunon,</l>
    <l>hu ða æþelingas ellen fremedon,</l>
</lg> 
<lg>
    <l>Oft Scyld Scefing sceaþena þreatum,</l>
    <l>monegum mægþum meodo-setla ofteah;</l>
    <l>egsode Eorle, syððan ærest wearþ</l>
    <l>feasceaft funden...</l>
</lg>
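
One way to make the difference between the two encodings concrete is to ask each of them the same questions. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree. The wrapper elements <ab> and <div>, the helper count(), and the mapping of the &WYN; entity to the character U+01F7 (capital wynn) are assumptions added here so that the two fragments can be parsed on their own.

```python
import xml.etree.ElementTree as ET

# Example 1, wrapped in an assumed <ab> root. The internal DTD subset declares
# the WYN entity (assumed to stand for capital wynn, U+01F7) so the parser can
# resolve it.
EXAMPLE_1 = """<!DOCTYPE ab [<!ENTITY WYN "&#x1F7;">]>
<ab><hi rend="dropcap">H</hi>&WYN;ÆT WE GARDE <lb/>na in gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas <lb/>ellen fremedon. oft scyld scefing sceaþe <add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl <add>a</add>
<lb/>of<damage>
<desc>blot</desc> </damage>teah ...</ab>"""

# Example 2, wrapped in an assumed <div> root because a document needs one root.
EXAMPLE_2 = """<div>
<lg>
    <l>Hwæt! we Gar-dena in gear-dagum</l>
    <l>þeod-cyninga þrym gefrunon,</l>
    <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
    <l>Oft Scyld Scefing sceaþena þreatum,</l>
    <l>monegum mægþum meodo-setla ofteah;</l>
    <l>egsode Eorle, syððan ærest wearþ</l>
    <l>feasceaft funden...</l>
</lg>
</div>"""


def count(xml_text: str, name: str) -> int:
    """Count the elements called `name` anywhere in the fragment."""
    return sum(1 for _ in ET.fromstring(xml_text).iter(name))


# Ask both encodings about manuscript features and about verse structure.
for name in ("lb", "damage", "add", "l", "lg"):
    print(name, count(EXAMPLE_1, name), count(EXAMPLE_2, name))
```

Example 1 can answer questions about the manuscript page (five line breaks, two additions, one damaged spot) but knows nothing about verse lines. Example 2 knows seven verse lines in two line groups but nothing about the physical page.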

A useful mental exercise

Imagine you are going to mark up several thousand pages of complex material…

Now, imagine your budget has been halved. Repeat the exercise!

Some alphabet soup

Abbreviation  Expansion
SGML          Standard Generalized Markup Language
HTML          Hypertext Markup Language
W3C           World Wide Web Consortium
XML           eXtensible Markup Language
DTD           Document Type Definition (or Declaration)
CSS           Cascading Style Sheets
XPath         XML Path Language
XSLT          eXtensible Stylesheet Language Transformations
XQuery        XML Query Language
RELAX NG      Regular Language for XML, Next Generation

… and then there’s also TEI, the Text Encoding Initiative

XML

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML now also plays an indispensable role in the exchange of a wide variety of data on the Web and elsewhere.

XML: what it is and why you should care

XML terminology 1

An XML document may contain:

XML terminology 2

The rules of the XML Game

Representing an XML tree

Parts of a real XML document

<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
    <hello type="enthusiastic">hello world!</hello>
</greetings>
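
To see how software reads these parts, the sketch below parses the document above with Python's standard-library xml.etree.ElementTree. The variable names are illustrative only. Note that ElementTree reports each element name together with its namespace, in the form {namespace}name.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
    <hello type="enthusiastic">hello world!</hello>
</greetings>"""

root = ET.fromstring(DOC)   # the root element <greetings>
hello = root[0]             # its only child element <hello>

print(root.tag)      # element name, qualified by the namespace
print(hello.tag)     # the child inherits the default namespace
print(hello.attrib)  # attributes as a dictionary of name/value pairs
print(hello.text)    # the text content of the element
```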

The XML declaration

An XML document should begin with an XML declaration (strictly required only in XML 1.1, but recommended in XML 1.0), which does three things:

Example:

<?xml version="1.0" ?>
<?xml version="1.0" encoding="iso-8859-1" ?>
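
The practical effect of the encoding declaration can be illustrated with a short sketch in Python (standard library only). The sample word and the helper read_text() are assumptions made for this illustration: the same element is stored as ISO-8859-1 bytes with and without a matching declaration, and once as UTF-8, which is the default when no encoding is declared.

```python
import xml.etree.ElementTree as ET
from typing import Optional

WORD = "þrym"  # contains a non-ASCII character

# ISO-8859-1 bytes that say so in the declaration.
declared = ('<?xml version="1.0" encoding="iso-8859-1" ?><w>' + WORD + "</w>").encode("iso-8859-1")
# The same ISO-8859-1 bytes, but the declaration does not mention the encoding.
undeclared = ('<?xml version="1.0" ?><w>' + WORD + "</w>").encode("iso-8859-1")
# UTF-8 bytes need no encoding declaration because UTF-8 is the default.
utf8 = ('<?xml version="1.0" ?><w>' + WORD + "</w>").encode("utf-8")


def read_text(raw: bytes) -> Optional[str]:
    """Return the text of the root element, or None if the bytes cannot be parsed."""
    try:
        return ET.fromstring(raw).text
    except ET.ParseError:
        return None


print(read_text(declared))
print(read_text(undeclared))
print(read_text(utf8))
```

The undeclared ISO-8859-1 file is rejected because the parser assumes UTF-8 and the byte used for þ is not valid UTF-8. Declaring the encoding, or saving the file as UTF-8, avoids the problem.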

Declaring namespaces

All TEI elements are declared in the TEI namespace. A namespace is a way of distinguishing one set of elements from another that uses the same names (like <p>):

<TEI xmlns="http://www.tei-c.org/ns/1.0"> ... </TEI>

XML documents can include elements declared in different namespaces.

Example:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:math="http://www.mathml.org">
<p>...
    <math:expr>...</math:expr>
    ...</p>
</TEI>

The xml namespace is used by the TEI for global attributes @xml:id and @xml:lang
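
What namespaces mean for software can be sketched in a few lines of Python (standard library only). The sample document is a filled-in variant of the example above, using the same namespace URIs as the slide; the attribute values, the prefix map NS, and the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:math="http://www.mathml.org">
<p xml:id="p1" xml:lang="en">some text
    <math:expr>x</math:expr>
</p>
</TEI>"""

# Prefixes used in our queries; they need not match the prefixes in the file.
NS = {"tei": "http://www.tei-c.org/ns/1.0", "math": "http://www.mathml.org"}
# The namespace that the reserved xml: prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

root = ET.fromstring(DOC)
p = root.find("tei:p", NS)
expr = p.find("math:expr", NS)

print(root.find("p"))  # None: a <p> in no namespace is a different element
print(p.tag)           # the TEI <p>
print(expr.tag)        # an element from the second namespace
print(p.get("{" + XML_NS + "}id"), p.get("{" + XML_NS + "}lang"))
```

The point of the exercise is that an element is identified by its namespace and its name together, which is why a query for a plain p finds nothing in a TEI document.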

Example: Thamarāt al-Funūn #1683, 27 July 1908

<?xml version="1.0" encoding="UTF-8"?>
<div xmlns="http://www.tei-c.org/ns/1.0" type="article" xml:lang="ar">
    <head xml:lang="ar">الصدارة العظمى</head>
    <p>بشرة الانباء البرقية بصدور الارادة 
    <lb/> السنية السلطانية باسناد مسند الصدارة 
    <lb/>العظمى الى عهدة الوزير الخطير <persName> حضرة 
    <lb/> صاحب الفخامة الدولة سعيد باشا</persName>الصدر 
    <lb/> الأعظم الأسبق، فسر الكل بهذا التوجيه 
    <lb/> الوجيه لما عرف به الصدر المشار اليه من 
    <lb/> التفاني في خدمة الجناب العالي السلطاني 
    <lb/> بنبالة قصد وسعة علم مع اقتدار باهر 
    <lb/> واخلاص عظيم فنرفع لفخامته فروض 
    <lb/> التهاني والتبريك ونضرع الى المولى المتعال 
    <lb/> ان يقرن اموره بالتوفيق وتنفيذ نيات حضرة 
    <lb/> مولانا الخليفة الاعظم المنصرف في خير 
    <lb/> العباد وعمران البلاد،</p>
    <p>وهذا تعريب التلغراف السامي الوارد 
    <lb/> من فخامته الى مقام الولاية الجليلة:</p>
    <p>
        <quote>سنحت عواطف الحضرة العلية 
        <lb/> السلطانية بتوجيه خدمة الصدارة هذه 
        <lb/> المرة ايضاً على عهدة هذا المثنى . ان 
        <lb/> وظيفة مأموري المملكية الاساسية هي المحافظة
        <lb/> على الامن والراحة وحسن رؤْية المصالح 
        <lb/> ووظائف الادارة العدلية والحكام هي
        <lb/> اجراء العدالة في الحقوق العمومية والشخصية 
        <lb/> ضمن دائرة القوانين العدلية الموضوعة مع 
        <lb/> قيام كل مأمور بوظيفة مأموريته بكمال 
        <lb/> العفة والاستقامة لهذا نخطركم ان الذين 
        <lb/> يقومون بالخدمات الصحيحة الصادقة تكون 
        <lb/> مساعيهم مظهراً لشرف تقدير <rs ref="#pers-1"> حضرة 
        <lb/> صاحب الخلافة العظمى</rs>كما ان من يخالف 
        <lb/>ذلك يقع بالطبع تحت ظائلة المسؤلية 
        <lb/> في <date calendar="#cal_ottomanfiscal"
                datingMethod="#cal_ottomanfiscal" when-custom="1324-05-09">٩ تموز سنة
            ٣٢٤</date></quote>
    </p>
</div>

Example deconstructed: root node

<?xml version="1.0" encoding="UTF-8"?>
<div type="article" xml:lang="ar">
<!-- ... -->
</div>

Example deconstructed: head

<head xml:lang="ar">الصدارة العظمى</head>

Example deconstructed: paragraph, quote, and date

<p>
    <quote>سنحت عواطف الحضرة العلية 
    <lb/> السلطانية بتوجيه خدمة الصدارة هذه 
    <lb/> المرة ايضاً على عهدة هذا المثنى . ان 
    <lb/> وظيفة مأموري المملكية الاساسية هي المحافظة
    <lb/> على الامن والراحة وحسن رؤْية المصالح 
    <lb/> ووظائف الادارة العدلية والحكام هي
    <lb/> اجراء العدالة في الحقوق العمومية والشخصية 
    <lb/> ضمن دائرة القوانين العدلية الموضوعة مع 
    <lb/> قيام كل مأمور بوظيفة مأموريته بكمال 
    <lb/> العفة والاستقامة لهذا نخطركم ان الذين 
    <lb/> يقومون بالخدمات الصحيحة الصادقة تكون 
    <lb/> مساعيهم مظهراً لشرف تقدير <rs ref="#pers-1"> حضرة 
    <lb/> صاحب الخلافة العظمى</rs>كما ان من يخالف 
    <lb/>ذلك يقع بالطبع تحت ظائلة المسؤلية 
    <lb/> في <date calendar="#cal_ottomanfiscal"
            datingMethod="#cal_ottomanfiscal" when-custom="1324-05-09">٩ تموز سنة
        ٣٢٤</date></quote>
</p>
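
Because the reference and the date are explicit in the markup, they can be retrieved mechanically. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree on a shortened excerpt of the paragraph above. The excerpt boundaries, the TEI namespace declaration on <p>, and the helper plain_text() are assumptions added for this illustration.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened excerpt of the quoted telegram; the namespace is declared on <p>
# so that the fragment can be parsed on its own.
EXCERPT = """<p xmlns="http://www.tei-c.org/ns/1.0">
    <quote>مساعيهم مظهراً لشرف تقدير <rs ref="#pers-1"> حضرة
    <lb/> صاحب الخلافة العظمى</rs>كما ان من يخالف
    <lb/>ذلك يقع بالطبع تحت ظائلة المسؤلية
    <lb/> في <date calendar="#cal_ottomanfiscal"
            datingMethod="#cal_ottomanfiscal" when-custom="1324-05-09">٩ تموز سنة
        ٣٢٤</date></quote>
</p>"""


def plain_text(element: ET.Element) -> str:
    """Join all text inside an element and collapse line breaks and spaces."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(EXCERPT)
rs = root.find(".//tei:rs", NS)      # the referring string pointing to a person
date = root.find(".//tei:date", NS)  # the date in the Ottoman fiscal calendar

print(rs.get("ref"), plain_text(rs))
print(date.get("when-custom"), date.get("calendar"), plain_text(date))
```

The printed text of <rs> runs across a line break in the newspaper column, yet the markup lets us treat it as one unit and follow its @ref pointer. The normalised value in @when-custom is available without having to interpret the Arabic date string.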

XML syntax: the small print

What does it mean to be well-formed?

  1. There is a single root node containing the whole of an XML document
  2. Each subtree is properly nested within the root node
  3. Element/attribute/etc. names are always case sensitive
  4. Start-tags and end-tags are always mandatory (except where an empty element is written as a combined start-and-end tag, e.g. <pb/>)
  5. Attribute values are always quoted

A file can be valid in addition to being well-formed. This means that it obeys the rules of a specified schema, such as one of the TEI schemas.

Test your XML knowledge

Which are correct?

 <seg>some text</seg>
 <seg> <foo>some</foo> <bar>text</bar> </seg>
 <seg> <foo>some <bar></foo> text</bar> </seg>
 <seg type="text">some text</seg>
 <seg type='text'>some text</seg>
 <seg type=text>some text</seg>
 <seg type="text"> some text <seg/>
 <seg type="text"> some text<gap/> </seg>
 <seg type="text">some text</Seg>
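
Answers can be checked by handing each candidate to an XML parser, which refuses anything that is not well-formed. The following is a minimal sketch in Python (standard library only); the helper is_well_formed() and the list name CANDIDATES are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

CANDIDATES = [
    '<seg>some text</seg>',
    '<seg> <foo>some</foo> <bar>text</bar> </seg>',
    '<seg> <foo>some <bar></foo> text</bar> </seg>',
    '<seg type="text">some text</seg>',
    "<seg type='text'>some text</seg>",
    '<seg type=text>some text</seg>',
    '<seg type="text"> some text <seg/>',
    '<seg type="text"> some text<gap/> </seg>',
    '<seg type="text">some text</Seg>',
]


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


for candidate in CANDIDATES:
    verdict = "well-formed" if is_well_formed(candidate) else "NOT well-formed"
    print(f"{verdict:16}{candidate}")
```

This tests well-formedness only. Whether elements such as foo or bar are allowed at all is a question of validity, which requires a schema.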

XML is an international standard

(The @xml:id attribute is another W3C-defined attribute.)

The TEI

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts chiefly in the humanities, social sciences and linguistics.

1987 was a long time ago…

The Text Encoding Initiative was born into a very different world

…but also faced familiar problems

The birth of the Text Encoding Initiative

TEI is old!

Why the TEI

The TEI provides

Relevance

Why would you want those things?

The scope of intelligent markup

Even within the original scope of the TEI we have

Reasons for attempting to define a common framework

The TEI was designed to support multiple views of the same resource. The TEI is an evolving model of the concerns of Digital Humanities.

TEI adopted XML

In 2002, the TEI Consortium published the P4 Guidelines, which were essentially an adaptation of P3 to XML, which had been finalised as a W3C Recommendation in 1998.

P5, a complete overhaul of the Guidelines, was published in November 2007. Since then, updates have been published roughly twice a year.

The Guidelines are currently maintained as an open source project on the SourceForge site http://tei.sf.net/, from which released and development versions may be freely downloaded.

TEI XML

Note: namespaces vs schemas

Conformance issues

A document is TEI Conformant if and only if it:

or if it can be transformed automatically using some TEI-defined procedures into such a document (it is then considered TEI-conformable).

A final note on standardization

Standardization should not mean “Do what I do”, but rather “Explain what you do in terms I can understand”.

Instead of an abstract set of rules and norms, standardisation should be thought of as a community of practice.