Tei@DHSI 2 — TEI core module

Till Grallert

2 March 2015

TEI core module: Introducing structural markup

The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the example of Arabic newspapers.

Slides were produced using MultiMarkDown, Pandoc, Slidy JS, and the Snippet jQuery Syntax highlighter.

al-Iqbāl

Front page of al-Iqbāl #257, 27 July 1908


al-Bashīr

Front page of al-Bashīr #1868, 27 July 1908


Front page of al-Bashīr, 3 August 1908


Lisān al-Ḥāl

Front page of Lisān al-Ḥāl #5773, 27 July 1908


Thamarāt al-Funūn

Front page of Thamarāt al-Funūn #1683, 27 July 1908


Looking at the material, what do we need to mark up?

The document structure

All TEI documents share the same basic structure. This section describes its main variations as briefly as possible.

Structure of a TEI Document

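There are two basic types of TEI document: a single text with its header (<TEI>) and a collection of such documents sharing a common header (<teiCorpus>).

The text may be in the form of a transcription (<text>), a set of page images (<facsimile>), or a non-interpretative transcription of the source document (<sourceDoc>).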

TEI basic structure

<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- required -->
    </teiHeader>
    <facsimile>
        <!-- optional-->
    </facsimile>
    <sourceDoc>
        <!-- optional -->
    </sourceDoc>
    <text>
        <!-- required if no facsimile or sourceDoc-->
    </text>
</TEI>

TEI basic structure 2

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- required -->
    </teiHeader>
    <TEI>
        <!-- required -->
    </TEI>
    <!-- more <TEI> elements -->
</teiCorpus>

The <text> element

What is a text? (remember that one?)

TEI text structure 1

A simple document:

<text>
    <front>
        <!-- optional -->
    </front>
    <body>
        <!-- required -->
    </body>
    <back>
        <!-- optional -->
    </back>
</text>

Macrostructure: composite texts

Newspaper issues are usually grouped into volumes (or years). If we consider them as a single composite text, we could treat each issue as a <div> within it. Or (even better) we could use the <group> element:

<text n="35" xml:id="v35" xml:lang="ar">
    <front>
        <!-- some introductory material for the current volume -->
    </front>
    <group>
        <text n="1869" xml:id="v35-i1869" xml:lang="ar">
            <front>
                <!-- the masthead of issue 1869 -->
            </front>
            <body>
                <!-- issue 1869 -->
            </body>
        </text>
        <text n="1870" xml:id="v35-i1870" xml:lang="ar">
            <front>
                <!-- the masthead of issue 1870 -->
            </front>
            <body>
                <!-- issue 1870 -->
            </body>
        </text>
    </group>
    <back>
        <!-- volume index, appendices etc. -->
    </back>
</text>

The high level structure

Each identifiable division within <text> is a <div> element. It can optionally be given a particular type (e.g. cartoon, verse, prose) using the @type attribute, whose values are free text.

For example, page 1 has two divisions:

<pb n="1"/>
<div type="article">
    <p>....</p> 
</div>
<div type="poem"> 
    <head>Strange Meeting</head> 
    <lg>
        <l>....</l> 
    </lg>
</div>

Why divisions rather than pages

Because a division can start on one page and finish on another, or cross other physical boundaries.

We use an empty element <pb/> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.

<pb n="5"/>
<div type="article">
    <p>...</p> 
</div>
<div type="poem">
    <head>Strange Meeting</head>
    <lg> ...
    <pb n="6"/>
    ...
    </lg>
</div>
<div type="article">
    <p>...</p>
</div>

Divisions can contain divisions …

<div type="postcard">
    <div type="postmark">
        <div type="advert">
            <ab>BUY NATIONAL <lb/>WAR BONDS</ab>
        </div>
        <div type="dateStamp">
            <dateline>
                <placeName>SCARBOROUGH</placeName>
                <lb/>
                <time>6.30 PM</time>
                <lb/>
            </dateline>
        </div>
        <div type="advert">
            <ab>BUY NATIONAL <lb/>WAR BONDS</ab>
        </div>
    </div>
    <div type="address">
        <!-- <address> here -->
    </div>
    <div type="prose">
        <!-- text here -->
    </div>
</div>

More about divisions

Tessellation

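<div>s must tessellate over the entire text: once a division contains a sub-division, no loose content such as paragraphs may follow it at the same level.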

<div1>
    <div2>
        <!-- content -->
    </div2>
    <div2>
        <!-- content -->
    </div2>
</div1>

is valid, while

<div1>
    <!-- content -->
    <div2>
        <!-- content -->
    </div2>
    <!-- content -->
</div1>

is not valid!

Divisions may have heads and trailers

<div>
    <head>Preface</head>
    <p>
        <!-- content of the div -->
    </p>
    <trailer>...</trailer>
</div>

Numbered and unnumbered divisions

The level can be made explicit by using ‘numbered’ divs (div1, div2). Opinions vary:

<div1> vs. <div n="1">
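
A minimal sketch of the same two-level hierarchy in both styles; the @type values are illustrative assumptions. With numbered divisions the level is part of the element name:

<div1 type="issue">
    <div2 type="article">
        <p>...</p>
    </div2>
</div1>

With unnumbered divisions the level follows from the nesting alone, and @n is only a label:

<div type="issue" n="1">
    <div type="article" n="1">
        <p>...</p>
    </div>
</div>

The two styles should not be mixed within the same <front>, <body>, or <back>.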

Groups vs floating texts

The <group> element should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes.

<floatingText> contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes.

Floating text example

The <floatingText> element can appear within any division level element in the same way as a paragraph.

<p>She was thus ruminating, when a Gentleman enter'd the Room, the Door being a jar... calling for a Candle, she beg'd a thousand Pardons, engaged him to sit down, and let her know, what had so long conceal'd him from her Correspondence. </p>
<pb n="5"/>
<floatingText>
    <body>
        <head>The Story of <hi>Captain Manly</hi></head>
        <p>
            <!-- Captain Manly's story here -->
        </p>
    </body>
</floatingText>
<pb n="37"/>
<p>The Gentleman having finish'd his Story ... 
    <!-- more -->
</p>

Document order vs. XML order

The order of XML encoding does not necessarily reflect the order of the source document. Compare:

<div type="postcard">
    <div type="address">
        <!-- <address> here -->
    </div>
    <div type="prose">
        <!-- text here -->
    </div>
    <div type="postmark">
        <div type="advert">
            <ab>BUY NATIONAL <lb/>WAR BONDS</ab>
        </div>
        <div type="dateStamp">
            <dateline>
                <placeName>SCARBOROUGH</placeName>
                <lb/><time>6.30 PM</time>
                <lb/>
            </dateline>
        </div>
        <div type="advert">
            <ab>BUY NATIONAL <lb/>WAR BONDS</ab>
        </div>
    </div>
</div>

Core elements

The core module of the TEI groups together elements which may appear in any kind of text and the tags used to mark them in all TEI documents. This includes:

Paragraphs

<p>: paragraph; marks paragraphs in prose

Example

<p>ترجمة التلغراف السامي الوارد من مقام الصدارة العظمى
<lb/><quote>صدرت ارادة حضرة صاحب الخلافة العظمى
    <lb/>بان يدعى الى الاجتماع مجلس المبعوثان المبينة كيفية
    <lb/>تشكيله في القانون الاساسي الذي هو من تأسيس
    <lb/>حضرة الخليفة الاعظم وبما انه ابلغ حكم هذه الارادة
    <lb/>السنية الجليل الى جميع الولايات الشاهانية
    <lb/>المتصرفيات غير الملحقة فعليكم باجراء انتخاب اعضاء
    <lb/>حائزين الضفات المندرجة في <rs>القانون المذكور</rs></quote>
    في <date>١٠ تموز سنة ١٣٢٤</date></p>

Highlighting

By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings. For words and phrases which are:

Example: Highlighting

Example

<calendar xml:id="cal_islamic">
    <p>Islamic <hi rend="italics">hijrī</hi> calendar: lunar calendar beginning the year with 1 <hi rend="italics">Muḥarram</hi>. Dates differ between locations as the beginning of the month is based on sightings of the new moon.</p>
    <p>E.g. 
        <date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1841-05-23" when-custom="1257-04-01">1 Rab II 1257, Sunday</date>,
        <date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1908-03-05" when-custom="1326-02-01">1 Ṣaf 1326, Thursday</date>.
    </p>
</calendar>

Quotation

Quotation marks can be used to set off text for many reasons, so the TEI has the following elements:

Example

 <quote>
    <bibl>ترجمة التلغراف السامي الوارد من مقام الصدارة العظمى</bibl>
    <lb/>صدرت ارادة حضرة صاحب الخلافة العظمى
    <lb/>بان يدعى الى الاجتماع مجلس المبعوثان المبينة كيفية
    <lb/>تشكيله في القانون الاساسي الذي هو من تأسيس
    <lb/>حضرة الخليفة الاعظم وبما انه ابلغ حكم هذه الارادة
    <lb/>السنية الجليل الى جميع الولايات الشاهانية
    <lb/>المتصرفيات غير الملحقة فعليكم باجراء انتخاب اعضاء
    <lb/>حائزين الضفات المندرجة في <rs>القانون المذكور</rs>
    <bibl>في <date>١٠ تموز سنة ١٣٢٤</date></bibl>
</quote>

Lists

Example: simple list

<p>
    <hi>To which is added,</hi> A Collection of LETTERS of Friendship, and other Occasional LETTERS, written by
    <list>
        <item>Mr. <hi>Dryden,</hi></item>
        <item>Mr. <hi>Wycherly,</hi></item>
        <item>Mr.—</item>
        <item>Mr. <hi>Congreve,</hi></item>
        <item>Mr. <hi>Dennis,</hi> and other Hands.</item>
    </list>
</p>

Notes

Example:

<note place="foot">Painted by <persName>John Singer Sargent</persName>, 1918</note>

Simple editorial changes: <choice> and friends

Example: choice 1

<dateline xml:lang="ar">
    <date calendar="#cal_julian" datingMethod="#cal_julian" when="1908-07-27" when-custom="1908-07-14">١٤ تموز 
        <choice>
            <abbr>ش</abbr>
            <expan>شرقي</expan>
        </choice>
    </date>
    <date>٢٧ 
        <choice>
            <abbr>غ</abbr>
            <expan>غربي</expan>
        </choice>سنة ١٩٠٨</date>
</dateline>

Example: choice 2

Consider: “Excuse me sir, but would you like to buy a nice little dawg?”

We can:

Example:

...a nice little <choice><orig>dawg</orig><reg>dog</reg></choice>?
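
The same pattern can record an apparent error alongside its correction, using the core elements <sic> and <corr>. A minimal sketch; the misspelling is invented for illustration:

...a strong bit of <choice><sic>Blanck</sic><corr>Blank</corr></choice> Verse...

As with <orig> and <reg>, both readings stay in the encoding, so a stylesheet can display either one.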

Additions, Deletions, and Omissions

Example: additions, deletions, omissions

<p><add place="left">My </add>
    <del rend="stroked">It's </del>
    <add place="above">
        <del rend="stroked">The </del>
    </add>subject <del rend="stroked">of</del> is War, and the 
    <unclear>pity </unclear>
    of <del rend="stroked">it</del> War. 
    <lb/>The Poetry is in the pity.</p>

Basic names

The @type attribute is useful for categorizing names (<name>) and referring strings (<rs>), and both elements also have @key, @ref, and @nymRef attributes.
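
A minimal sketch of the two elements in use. The sentence and the @type values are invented for illustration and are not taken from the source material:

<p><name type="person">Wilfred Owen</name> posted the card in <name type="place">Scarborough</name> and addressed it to <rs type="person">his mother</rs>.</p>

Here <name> wraps proper nouns, while <rs> wraps a phrase that refers to a person without naming her.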

Addresses
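
A minimal sketch of an address, for instance for the address panel of a postcard. All address details are placeholders invented for illustration. The simplest encoding uses one <addrLine> per line:

<address>
    <addrLine>12 Example Street</addrLine>
    <addrLine>Beirut</addrLine>
</address>

Where the parts need to be distinguished, core elements such as <street>, <postCode>, and <name> can be used instead:

<address>
    <street>12 Example Street</street>
    <name type="city">Beirut</name>
    <postCode>0000</postCode>
</address>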

Basic numbers and measures

Example: numbers and measures

<l>With a <num value="1000">thousand</num> pains that vision's face was grained;</l>
... only <measure type="distance" unit="m" quantity="3218.69">two miles</measure> from the front....

Dates

Example

<date when="1917-07">July 1917.<lb/> Wednesday</date>

Simple Linking

Example

See <ref target="#Section12">section 12 on page 34</ref>.
See <ptr target="#Section12"/>.
The <ref target="http://www.bbc.co.uk/">BBC web site</ref> has a good sports section.

Indexing

Example

<p>Last week I wrote (to order) a strong <lb/>bit of Blank<index>
    <term>Verse</term>
    <index>
        <term>Blank Verse</term>
    </index>
</index>:</p>

Graphics

Example

<div type="article" xml:lang="ar">
    <head>تعريب الفرمان العالي السلطاني</head>
    <figure>
        <graphic url="#facs-2-1-z-1"/>
        <head xml:lang="en">The Ottoman Tughra</head>
        <figDesc>Reproduction of the Ottoman coat of arms / Sultanic seal</figDesc>
    </figure>
    <q>افتخار الاعلام والاعظام مختار الاكابر والافخم مستجمع جميع المعالي</q>
</div>
Ṭughrā at the head of the Qānūn al-Asāsī in Thamarāt al-Funūn, 27 July 1908


Simple verse

<lg type="stanza">
    <l>It seemed that out of battle I escaped</l>
    <l>Down some profound dull tunnel, long since scooped</l>
    <l>Through granites which titanic wars had groined.</l>
</lg>
<lg type="stanza">
    <l>Yet also there encumbered sleepers groaned, </l>
    <l>Too fast in thought or death to be bestirred. </l>
    <l>Then, as I probed them, one sprang up, and stared </l>
    <l>With piteous recognition in fixed eyes, </l>
    <l>Lifting distressful hands, as if to bless. </l>
    <l>And by his smile, I knew that sullen hall,--- </l>
    <l>By his dead smile I knew we stood in Hell.</l>
</lg>