Tei@DHSI 4 — Metadata and the teiHeader

Till Grallert

2 Jun 2015

Topic: TEI Metadata

The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the needs of the 2015 Introduction to TEI at DHSI.

Slides were produced using MultiMarkDown, Pandoc, Slidy JS, and the Snippet jQuery Syntax highlighter.

What is metadata?

General purposes of metadata

TEI metadata

TEI requires metadata to be stored inside the XML document, prefixed to the content. This information makes up the TEI header, although, as we will see, some of it can also be recorded inside the <body>.
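
A minimal skeleton may help to picture where the header sits. It assumes a single-text document in the TEI namespace; the comments stand in for required content and it is not meant to validate as it stands:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- metadata about the electronic file and its source -->
    </teiHeader>
    <text>
        <body>
            <!-- the transcribed content -->
        </body>
    </text>
</TEI>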

The TEI header: <teiHeader>

The TEI header was designed with two goals in mind

The result is that discussion of the header tends to be pulled in two directions…

The librarian’s header

Everywoman’s header

TEI header structure

The TEI header has four main components: <fileDesc>, <encodingDesc>, <profileDesc>, and <revisionDesc>.

Note: Only <fileDesc> is required; the others are optional.

Example header: minimal required header

<teiHeader>
    <fileDesc>
        <titleStmt>
            <title>A title?</title>
        </titleStmt>
        <publicationStmt>
            <p>Who published?</p>
        </publicationStmt>
        <sourceDesc>
            <p>Where from?</p>
        </sourceDesc>
    </fileDesc>
</teiHeader>

Two levels of TEI headers

Corpus header example

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- some metadata relating to the file in which the letters etc. are kept -->
        <fileDesc>
            <titleStmt>
                <title>PRO FO 618/3, Despatches to Constantinople, quarterly report, Hedjaz railway, reports on local politics etc., 1908</title>
            </titleStmt>
            <!-- ... -->
        </fileDesc>
        <!-- ... -->
    </teiHeader>
    <TEI xml:id="ProFo_618-3_Damascus_19081001_1" xml:lang="en">
        <teiHeader>
            <!-- metadata relating to the individual letter etc. -->
        </teiHeader>
        <facsimile>
            <!-- links to image files -->
        </facsimile>
        <text>
            <!-- transcription of the document -->
        </text>
    </TEI>
    <!-- more <TEI> elements -->
</teiCorpus>

Types of content in the TEI header

Example source: Kawkab America #55, 28 Apr 1893

Kawkab America #55, 28 Apr 1893, p.1 (English)

Kawkab America #55, 28 Apr 1893, p.1 (Arabic)

Example header: minimal required header

<teiHeader xml:lang="en">
    <fileDesc>
        <titleStmt>
            <title level="j">Kawkab America</title>
            <title type="sub">Digital Edition</title>
        </titleStmt>
        <publicationStmt>
            <p>Unpublished example edition for the DHSI 2015</p>
        </publicationStmt>
        <sourceDesc>
            <p>Transcribed from digitised microfilm copies provided by the Center for Research Libraries</p>
        </sourceDesc>
    </fileDesc>
</teiHeader>

File description <fileDesc> 1

File description <fileDesc> 2

Title and responsibility statements

Within <titleStmt>, you can repeat any of these elements as necessary, and document additional responsibilities with a generic <respStmt>:

<titleStmt xml:lang="en">
    <title level="j">Kawkab America</title>
    <title type="sub">Digital Edition</title>
    <editor><persName>Dr. A. Arbeely</persName></editor>
    <editor><persName>N.J. Arbeely</persName></editor>
    <respStmt>
        <resp>created the TEI files</resp>
        <name>Till Grallert</name>
    </respStmt>
</titleStmt>

Edition and extent statements
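
A minimal sketch of what these two statements might look like inside <fileDesc>, where they sit between <titleStmt> and <publicationStmt>. The edition number, date, and file size are invented for illustration:

<!-- hypothetical values, for illustration only -->
<editionStmt>
    <edition n="0.1">Draft edition, <date when="2015-06">June 2015</date></edition>
</editionStmt>
<extent>1 XML file, about 50 KB</extent>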

The publication statement <publicationStmt>

Example: publication statement

<publicationStmt>
    <publisher>UVic</publisher>
    <distributor>Digital Humanities Summer Institute</distributor>
    <authority>Till Grallert</authority>
    <pubPlace>Victoria, BC, Canada</pubPlace>
    <date from="2015-06-01" to="2015-06-05">1-5 June 2015</date>
    <availability>
        <licence>Licensed with a <ref target="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution</ref> licence.</licence>
    </availability>
</publicationStmt>

Series statement: <seriesStmt>

These include

Example: series statement

<seriesStmt>
    <title level="s">Machine-Readable Texts for the Study of Indian Literature</title>
    <respStmt>
        <resp>ed. by</resp>
        <name>Jan Gonda</name>
    </respStmt>
    <biblScope unit="vol">1.2</biblScope>
    <idno type="ISSN">0 345 6789</idno>
</seriesStmt>

Notes statement: <notesStmt>

The optional <notesStmt> can contain notes on almost any aspect of the file or its contents:

<notesStmt>
    <note>Transcribed for DHSI 2015 TEI Workshop</note>
</notesStmt>

These notes can be short statements or many paragraphs long. Where more precise elements are available elsewhere in the TEI header, use those instead. For example, text types, such as reportage or detective stories, should be described under <profileDesc>.

The source description statement: <sourceDesc>

All electronic works need to document their source, even ‘born digital’ ones! There is a variety of elements you may draw from:

Example 1: Kawkab America #55, 28 Apr 1893

Kawkab America #55, 28 Apr 1893, p.1 (English)

Kawkab America #55, 28 Apr 1893, p.1 (Arabic)

Example 1: <sourceDesc>

<sourceDesc>
    <biblStruct>
       <monogr>
          <title level="j" xml:lang="ar-Latn-x-ijmes">Kawkab Amīrkā</title>
          <title level="j" xml:lang="en">The Star of America</title>
          <title level="j" type="alternative" xml:lang="en">Kawkab America</title>
          <title level="j" type="sub" xml:lang="ar-Latn-x-ijmes">Jarīda siyāsiyya ʿilmiyya tijāriyya adabiyya</title>
          <editor xml:lang="ar-Latn-x-ijmes">
            <persName><addName type="title">Dr.</addName> <forename>Ibrāhīm</forename> <surname>Arbīlī</surname></persName></editor>
          <editor xml:lang="ar-Latn-x-ijmes">
            <persName><forename>Najīb</forename> <forename>Yūsuf</forename> <surname>Arbīlī</surname></persName></editor>
          <imprint>
             <publisher xml:lang="ar-Latn-x-ijmes">al-Maṭbaʿat al-Sharqiyya</publisher>
             <pubPlace xml:lang="ar-Latn-x-ijmes">Niyū Yūrk</pubPlace>
             <publisher xml:lang="en">The Oriental Publishing House</publisher>
             <pubPlace xml:lang="en">New York</pubPlace>
             <date notAfter="1894-04-14" notBefore="1893-04-15" xml:lang="en">1893-1894</date>
             <biblScope unit="volume">2</biblScope>
          </imprint>
       </monogr>
       <idno type="callNumber"><!-- ideally this source should have a referenceable identifier --></idno>
    </biblStruct>
</sourceDesc>
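
For a born-digital text, by contrast, the source description can simply say so in prose. A minimal sketch, with wording that is only a suggestion:

<sourceDesc>
    <!-- the element is still required, even though there is no earlier source -->
    <p>Born digital: no previous source exists.</p>
</sourceDesc>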

Association between header and text

By default, everything asserted by a header is true of the text to which it is prefixed. This can be overridden:

Most components of the encoding description are declarable.
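
One way to picture this is with two alternative <editorialDecl> elements, one marked as the default with @default and one selected from the text with @decls. This is a sketch only: the xml:id values and the prose are hypothetical.

<encodingDesc>
    <!-- applies wherever the text selects nothing else -->
    <editorialDecl xml:id="ed_default" default="true">
        <normalization>
            <p>Spelling has not been normalised.</p>
        </normalization>
    </editorialDecl>
    <!-- an alternative that must be selected explicitly -->
    <editorialDecl xml:id="ed_normalised">
        <normalization>
            <p>Spelling has been normalised to modern usage.</p>
        </normalization>
    </editorialDecl>
</encodingDesc>
<!-- ... later, inside <text> ... -->
<div decls="#ed_normalised">
    <p>A division that follows the second set of editorial practices.</p>
</div>

Divisions without @decls fall back to the declaration marked default="true".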

Encoding description: <encodingDesc>

<encodingDesc> groups notes about the procedures used when the text was encoded, either summarised in prose or within specific elements such as

Detailed notes in <encodingDesc> could be used to generate sections of an editorial description.

Example: <encodingDesc> 1

<encodingDesc xml:lang="en">
    <projectDesc>
        <p>Creation of a digital corpus of Arabic newspapers from Beirut published in the aftermath of the Young Turk Revolution</p>
    </projectDesc>
    <editorialDecl>
        <correction>
            <p>Apparent errors have been marked with <gi>sic</gi> but correct readings are not provided</p>
        </correction>
        <hyphenation>
            <p>Hyphenation is not common to Arabic texts</p>
        </hyphenation>
    </editorialDecl>
</encodingDesc>

Example: <encodingDesc> 2

<encodingDesc>
    <classDecl>
        <taxonomy xml:id="part-of-speech">
            <category xml:id="adje">
                <catDesc>adjectives</catDesc>
                <category xml:id="AJ0">
                    <catDesc>adjective (unmarked) (e.g. GOOD, OLD)</catDesc>
                </category>
                <category xml:id="AJC">
                    <catDesc>comparative adjective (e.g. BETTER, OLDER)</catDesc>
                </category>
                <category xml:id="AJS">
                    <catDesc>superlative adjective (e.g. BEST, OLDEST)</catDesc>
                </category>
            </category>
            <category xml:id="AT0">
                <catDesc>article (e.g. THE, A, AN)</catDesc>
            </category>
            <!-- ... -->
        </taxonomy>
    </classDecl>
</encodingDesc>

The tagging declaration: <tagsDecl>

The <tagsDecl> records the namespaces of the elements used, tag frequencies, information about the usage of particular tags not specified elsewhere, and the default appearance of text in the source.
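
The namespace and frequency information might be recorded roughly as follows. The counts and the usage note are invented for illustration:

<tagsDecl>
    <!-- one <namespace> per namespace used in the text -->
    <namespace name="http://www.tei-c.org/ns/1.0">
        <!-- @occurs: how often the element occurs; @withId: how many occurrences carry an xml:id -->
        <tagUsage gi="hi" occurs="28" withId="2">Marks typographic highlighting; the reason for the highlighting is not recorded.</tagUsage>
        <tagUsage gi="persName" occurs="12"/>
    </namespace>
</tagsDecl>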

The <rendition> element

Consider this example:

<tagsDecl>
    <rendition scheme="css" xml:id="r-center">text-align: center;</rendition>
    <rendition scheme="css" xml:id="r-small">font-size: small;</rendition>
    <rendition scheme="css" xml:id="r-large">font-size: large;</rendition>
</tagsDecl>

which you can easily point to from the text:

<hi rendition="#r-center #r-large">this bit of text was large and centred</hi>

But compare:

<hi rend="large center">this bit of text was large and centred</hi>

The profile description: <profileDesc>

A collection of descriptions, categorised only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:

Example: <creation>

<creation>
    <date when="1918-05"/>
    <placeName>Ripon</placeName>
    <listChange ordered="true">
        <change xml:id="CHG-1">First stage, written in pencil in Owen's hand </change>
        <change xml:id="CHG-2">Second stage, revised in pencil in Owen's hand</change>
        <change xml:id="CHG-3">Fixation of the revised passages and further minor revisions by Owen using ink</change>
        <change xml:id="CHG-4">Addition of another stanza with a different ink, probably at a later stage</change>
    </listChange>
</creation>

Here <listChange> records the stages through which the document itself changed. Further down, in <revisionDesc>, the same element is used to record changes to the electronic file.

Language and character set usage

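The <langUsage> element is provided to document usage of languages and writing systems in the text. Languages are identified by BCP 47 language tags, which are built from ISO codes for language, script, and region, plus optional private-use subtags such as x-ijmes: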

<langUsage>
    <language ident="ar">Arabic</language>
    <language ident="ar-Latn-x-ijmes">Arabic transcribed into Latin script following the IJMES conventions</language>
    <language ident="ar-Latn-EN">Arabic transcribed into Latin script following common English practices</language>
    <language ident="ar-Latn-FR">Arabic transcribed into Latin script following common French practices</language>
    <language ident="en">English</language>
    <language ident="fa">Farsi</language>
    <language ident="fa-Latn-x-ijmes">Farsi transcribed into Latin script following the IJMES conventions</language>
    <language ident="fr">French</language>
    <language ident="ota">Ottoman</language>
    <language ident="ota-Latn-x-ijmes">Ottoman transcribed into Latin script following the IJMES conventions</language>
    <language ident="tr">Turkish</language>
</langUsage>
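
Each @ident value can then be used as an @xml:lang value in the text. A short sketch, reusing the title and place of publication from the source description; the sentence itself is made up:

<!-- the paragraph is English; the embedded title is transliterated Arabic -->
<p xml:lang="en">The newspaper <title level="j" xml:lang="ar-Latn-x-ijmes">Kawkab Amīrkā</title> was published in <placeName>New York</placeName>.</p>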

The calendar description

Example:

<calendarDesc>
    <calendar xml:id="cal_islamic">
        <p>Islamic <hi>hijrī</hi> calendar: lunar calendar beginning the year with 1 Muḥarram.</p>
    </calendar>
    <calendar xml:id="cal_julian">
        <p>Reformed Julian calendar beginning the year with 1 January.</p>
    </calendar>
    <calendar xml:id="cal_ottomanfiscal">
        <p>Ottoman fiscal calendar: a lunisolar calendar. It is based on the Old Julian calendar beginning the year with 1 March and synchronised with the <hi>hijrī</hi> year count every 33 years.</p>
    </calendar>
</calendarDesc>

Classification Methods

<textClass> groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. using one or more of the following ways:

This categorization applies to the whole text. For more fine-grained classification, use @decls on, for example, a <div> element to point to the applicable declaration in the header.
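
As a sketch of how this might look, the header below holds a default classification and an alternative one, and a <div> selects the alternative with @decls. All xml:id values, keywords, and the category pointer are hypothetical:

<profileDesc>
    <!-- applies to the whole text unless overridden -->
    <textClass xml:id="tc_default" default="true">
        <keywords>
            <term>newspapers</term>
            <term>Arabic press</term>
        </keywords>
        <!-- points to a category that would have to be declared in a <taxonomy> -->
        <catRef target="#genre_news"/>
    </textClass>
    <textClass xml:id="tc_advert">
        <keywords>
            <term>advertisements</term>
        </keywords>
    </textClass>
</profileDesc>
<!-- ... later, inside <text> ... -->
<div type="section" decls="#tc_advert">
    <!-- a section made up of advertisements -->
</div>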

Detailed characterization of a text

<textDesc> describes a text in terms of its ‘situational parameters’, that is, the situation within which the text was produced or experienced.

<textDesc n="novel">
    <channel mode="w">print; part issues</channel>
    <constitution type="single"/>
    <derivation type="original"/>
    <domain type="art"/>
    <factuality type="fiction"/>
    <interaction type="none"/>
    <preparedness type="prepared"/>
    <purpose degree="high" type="entertain"/>
    <purpose degree="medium" type="inform"/>
</textDesc>

The participants description: <particDesc>

<particDesc> can contain just paragraphs of prose, or more structured <person> elements in a <listPerson>.

Example 1:

<particDesc xml:id="p2">
    <p>Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme </p>
</particDesc>

Example 2:

<particDesc xml:lang="en">
    <listPerson>
       <head>People mentioned in the text</head>
       <person xml:id="pers_1">
          <persName xml:lang="ar">عبد الحميد الثاني</persName>
          <persName xml:lang="en">Abdulhamid II</persName>
          <birth>
             <date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1842"
                when-custom="1258-08-16" xml:lang="ar">١٦ شعبان ١٢٥٨ </date>
          </birth>
          <death>
             <date when="1918" xml:lang="ar">١٩١٨</date>
          </death>
          <idno type="GND">118646435</idno>
          <idno type="VIAF">9880442</idno>
          <state calendar="#cal_ottomanfiscal" datingMethod="#cal_ottomanfiscal"
             notBefore-custom="1876-06-18" xml:lang="en">
             <p>Sultan of the Ottoman Empire, 1876-1909</p>
          </state>
          <note xml:lang="en">
             <ref target="https://en.wikipedia.org/wiki/Abdul_Hamid_II">Wikipedia
                article</ref>
             <ref target="http://d-nb.info/gnd/118646435">PND</ref> 
          </note>
       </person>
       <person xml:id="pers_2">
          <persName xml:lang="ar"><addName type="title">الدكتور</addName> <forename>ابراهيم</forename> <surname>عربيلي</surname></persName>
          <persName xml:lang="en"><addName type="title">Dr.</addName> <forename>Abraham</forename> <surname>Arbeely</surname></persName>
          <state from="1892-04-15" xml:lang="en">
             <p>Editor of <orgName>Kawkab America</orgName>.</p>
          </state>
       </person>
       <person xml:id="pers_3">
          <persName xml:lang="ar">
             <forename>نجيب</forename> <forename>يوسف</forename> <surname>عربيلي</surname></persName>
          <persName xml:lang="en">
             <forename>Najeeb</forename> <forename>Joseph</forename> <surname>Arbeely</surname></persName>
          <state from="1892-04-15" xml:lang="en">
             <p>Editor of <orgName>Kawkab America</orgName>.</p>
          </state>
       </person>
    </listPerson>
</particDesc>

Revision Description: <revisionDesc>

Example:

<revisionDesc>
    <change when="2015-02-07">Added detailed profileDesc containing information on languages, calendars, and persons</change>
    <change when="2015-02-06">Added mark-up</change>
    <change when="2015-02-05"><persName>Till Grallert</persName> Created file</change>
</revisionDesc>
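
The same history could also be grouped in a <listChange>, the element shown earlier inside <creation>. This is only a sketch: the @who pointer assumes a hypothetical xml:id, pers_TG, declared elsewhere in the header, for example on a name inside <respStmt>.

<revisionDesc>
    <listChange>
        <!-- most recent change first; #pers_TG is a hypothetical identifier -->
        <change when="2015-02-07" who="#pers_TG">Added detailed profileDesc containing information on languages, calendars, and persons</change>
        <change when="2015-02-06" who="#pers_TG">Added mark-up</change>
        <change when="2015-02-05" who="#pers_TG">Created file</change>
    </listChange>
</revisionDesc>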

Some more metadata acronym soup

TEI provides a richer vocabulary than EAD or DCMI, and is less abstract than RDF or METS.