Topic: TEI Metadata

The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the example of Arabic newspapers.

Slides were produced using MultiMarkDown, Pandoc, and the Slidy JS code of the W3C.

What is metadata?

General purposes of metadata

TEI metadata

TEI requires metadata to be stored inside the XML document, prefixed to the content. This information makes up the TEI header, although, as we will see, some of it can also be recorded inside the <body>.

The TEI header: <teiHeader>

The TEI header was designed with two goals in mind

The result is that discussion of the header tends to be pulled in two directions...

The librarian's header

Everywoman's header

TEI header structure

The TEI header has four main components: <fileDesc>, <encodingDesc>, <profileDesc>, and <revisionDesc>.

Only <fileDesc> is required; the others are optional.
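As a rough orientation, the overall shape of a header can be sketched as follows. This is a skeleton only: the comments stand in for real content, and a schema-valid <fileDesc> needs child elements that are left out here.

```xml
<teiHeader>
    <!-- required: bibliographic description of the electronic file -->
    <fileDesc>
        <!-- title, publication, and source statements go here -->
    </fileDesc>
    <!-- optional: how the text was encoded -->
    <encodingDesc>
        <!-- editorial principles, taxonomies, tag usage -->
    </encodingDesc>
    <!-- optional: non-bibliographic context of the text -->
    <profileDesc>
        <!-- languages, calendars, classification, participants -->
    </profileDesc>
    <!-- optional: change log of the electronic file -->
    <revisionDesc>
        <!-- one change entry per revision -->
    </revisionDesc>
</teiHeader>
```

Whatever optional components are present, <fileDesc> comes first and <revisionDesc> comes last.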

Example header: minimal required header

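<teiHeader><fileDesc>
    <titleStmt><title>A title?</title></titleStmt>
    <publicationStmt><p>Who published?</p></publicationStmt>
    <sourceDesc><p>Where from?</p></sourceDesc>
</fileDesc></teiHeader>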

Two levels of TEI headers

Corpus header example

<teiCorpus xmlns="">
    <!-- Add xmlns and version in <teiCorpus>-->
    <!-- Must contain one TEI header for the corpus. -->
    <teiHeader type="corpus">
        <!-- corpus-level metadata here -->
    </teiHeader>
    <!-- Must contain a series of TEI elements, one for each text. -->
    <TEI>
        <teiHeader type="text">
            <!-- metadata specific to this text here -->
        </teiHeader>
        <text>
            <!-- ... -->
        </text>
    </TEI>
    <TEI>
        <teiHeader type="text">
            <!-- metadata specific to this text here -->
        </teiHeader>
        <text>
            <!-- ... -->
        </text>
    </TEI>
</teiCorpus>

Types of content in the TEI header

Example source: al-Iqbāl #257, 27 July 1908

Front page of al-Iqbāl #257, 27 July 1908

Example header: minimal required header

<teiHeader xml:lang="en">
    <fileDesc>
        <titleStmt>
            <title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
            <title type="sub">Digital Edition</title>
        </titleStmt>
        <publicationStmt><p>Unpublished example edition for the DHI Beirut 2015</p></publicationStmt>
        <sourceDesc><p>Transcribed from microfilm copies (classmark Mic-Na:164) located in the AUB library</p></sourceDesc>
    </fileDesc>
</teiHeader>

File description <fileDesc> 1

File description <fileDesc> 2

Title and responsibility statements

Within <titleStmt>, you can repeat any of these elements as necessary, and document additional responsibilities with a generic <respStmt>:

<titleStmt xml:lang="en">
    <title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
    <title type="sub">Digital Edition</title>
    <author xml:lang="ar-Latn-x-ijmes">ʿAbd al-Bāsiṭ al-Unsī</author>
    <author xml:lang="ar-Latn-x-ijmes">Muḥammad al-Jisr</author>
    <respStmt>
        <resp>created the TEI files</resp>
        <name>Till Grallert</name>
    </respStmt>
</titleStmt>

Edition and extent statements

The publication statement <publicationStmt>

Example: publication statement

<publicationStmt>
    <distributor>Digital Humanities Institute Beirut</distributor>
    <authority>Till Grallert</authority>
    <date from="2015-03-02" to="2015-03-06">2-6 March 2015</date>
    <availability>
        <licence>Licensed with a 
            <ref target="">Creative Commons Attribution</ref> licence
        </licence>
    </availability>
</publicationStmt>

Series statement: <seriesStmt>

These include

Example: series statement

<seriesStmt>
    <title level="s">Machine-Readable Texts for the Study of Indian Literature</title>
    <respStmt>
        <resp>ed. by</resp>
        <name>Jan Gonda</name>
    </respStmt>
    <biblScope unit="vol">1.2</biblScope>
    <idno type="ISSN">0 345 6789</idno>
</seriesStmt>

Notes statement: <notesStmt>

The optional <notesStmt> can contain notes on almost any aspect of the file or its contents:

<notesStmt><note>Transcribed for DHI Beirut TEI Workshop</note></notesStmt>

These notes can be short statements or many paragraphs long. Where the TEI header provides a more precise element for a piece of information, use that element instead of a note. For example, text types, such as reportage or detective stories, should be described under <profileDesc>.

The source description statement: <sourceDesc>

All electronic works need to document their source, even 'born digital' ones! There is a variety of elements you may draw from:
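As a minimal sketch, the two simplest options look like this: a prose statement for a file without any pre-existing source, and a loosely structured <bibl> for a printed source. The wording of the prose statement is an assumption made for illustration; the bibliographic details are those of the example issue used in these slides.

```xml
<!-- Option 1: a born-digital file states that it has no source -->
<sourceDesc>
    <p>Born digital: this file has no pre-existing source.</p>
</sourceDesc>

<!-- Option 2: a short, loosely structured citation of a printed source -->
<sourceDesc>
    <bibl><title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>, no. 257, <date when="1908-07-27">27 July 1908</date>.</bibl>
</sourceDesc>
```

A header would contain only one of these; they are shown together for comparison.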

Example: <sourceDesc>

<sourceDesc xml:lang="en">
    <biblStruct>
        <monogr>
            <title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
            <title level="j" xml:lang="ar">الاقبال </title>
            <title level="j" type="sub" xml:lang="ar">جريدة اسبوعية تصدر كل يوم الاثنين في <placeName>بيروت</placeName></title>
            <imprint>
                <publisher xml:lang="ar-Latn-x-ijmes">ʿAbd al-Bāsiṭ al-Unsī</publisher>
                <publisher xml:lang="ar-Latn-x-ijmes">Muḥammad al-Jisr</publisher>
                <date when="1908-07-27">Mon, 27 July 1908</date>
                <biblScope type="vol">7</biblScope>
                <biblScope type="issue">257</biblScope>
                <biblScope type="pp">1-8</biblScope>
            </imprint>
        </monogr>
        <idno type="class-mark">Mic-Na:164</idno>
    </biblStruct>
</sourceDesc>

Association between header and text

By default, everything asserted by a header is true of the text to which it is prefixed. This can be overridden:

Most components of the encoding description are declarable.

Encoding description: <encodingDesc>

<encodingDesc> groups notes about the procedures used when the text was encoded, either summarised in prose or within specific elements such as

Detailed notes in <encodingDesc> could be used to generate sections of an editorial description.

Example: <encodingDesc> 1

<encodingDesc xml:lang="en">
    <projectDesc><p>Creation of a digital corpus of Arabic newspapers from Beirut published in the aftermath of the Young Turk Revolution</p></projectDesc>
    <editorialDecl>
        <correction><p>Apparent errors have been marked as <sic>sic</sic> but correct readings are not provided</p></correction>
        <hyphenation><p>Hyphenation is not common to Arabic texts</p></hyphenation>
    </editorialDecl>
</encodingDesc>

Example: <encodingDesc> 2

<encodingDesc>
    <classDecl>
        <taxonomy xml:id="part-of-speech">
            <category xml:id="adje">
                <category xml:id="AJ0">
                    <catDesc>adjective (unmarked) (e.g. GOOD, OLD)</catDesc>
                </category>
                <category xml:id="AJC">
                    <catDesc>comparative adjective (e.g. BETTER, OLDER)</catDesc>
                </category>
                <category xml:id="AJS">
                    <catDesc>superlative adjective (e.g. BEST, OLDEST)</catDesc>
                </category>
            </category>
            <category xml:id="AT0">
                <catDesc>article (e.g. THE, A, AN)</catDesc>
            </category>
            <!-- ... -->
        </taxonomy>
    </classDecl>
</encodingDesc>

The tagging declaration: <tagsDecl>

The <tagsDecl> records the namespace of the elements used, tag frequencies, information about the usage of particular tags not specified elsewhere, and the default appearance of text in the source.
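The namespace and frequency information might be recorded along these lines. This is a sketch only: the counts and the usage notes are invented for illustration and do not describe an actual file.

```xml
<tagsDecl>
    <!-- element usage is grouped by the namespace the elements belong to -->
    <namespace name="http://www.tei-c.org/ns/1.0">
        <!-- @gi names the element, @occurs counts its occurrences -->
        <tagUsage gi="hi" occurs="28">Marks typographically distinct phrases.</tagUsage>
        <!-- @withId counts the occurrences that carry an @xml:id -->
        <tagUsage gi="pb" occurs="8" withId="8">One per page of the issue.</tagUsage>
    </namespace>
</tagsDecl>
```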

The <rendition> element

Consider this example:

<tagsDecl>
    <rendition scheme="css" xml:id="r-center">text-align: center;</rendition>
    <rendition scheme="css" xml:id="r-small">font-size: small;</rendition>
    <rendition scheme="css" xml:id="r-large">font-size: large;</rendition>
</tagsDecl>

which you can easily point to from the text:

<hi rendition="#r-center #r-large">this bit of text was large and centred</hi>

But compare:

<hi rend="large center">this bit of text was large and centred</hi>

The profile description: <profileDesc>

A collection of descriptions, categorised only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:

Example <creation>

<creation>
    <date when="1918-05"/>
    <listChange ordered="true">
        <change xml:id="CHG-1">First stage, written in pencil in Owen's hand</change>
        <change xml:id="CHG-2">Second stage, revised in pencil in Owen's hand</change>
        <change xml:id="CHG-3">Fixation of the revised passages and further minor revisions by Owen using ink</change>
        <change xml:id="CHG-4">Addition of another stanza with a different ink, probably at a later stage</change>
    </listChange>
</creation>

Here <listChange> records the successive stages in the writing and revision of the source document. Further down, in <revisionDesc>, the same element is used to record changes to the electronic file.

Language and character set usage

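The <langUsage> element documents the languages and writing systems used in the text. Each language is identified in @ident by a BCP 47 language tag, which builds on ISO 639 language codes and ISO 15924 script codes and allows private-use extensions such as x-ijmes: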

<langUsage>
    <language ident="ar">Arabic</language>
    <language ident="ar-Latn-x-ijmes">Arabic transcribed into Latin script following the IJMES conventions</language>
    <language ident="ar-Latn-EN">Arabic transcribed into Latin script following common English practices</language>
    <language ident="ar-Latn-FR">Arabic transcribed into Latin script following common French practices</language>
    <language ident="en">English</language>
    <language ident="fa">Farsi</language>
    <language ident="fa-Latn-x-ijmes">Farsi transcribed into Latin script following the IJMES conventions</language>
    <language ident="fr">French</language>
    <language ident="ota">Ottoman</language>
    <language ident="ota-Latn-x-ijmes">Ottoman transcribed into Latin script following the IJMES conventions</language>
    <language ident="tr">Turkish</language>
</langUsage>
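In the text itself, the same tags appear as values of @xml:lang, so that every stretch of text can be matched to a declared language. A small sketch, with an English sentence made up for illustration around the two forms of the newspaper's title:

```xml
<!-- the paragraph is in English; each title carries the tag declared for it in the header -->
<p xml:lang="en">The weekly <title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
    (<title level="j" xml:lang="ar">الاقبال</title>) was published in Beirut.</p>
```

Because @xml:lang is inherited, only the elements that depart from the language of their parent need to carry it.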

The calendar description


<calendarDesc>
    <calendar xml:id="cal_islamic">
        <p>Islamic <hi>hijrī</hi> calendar: lunar calendar beginning the year with 1 Muḥarram.</p>
    </calendar>
    <calendar xml:id="cal_julian">
        <p>Reformed Julian calendar beginning the year with 1 January.</p>
    </calendar>
    <calendar xml:id="cal_ottomanfiscal">
        <p>Ottoman fiscal calendar: a lunisolar calendar. It is based on the Old Julian calendar beginning the year with 1 March and synchronised with <hi>hijrī</hi> year count every 33 years.</p>
    </calendar>
</calendarDesc>

Classification Methods

<textClass> groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. using one or more of the following ways:

This categorization applies to the whole text. For more fine-grained classification, use @decls on, for example, a <div> element to point to the header element that applies to that part of the text.
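The following sketch shows two of the mechanisms <textClass> offers, <catRef> and <keywords>, together with a <div> that selects a non-default classification through @decls. The taxonomy, all identifiers, and the keyword are hypothetical and exist only for this illustration.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- <fileDesc> omitted for brevity -->
        <encodingDesc>
            <classDecl>
                <!-- hypothetical taxonomy; the ids are invented for this sketch -->
                <taxonomy xml:id="genre">
                    <category xml:id="genre-news"><catDesc>news report</catDesc></category>
                    <category xml:id="genre-advert"><catDesc>advertisement</catDesc></category>
                </taxonomy>
            </classDecl>
        </encodingDesc>
        <profileDesc>
            <!-- default classification for the whole text -->
            <textClass xml:id="tc-news" default="true">
                <catRef target="#genre-news"/>
                <keywords><term>politics</term></keywords>
            </textClass>
            <!-- alternative classification, selected explicitly via @decls -->
            <textClass xml:id="tc-advert">
                <catRef target="#genre-advert"/>
            </textClass>
        </profileDesc>
    </teiHeader>
    <text>
        <body>
            <div><p><!-- classified as news by default --></p></div>
            <div decls="#tc-advert"><p><!-- classified as advertisement --></p></div>
        </body>
    </text>
</TEI>
```

The first <div> carries no @decls and therefore falls under the <textClass> marked as default; the second points to the alternative by its identifier.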

Detailed characterization of a text

<textDesc> provides a description of a text in terms of its ‘situational parameters’, that is, a description of the situation within which the text was produced or experienced.

<textDesc n="novel">
    <channel mode="w">print; part issues</channel>
    <constitution type="single"/>
    <derivation type="original"/>
    <domain type="art"/>
    <factuality type="fiction"/>
    <interaction type="none"/>
    <preparedness type="prepared"/>
    <purpose degree="high" type="entertain"/>
    <purpose degree="medium" type="inform"/>
</textDesc>

The participants description: <particDesc>

<particDesc> can just contain paragraphs of prose, or a more structured <person> element in <listPerson>

Example 1:

<particDesc xml:id="p2">
    <p>Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme </p>
</particDesc>

Example 2:

<particDesc xml:lang="ar">
    <listPerson>
        <head>People mentioned in the text</head>
        <person xml:id="pers-1" xml:lang="ar">
            <persName xml:lang="ar">عبد الحميد الثاني</persName>
            <birth>
                <date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1842" when-custom="1258-08-16">١٦ شعبان ١٢٥٨ </date>
            </birth>
            <death>
                <date when="1918">١٩١٨</date>
            </death>
            <idno type="GND">118646435</idno>
            <idno type="VIAF">9880442</idno>
            <state calendar="#cal_ottomanfiscal" datingMethod="#cal_ottomanfiscal" notBefore-custom="1876-06-18" xml:lang="en">
                <p>Sultan of the Ottoman Empire, 1876-1909</p>
            </state>
            <note xml:lang="en">
                <ref target="">Wikipedia article</ref>
            </note>
        </person>
        <person xml:id="pers-2">
            <persName xml:lang="ar">
                <forename xml:lang="ar">احمد</forename>
                <forename xml:lang="ar">حسن</forename>
                <surname xml:lang="ar">طباره</surname>
            </persName>
            <state xml:lang="en">
                <p>Editor of <orgName xml:lang="ar-Latn-x-ijmes">Thamarāt al-Funūn</orgName>.</p>
            </state>
        </person>
    </listPerson>
</particDesc>

Revision Description: <revisionDesc>


<revisionDesc>
    <change when="2015-02-07">Added detailed profileDesc containing information on languages, calendars, and persons</change>
    <change when="2015-02-06">Added mark-up</change>
    <change when="2015-02-05"><persName>Till Grallert</persName> Created file</change>
</revisionDesc>

Some more metadata acronym soup

TEI provides a richer vocabulary than EAD or DCMI, and is less abstract than RDF or METS