Till Grallert
2 Jun 2015
The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adopted to the needs of the 2015 Introduction to TEI at DHSI.
Slides were produced using MultiMarkDown, Pandoc, Slidy JS, and the Snippet jQuery Syntax highlighter.
TEI requires metadata to be stored inside the XML document, prefixed to the content. This information comprises the TEI header although, as we will see, some can be included inside the <body>
.
<teiHeader>
The TEI header was designed with two goals in mind
The result is that discussion of the header tends to be pulled in two directions…
The TEI header has four main components:
<fileDesc>
(file description) contains a full bibliographic description of a computer file. (“computer file” may actually correspond with several files across different operating system.)<encodingDesc>
(encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.<profileDesc>
(text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting. (just about everything not covered in the other header elements<revisionDesc>
(revision description) summarizes the revision history for a file.Note: Only <fileDesc>
is required; the others are optional.
<teiHeader>
<fileDesc>
<titleStmt>
<title>A title?</title>
</titleStmt>
<publicationStmt>
<p>Who published?</p>
</publicationStmt>
<sourceDesc>
<p>Where from?</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!-- some metadata relating to the file, the letters etc. are kept in -->
<fileDesc>
<titleStmt>
<title>PRO FO 618/3, Despatches to Constantinople, quarterly report, Hedjaz railway, reports on local politics etc., 1908</title>
</titleStmt>
<!-- ... -->
</fileDesc>
<!-- ... -->
</teiHeader>
<TEI xml:id="ProFo_618-3_Damascus_19081001_1" xml:lang="en">
<teiHeader>
<!-- metadate relating to the individual letter etc. -->
</teiHeader>
<facsimile>
<!-- links to image files -->
</facsimile>
<text>
<!-- transcription of the document -->
</text>
</TEI>
<!-- More <TEI>elements -->
</teiCorpus>
<subjectDecl>
, <refsDecl>
) enclose information about specific encoding practices applied in the electronic text.<settingDesc>
, <projectDesc>
) contain a prose description, possibly, but not necessarily, organised under some specific headings by suggested sub-elements.Kawkab America #55, 28 Apr 1893, p.1 (English)
Kawkab America #55, 28 Apr 1893, p.1 (Arabic)
<teiHeader xml:lang="en">
<fileDesc>
<titleStmt>
<title level="j">Kawkab America</title>
<title type="sub">Digital Edition</title>
</titleStmt>
<publicationStmt>
<p>Unpublished example edition for the DHSI 2015</p>
</publicationStmt>
<sourceDesc>
<p>Transcribed from digitised microfilm copies provided by the Center for Research Libraries</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<fileDesc>
1<titleStmt>
: provides a title for the resource and any associated statements of responsibility<sourceDesc>
: documents the sources from which the encoded text derives (if any)<publicationStmt>
: documents how the encoded text is published or distributed<editionStmt>
: yes, digital texts have editions too<seriesStmt>
: and they also fit into “series”.<extent>
: how many CDs, gigabytes, files?<notesStmt>
: notes of various types<fileDesc>
2<titleStmt>
: contains a mandatory <title>
which identifies the electronic file (not its source!)<author>
, <editor>
, <sponsor>
, <funder>
, <principal>
or the generic <respStmt>
<publicationStmt>
: may contain
<publisher>
, <distributor>
, <authority>
, each followed by <pubPlace>
, <address>
, <availability>
, <idno>
Within <titleStatement>
, you can repeat any of these elements as necessary, and document additional responsbilities with a generic <respStmt>
<titleStmt xml:lang="en">
<title level="j">Kawkab America</title>
<title type="sub">Digital Edition</title>
<editor><persName>Dr. A. Arbeely</persName></editor>
<editor><persName>N.J. Arbeely</persName></editor>
<respStmt>
<resp>created the TEI files</resp>
<name>Till Grallert</name>
</respStmt>
</titleStmt>
<editionStmt>
<extent>
<publicationStmt>
<publisher>
, <distributor>
and/or <authority>
must be present unless the entire publication statement is given as prose<profileDesc>
, not in the <publicationStmt>
<licence>
included in <availability>
<publicationStmt>
<publisher>UVic</publisher>
<distributor>Digital Humanities Summer Institute</distributor>
<authority>Till Grallert</authority>
<pubPlace>Victoria, BC, Canada</pubPlace>
<date from="2015-06-01" to="2015-06-05">1-5 June 2015</date>
<availability>
<licence>Licensed with a <ref target="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution</ref> licence.</licence>
</availability>
</publicationStmt>
<seriesStmt>
These include
<seriesStmt>
<title level="s">Machine-Readable Texts for the Study of Indian Literature</title>
<respStmt>
<resp>ed. by</resp>
<name>Jan Gonda</name>
</respStmt>
<biblScope unit="vol">1.2</biblScope>
<idno type="ISSN">0 345 6789</idno>
</seriesStmt>
<notesStmt>
The optional <notesStmt>
can contain notes on almost any aspect of the file or its contents:
<notesStmt>
<note>Transcribed for DHSI 2015 TEI Workshop</note>
</notesStmt>
These notes can be short statements, or many parargaphs long. Take care to encode such information with more precise elements elsewhere in the TEI header, when such elements are available. For example, text types, such as reportage or detective stories, should be described under <profileDesc>
<sourceDesc>
All electronic works need to document their source, even ‘born digital’ ones! There are variety of elements you may draw from:
<p>
<bibl>
(bibliographic citation): contains free text and/or any mixture of bibliographic elements such as <author>
, <publisher>
etc.<biblStruct>
(structured) contains similar elements but constrained in various ways according to bibliographic standards<biblFull>
(fully-structured) special-cases texts which were born TEI by replicating an embedded <fileDesc>
<listBibl>
may be used for lists of such descriptions, e.g. bibliographies<recordingStmt>
etc.) and for manuscripts (<msDesc>
) <listPerson>
, <listPlace>
, <listOrg>
Kawkab America #55, 28 Apr 1893, p.1 (English)
Kawkab America #55, 28 Apr 1893, p.1 (Arabic)
<sourceDesc>
<sourceDesc>
<biblStruct>
<monogr>
<title level="j" xml:lang="ar-Latn-x-ijmes">Kawkab Amīrkā</title>
<title level="j" xml:lang="en">The Star of America</title>
<title level="j" type="alternative" xml:lang="en">Kawkab America</title>
<title level="j" type="sub" xml:lang="ar-Latn-x-ijmes">Jarīda siyāsiyya ʿilmiyya tijāriyya adabiyya</title>
<editor xml:lang="ar-Latn-x-ijmes">
<persName><addName type="title">Dr.</addName> <forename>Ibrāhīm</forename> <surname>Arbīlī</surname></persName></editor>
<editor xml:lang="ar-Latn-x-ijmes">
<persName><forename>Najīb</forename> <forename>Yūsuf</forename> <surname>Arbīlī</surname></persName></editor>
<imprint>
<publisher xml:lang="ar-Latn-x-ijmes">al-Maṭbaʿat al-Sharqiyya</publisher>
<pubPlace xml:lang="ar-Latn-x-ijmes">Niyū Yūrk</pubPlace>
<publisher xml:lang="en">The Oriental Publishing House</publisher>
<pubPlace xml:lang="en">New York</pubPlace>
<date notAfter="1894-04-14" notBefore="1893-04-15" xml:lang="en">1893-1894</date>
<biblScope unit="volume">2</biblScope>
</imprint>
</monogr>
<idno type="callNumber"><!-- ideally this source should have a referenceable identifier --></idno>
</biblStruct>
</sourceDesc>
By default everything asserted by a header is true of the text to which it is prefixed. This can be over-ridden:
@decls
attribute (available on all model.declaring elements)<catRef>
and <taxonomy>
Most components of the encoding description are declarable.
<encodingDesc>
<encodingDesc>
groups notes about the procedures used when the text was encoded, either summarised in prose or within specific elements such as
<projectDesc>
: goals of the project<samplingDecl>
: sampling principles<editorialDecl>
: editorial principals, e.g. <correction>
, <normalization>
, <quotation>
, <hyphenation>
, <segmentation>
, <interpretation>
<classDecl>
: classification system/s used<tagsDecl>
: specifics about usage of particular elementsDetailed notes in <encodingDesc>
could be used to generate section of an editorial description.
<encodingDesc>
1<encodingDesc xml:lang="en">
<projectDesc>
<p>Creation of a digital corpus of Arabic newspapers from Beirut published in the aftermath of the Young Turk Revolution</p>
</projectDesc>
<editorialDecl>
<correction>
<p>Apparent errors have been marked as <sic>sic</sic> but correct readings are not provided</p>
</correction>
<hyphenation>
<p>Hyphenation is not common to Arabic texts</p>
</hyphenation>
</editorialDecl>
</encodingDesc>
<encodingDesc>
1<encodingDesc>
<classDecl>
<taxonomy xml:id="part-of-speech">
<category xml:id="adje">
<catDesc>adjectives</catDesc>
<category xml:id="AJ0">
<catDesc>adjective (unmarked) (e.g. GOOD, OLD)</catDesc>
</category>
<category xml:id="AJC">
<catDesc>comparative adjective (e.g. BETTER, OLDER)</catDesc>
</category>
<category xml:id="AJS">
<catDesc>superlative adjective (e.g. BEST, OLDEST)</catDesc>
</category>
</category>
<category xml:id="AT0">
<catDesc>article (e.g. THE, A, AN)</catDesc>
</category>
<!-- ... -->
</taxonomy>
</classDecl>
</encodingDesc>
<tagsDecl>
The <tagsDecl>
records elements namespace, tag frequency, information about the usage of particular tags not specified elsewhere, and default text appearance in source
<rendition>
element<rendition>
: structured information about appearance in the source documentconsider this example:
<tagsDecl>
<rendition scheme="css" xml:id="r-center">text-align: center;</rendition>
<rendition scheme="css" xml:id="r-small">font-size: small;</rendition>
<rendition scheme="css" xml:id="r-large">font-size: large;</rendition>
</tagsDecl>
which you can easily point to from the text
<hi rendition="#r-center #r-large">this bit of text was large and centred</hi>
But compare:
<hi rend="large center">this bit of text was large and centred</hi>
<profileDesc>
A collection of descriptions, categorised only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:
<creation>
: information about the origination of the intellectual content of the text, e.g. time and place<langUsage>
: information about languages, registers, writing systems etc. used in the text; this is particularly important to our example<calendarDesc>
: information about calendars and dating methods as used in the text; this is particularly important to our example<textDesc>
and <textClass>
: classifications applied to the text by means of a list of specified criteria or by means of a collection of pointers, respectively<particDesc>
and <settingDesc>
: information about the ‘participants’, either real or depicted, in the text<handNotes>
: information about the particular style or hand distinguished within a manuscript<creation>
<creation>
<date when="1918-05"/>
<placeName>Ripon</placeName>
<listChange ordered="true">
<change xml:id="CHG-1">First stage, written in pencil in Owen's hand </change>
<change xml:id="CHG-2">Second stage, revised in pencil in Owen's hand</change>
<change xml:id="CHG-3">Fixation of the revised passages and further minor revisions by Owen using ink</change>
<change xml:id="CHG-4">Addition of another stanza with a different ink, probably at a later stage</change>
</listChange>
</creation>
Here <listChange>
records stages in changes to the document. Further down, in <revisionDesc>
the same element is used to record changes to the electronic file.
The <langUsage>
element is provided to document usage of languages and writing systems in the text. Languages are identified by their ISO codes:
<langUsage>
<language ident="ar">Arabic</language>
<language ident="ar-Latn-x-ijmes">Arabic transcribed into Latin script following the IJMES conventions</language>
<language ident="ar-Latn-EN">Arabic transcribed into Latin script following common English practices</language>
<language ident="ar-Latn-FR">Arabic transcribed into Latin script following common French practices</language>
<language ident="en">English</language>
<language ident="fa">Farsi</language>
<language ident="fa-Latn-x-ijmes">Farsi transcribed into Latin script following the IJMES conventions</language>
<language ident="fr">French</language>
<language ident="ota">Ottoman</language>
<language ident="ota-Latn-x-ijmes">Ottoman transcribed into Latin script following the IJMES conventions</language>
<language ident="tr">Turkish</language>
</langUsage>
<calendarDesc>
contains a description of the calendar system used in any dating expression found in the text. This element may contain one or more <calendar>
elements, but it is generally presumed that the absence of any specific declaration signifies the Gregorian calendar<calendar>
element contains one or more paragraphs of description for the calendar system concerned, and also supplies an identifying code for it as the value of its @xml:id attribute.Example:
<calendarDes>
<calendar xml:id="cal_islamic">
<p>Islamic <hi>hijrī</hi> calendar: lunar calendar beginning the Year with 1 Muḥarram.</p>
</calendar>
<calendar xml:id="cal_julian">
<p>Reformed Julian calendar beginning the Year with 1 January.</p>
</calendar>
<calendar xml:id="cal_ottomanfiscal">
<p>Ottoman fiscal calendar: a lunosolar calendar. It is based on the Old Julian calendar beginning the Year with 1 March and synchronised with <hi>hijrī</hi> year count every 33 years.</p>
</calendar>
</calendarDes>
<textClass>
groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. using one or more of the following ways:
<catRef>
: direct reference to a locally defined (e.g.in the corpus header) category<classCode>
: reference to some standard and externally defined classification scheme<keywords>
: assign arbitrary descriptive terms taken from a bibliographic controlled vocabulary or a tag cloudThis categorization applies to the whole text. For more fine grained classification, use @decls on e.g. a <div>
element to point to applicable variation in header.
<textDesc>
provides a description of a text in terms of its ‘Situational parameters’, a description of the situation whithin which the text was produced or experienced.
<textDesc n="novel">
<channel mode="w">print; part issues</channel>
<constitution type="single"/>
<derivation type="original"/>
<domain type="art"/>
<factuality type="fiction"/>
<interaction type="none"/>
<preparedness type="prepared"/>
<purpose degree="high" type="entertain"/>
<purpose degree="medium" type="inform"/>
</textDesc>
<particDesc>
<particDesc>
can just contain paragraphs of prose, or a more structured <person>
element in <listPerson>
Example 1:
<particDesc xml:id="p2">
<p>Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme </p>
</particDesc>
Example 2:
<particDesc xml:lang="en">
<listPerson>
<head>People mentioned in the text</head>
<person xml:id="pers_1">
<persName xml:lang="ar">عبد الحميد الثاني</persName>
<persName xml:lang="en">Abdulhamid II</persName>
<birth>
<date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1842"
when-custom="1258-08-16" xml:lang="ar">١٦ شعبان ١٢٥٨ </date>
</birth>
<death>
<date when="1918" xml:lang="ar">١٩١٨</date>
</death>
<idno type="GND">118646435</idno>
<idno type="VIAF">9880442</idno>
<state calendar="#cal_ottomanfiscal" datingMethod="#cal_ottomanfiscal"
notBefore-custom="1876-06-18" xml:lang="en">
<p>Sultan of the Ottoman Empire, 1876-1909</p>
</state>
<note xml:lang="en">
<ref target="https://en.wikipedia.org/wiki/Abdul_Hamid_II">Wikipedia
article</ref>
<ref target="http://d-nb.info/gnd/118646435">PND</ref>
</note>
</person>
<person xml:id="pers_2">
<persName xml:lang="ar"><addName type="title">الدكتور</addName> <forename>ابراهيم</forename> <surname>عربيلي</surname></persName>
<persName xml:lang="en"><addName type="title">Dr.</addName> <forename>Abraham</forename> <surname>Arbeely</surname></persName>
<state from="1892-04-15" xml:lang="en">
<p>Editor of <orgName>Kawkab America</orgName>.</p>
</state>
</person>
<person xml:id="pers_3">
<persName xml:lang="ar">
<forename>نجيب</forename> <forename>يوسف</forename> <surname>عربيلي</surname></persName>
<persName xml:lang="en">
<forename>Najeeb</forename> <forename>Joseph</forename> <surname>Arbeely</surname></persName>
<state from="1892-04-15" xml:lang="en">
<p>Editor of <orgName>Kawkab America</orgName>.</p>
</state>
</person>
</listPerson>
</particDesc>
<revisionDesc>
<change>
elements, each with a @date
and @who
attributes, indicating significant stages in the evolution of a document. Most recent first.<creation>
it is about the document.Example:
<revisionDesc>
<change when="2015-02-07">Added detailed profileDesc containing information on languages, calendars, and persons</change>
<change when="2015-02-06">Added mark-up</change>
<change when="2015-02-05"><persName>Till Grallert</persName> Created file</change>
</revisionDesc>
TEI provides a richer vocabulary than EAD or DCMI, and is less abstract than RDF or METS
That was a whirlwind tour of the kinds of things you can put in the <teiHeader>
. However, be aware that adding some of the other TEI modules can expand what is allowed in the header. (e.g. Manuscript Description)
Now let’s do an exercise where we make an existing <teiHeader>
element better!