The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adopted to the example of Arabic newspapers.
Slides were produced using MultiMarkDown, Pandoc, and the Slidy JS code of the W3C.
TEI requires metadata to be stored inside the XML document, prefixed to the content. This information comprises the TEI header although, as we will see, some can be included inside the <body>
.
<teiHeader>
The TEI header was designed with two goals in mind
The result is that discussion of the header tends to be pulled in two directions...
The TEI header has four main components:
<fileDesc>
(file description) contains a full bibliographic description of a computer file. ("computer file" may actually correspond with several files across different operating system.)<encodingDesc>
(encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.<profileDesc>
(text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting. (just about everything not covered in the other header elements<revisionDesc>
(revision description) summarizes the revision history for a file.Only <fileDesc>
is required; the others are optional.
<teiHeader>
<fileDesc>
<titleStmt>
<title>A title?</title>
</titleStmt>
<publicationStmt>
<p>Who published?</p>
</publicationStmt>
<sourceDesc>
<p>Where from?</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<!-- Add xmlns and version in <teiCorpus>-->
<teiHeader type="corpus">
<!-- corpus-level metadata here -->
<!-- Must contain one TEI header for the corpus. -->
</teiHeader>
<TEI>
<teiHeader type="text">
<!-- metadata specific to this text here -->
<!-- Must contain a series of TEI elements, one for each text. -->
</teiHeader>
<text>
<!-- ... -->
</text>
</TEI>
<TEI>
<teiHeader type="text">
<!-- metadata specific to this text here -->
</teiHeader>
<text>
<!-- ... -->
</text>
</TEI>
</teiCorpus>
<subjectDecl>
, <refsDecl>
) enclose information about specific encoding practices applied in the electronic text.<settingDesc>
, <projectDesc>
) contain a prose description, possibly, but not necessarily, organised under some specific headings by suggested sub-elements.Front page of al-Iqbāl #257, 27 July 1908
<teiHeader xml:lang="en">
<fileDesc>
<titleStmt>
<title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
<title type="sub">Digital Edition</title>
</titleStmt>
<publicationStmt>
<p>Unpublished example edition for the DHI Beirut 2015</p>
</publicationStmt>
<sourceDesc>
<p>Transcribed from microfilm copies (classmark Mic-Na:164) located in the AUB library</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<fileDesc>
1<titleStmt>
: provides a title for the resource and any associated statements of responsibility<sourceDesc>
: documents the sources from which the encoded text derives (if any)<publicationStmt>
: documents how the encoded text is published or distributed<editionStmt>
: yes, digital texts have editions too<seriesStmt>
: and they also fit into "series".<extent>
: how many CDs, gigabytes, files?<notesStmt>
: notes of various types<fileDesc>
2<titleStmt>
: contains a mandatory <title>
which identifies the electronic file (not its source!)<author>
, <editor>
, <sponsor>
, <funder>
, <principal>
or the generic <respStmt>
<publicationStmt>
: may contain
<publisher>
, <distributor>
, <authority>
, each followed by <pubPlace>
, <address>
, <availability>
, <idno>
Within <titleStatement>
, you can repeat any of these elements as necessary, and document additional responsbilities with a generic <respStmt>
<titleStmt xml:lang="en">
<title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
<title type="sub">Digital Edition</title>
<author xml:lang="ar-Latn-x-ijmes">ʿAbd al-Bāsiṭ al-Unsī</author>
<author xml:lang="ar-Latn-x-ijmes">Muḥammad al-Jisr</author>
<respStmt>
<resp>created the TEI files</resp>
<name>Till Grallert</name>
</respStmt>
</titleStmt>
<editionStmt>
<extent>
<publicationStmt>
<publisher>
, <distributor>
and/or <authority>
must be present unless the entire publication statement is given as prose<profileDesc>
, not in the <publicationStmt>
<licence>
included in <availability>
<publicationStmt>
<publisher>AUB</publisher>
<distributor>Digital Humanities Institute Beirut</distributor>
<authority>Till Grallert</authority>
<pubPlace>Beirut</pubPlace>
<date from="2015-03-02" to="2015-03-06">2-6 March 2015</date>
<availability>
<licence>Licensed with a
<ref target="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution</ref>
licence.</licence>
</availability>
</publicationStmt>
<seriesStmt>
These include
<seriesStmt>
<title level="s">Machine-Readable Texts for the Study of Indian Literature</title>
<respStmt>
<resp>ed. by</resp>
<name>Jan Gonda</name>
</respStmt>
<biblScope unit="vol">1.2</biblScope>
<idno type="ISSN">0 345 6789</idno>
</seriesStmt>
<notesStmt>
The optional <notesStmt>
can contain notes on almost any aspect of the file or its contents:
<notesStmt>
<note>Transcribed for DHI Beirut TEI Workshop</note>
</notesStmt>
These notes can be short statements, or many parargaphs long. Take care to encode such information with more precise elements elsewhere in the TEI header, when such elements are available. For example, text types, such as reportage or detective stories, should be described under <profileDesc>
<sourceDesc>
All electronic works need to document their source, even 'born digital' ones! There are variety of elements you may draw from:
<p>
<bibl>
(bibliographic citation): contains free text and/or any mixture of bibliographic elements such as <author>
, <publisher>
etc.<biblStruct>
(structured) contains similar elements but constrained in various ways according to bibliographic standards<biblFull>
(fully-structured) special-cases texts which were born TEI by replicating an embedded <fileDesc>
<listBibl>
may be used for lists of such descriptions, e.g. bibliographies<recordingStmt>
etc.) and for manuscripts (<msDesc>
) <listPerson>
, <listPlace>
, <listOrg>
Front page of al-Iqbāl #257, 27 July 1908
<sourceDesc>
<sourceDesc xml:lang="en">
<biblStruct>
<monogr>
<title level="j" xml:lang="ar-Latn-x-ijmes">al-Iqbāl</title>
<title level="j" xml:lang="ar">الاقبال </title>
<title level="j" type="sub" xml:lang="ar">جريدة اسبوعية تصدر كل يوم الاثنين في <placeName>بيروت</placeName></title>
<imprint>
<publisher xml:lang="ar-Latn-x-ijmes">ʿAbd al-Bāsiṭ al-Unsī</publisher>
<publisher xml:lang="ar-Latn-x-ijmes">Muḥammad al-Jisr</publisher>
<date when="1908-07-27">Mon, 27 July 1908</date>
<pubPlace>Beirut</pubPlace>
<biblScope type="vol">7</biblScope>
<biblScope type="issue">257</biblScope>
<biblScope type="pp">1-8</biblScope>
</imprint>
</monogr>
<idno type="class-mark">Mic-Na:164</idno>
</biblStruct>
</sourceDesc>
By default everything asserted by a header is true of the text to which it is prefixed. This can be over-ridden:
<catRef>
and <taxonomy>
Most components of the encoding description are declarable.
<encodingDesc>
<encodingDesc>
groups notes about the procedures used when the text was encoded, either summarised in prose or within specific elements such as
<projectDesc>
: goals of the project<samplingDecl>
: sampling principles<editorialDecl>
: editorial principals, e.g. <correction>
, <normalization>
, <quotation>
, <hyphenation>
, <segmentation>
, <interpretation>
<classDecl>
: classification system/s used<tagsDecl>
: specifics about usage of particular elementsDetailed notes in <encodingDesc>
could be used to generate section of an editorial description.
<encodingDesc>
1<encodingDesc xml:lang="en">
<projectDesc>
<p>Creation of a digital corpus of Arabic newspapers from Beirut published in the aftermath of the Young Turk Revolution</p>
</projectDesc>
<editorialDecl>
<correction>
<p>Apparent errors have been marked as <sic>sic</sic> but correct readings are not provided</p>
</correction>
<hyphenation>
<p>Hyphenation is not common to Arabic texts</p>
</hyphenation>
</editorialDecl>
</encodingDesc>
<encodingDesc>
1<encodingDesc>
<classDecl>
<taxonomy xml:id="part-of-speech">
<category xml:id="adje">
<catDesc>adjectives</catDesc>
<category xml:id="AJ0">
<catDesc>adjective (unmarked) (e.g. GOOD, OLD)</catDesc>
</category>
<category xml:id="AJC">
<catDesc>comparative adjective (e.g. BETTER, OLDER)</catDesc>
</category>
<category xml:id="AJS">
<catDesc>superlative adjective (e.g. BEST, OLDEST)</catDesc>
</category>
</category>
<category xml:id="AT0">
<catDesc>article (e.g. THE, A, AN)</catDesc>
</category>
<!-- ... -->
</taxonomy>
</classDecl>
</encodingDesc>
<tagsDecl>
The <tagsDecl>
records elements namespace, tag frequency, information about the usage of particular tags not specified elsewhere, and default text appearance in source
<rendition>
element<rendition>
: structured information about appearance in the source documentconsider this example:
<tagsDecl>
<rendition scheme="css" xml:id="r-center">text-align: center;</rendition>
<rendition scheme="css" xml:id="r-small">font-size: small;</rendition>
<rendition scheme="css" xml:id="r-large">font-size: large;</rendition>
</tagsDecl>
which you can easily point to from the text
<hi rendition="#r-center #r-large">this bit of text was large and centred</hi>
But compare:
<hi rend="large center">this bit of text was large and centred</hi>
<profileDesc>
A collection of descriptions, categorised only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:
<creation>
: information about the origination of the intellectual content of the text, e.g. time and place<langUsage>
: information about languages, registers, writing systems etc. used in the text; this is particularly important to our example<calendarDesc>
: information about calendars and dating methods as used in the text; this is particularly important to our example<textDesc>
and <textClass>
: classifications applied to the text by means of a list of specified criteria or by means of a collection of pointers, respectively<particDesc>
and <settingDesc>
: information about the ‘participants’, either real or depicted, in the text<handNotes>
: information about the particular style or hand distinguished within a manuscript<creation>
<creation>
<date when="1918-05"/>
<placeName>Ripon</placeName>
<listChange ordered="true">
<change xml:id="CHG-1">First stage, written in pencil in Owen's hand </change>
<change xml:id="CHG-2">Second stage, revised in pencil in Owen's hand</change>
<change xml:id="CHG-3">Fixation of the revised passages and further minor revisions by Owen using ink</change>
<change xml:id="CHG-4">Addition of another stanza with a different ink, probably at a later stage</change>
</listChange>
</creation>
Here <listChange>
records stages in changes to the document. Further down, in <revisionDesc>
the same element is used to record changes to the electronic file.
The <langUsage>
element is provided to document usage of languages and writing systems in the text. Languages are identified by their ISO codes:
<langUsage>
<language ident="ar">Arabic</language>
<language ident="ar-Latn-x-ijmes">Arabic transcribed into Latin script following the IJMES conventions</language>
<language ident="ar-Latn-EN">Arabic transcribed into Latin script following common English practices</language>
<language ident="ar-Latn-FR">Arabic transcribed into Latin script following common French practices</language>
<language ident="en">English</language>
<language ident="fa">Farsi</language>
<language ident="fa-Latn-x-ijmes">Farsi transcribed into Latin script following the IJMES conventions</language>
<language ident="fr">French</language>
<language ident="ota">Ottoman</language>
<language ident="ota-Latn-x-ijmes">Ottoman transcribed into Latin script following the IJMES conventions</language>
<language ident="tr">Turkish</language>
</langUsage>
<calendarDesc>
contains a description of the calendar system used in any dating expression found in the text. This element may contain one or more <calendar>
elements, but it is generally presumed that the absence of any specific declaration signifies the Gregorian calendar<calendar>
element contains one or more paragraphs of description for the calendar system concerned, and also supplies an identifying code for it as the value of its @xml:id attribute.Example:
<calendarDes>
<calendar xml:id="cal_islamic">
<p>Islamic <hi>hijrī</hi> calendar: lunar calendar beginning the Year with 1 Muḥarram.</p>
</calendar>
<calendar xml:id="cal_julian">
<p>Reformed Julian calendar beginning the Year with 1 January.</p>
</calendar>
<calendar xml:id="cal_ottomanfiscal">
<p>Ottoman fiscal calendar: a lunosolar calendar. It is based on the Old Julian calendar beginning the Year with 1 March and synchronised with <hi>hijrī</hi> year count every 33 years.</p>
</calendar>
</calendarDes>
<textClass>
groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. using one or more of the following ways:
<catRef>
: direct reference to a locally defined (e.g.in the corpus header) category<classCode>
: reference to some standard and externally defined classification scheme<keywords>
: assign arbitrary descriptive terms taken from a bibliographic controlled vocabulary or a tag cloudThis categorization applies to the whole text. For more fine grained classification, use @decls on e.g. a <div>
element to point to applicable variation in header.
<textDesc>
provides a description of a text in terms of its ‘Situational parameters’, a description of the situation whithin which the text was produced or experienced.
<textDesc n="novel">
<channel mode="w">print; part issues</channel>
<constitution type="single"/>
<derivation type="original"/>
<domain type="art"/>
<factuality type="fiction"/>
<interaction type="none"/>
<preparedness type="prepared"/>
<purpose degree="high" type="entertain"/>
<purpose degree="medium" type="inform"/>
</textDesc>
<particDesc>
<particDesc>
can just contain paragraphs of prose, or a more structured <person>
element in <listPerson>
Example 1:
<particDesc xml:id="p2">
<p>Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme </p>
</particDesc>
Example 2:
<particDesc xml:lang="ar">
<listPerson>
<head>People mentioned in the text</head>
<person xml:id="pers-1" xml:lang="ar">
<persName xml:lang="ar">عبد الحميد الثاني</persName>
<birth>
<date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1842" when-custom="1258-08-16">١٦ شعبان ١٢٥٨ </date>
</birth>
<death>
<date when="1918">١٩١٨</date>
</death>
<idno type="GND">118646435</idno>
<idno type="VIAF">9880442</idno>
<state calendar="#cal_ottomanfiscal" datingMethod="#cal_ottomanfiscal" notBefore-custom="1876-06-18" xml:lang="en">
<p>Sultan of the Ottoman Empire, 1876-1909</p>
</state>
<note xml:lang="en">
<ref target="https://en.wikipedia.org/wiki/Abdul_Hamid_II">Wikipedia article</ref>
</note>
</person>
<person xml:id="pers-2">
<persName xml:lang="ar">
<forename xml:lang="ar">احمد</forename>
<forename xml:lang="ar">حسن</forename>
<surname xml:lang="ar">طباره</surname>
</persName>
<state xml:lang="en">
<p>Editor of <orgName xml:lang="ar-Latn-x-ijmes">Thamarāt al-Funūn</orgName>.</p>
</state>
</person>
</listPerson>
</particDesc>
<revisionDesc>
<change>
elements, each with a @date and @who attributes, indicating significant stages in the evolution of a document. Most recent first.<creation>
it is about the document.Example:
<revisionDesc>
<change when="2015-02-07">Added detailed profileDesc containing information on languages, calendars, and persons</change>
<change when="2015-02-06">Added mark-up</change>
<change when="2015-02-05"><persName>Till Grallert</persName> Created file</change>
</revisionDesc>
TEI provides a richer vocabulary than EAD or DCMI, and is less abstract than RDF or METS
That was a whirlwind tour of the kinds of things you can put in the <teiHeader>
, however be aware that adding some of the other TEI modules can expand what is allowed in the header. (e.g. Manuscript Description) Now let's do an exercise where we make an existing <teiHeader>
element better!