Till Grallert
2 March 2015
The slides are based on those supplied by the various Digital Humanities Summer Schools at the University of Oxford under the Creative Commons Attribution license and have been adapted to the example of Arabic newspapers.
Slides were produced using MultiMarkDown, Pandoc, Slidy JS, and the Snippet jQuery Syntax highlighter.
Front page of al-Iqbāl #257, 27 July 1908
Front page of al-Bashīr #1868, 27 July 1908
Front page of al-Bashīr, 3 August 1908
Front page of Lisān al-Ḥāl #5773, 27 July 1908
Front page of Thamarāt al-Funūn #1683, 27 July 1908
All TEI documents are structured in a particular manner. This section attempts to describe the different variations on this as briefly as possible.
There are two basic types of TEI document:
<TEI>
: contains a single TEI-conformant document, comprising a TEI header and a text, in various forms.
<teiCorpus>
: contains a TEI-encoded corpus, comprising a single corpus header and one or more <TEI> elements, each containing its own header and a text.
The text may be in the form of:
<facsimile>
: pictures of pages
<sourceDoc>
: a pure transcription, or
<text>
: an edited document
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!-- required -->
</teiHeader>
<facsimile>
<!-- optional-->
</facsimile>
<sourceDoc>
<!-- optional -->
</sourceDoc>
<text>
<!-- required if no facsimile or sourceDoc-->
</text>
</TEI>
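The skeleton above leaves the header empty. As a minimal sketch of what a complete document might look like, the following fills the header with the three parts of <fileDesc> that the TEI requires. The title is taken from the Thamarāt al-Funūn front page shown earlier; the wording of the publication and source statements is a placeholder assumption, not a recommended formula.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Thamarāt al-Funūn #1683, 27 July 1908</title>
      </titleStmt>
      <publicationStmt>
        <!-- placeholder wording -->
        <p>Unpublished teaching example.</p>
      </publicationStmt>
      <sourceDesc>
        <!-- placeholder wording -->
        <p>Transcribed from the printed issue.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><!-- transcription goes here --></p>
    </body>
  </text>
</TEI>
```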
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!-- required -->
</teiHeader>
<TEI>
<!-- required -->
</TEI>
<!-- More <TEI> elements -->
</teiCorpus>
The <text> element
What is a text? (remember that one?)
<front>
: optional front matter
<body>
: (required)
<back>
: optional back matter
A simple document:
<text>
<front>
<!-- optional -->
</front>
<body>
<!-- required -->
</body>
<back>
<!-- optional -->
</back>
</text>
Newspaper issues are usually grouped into volumes (or years). If we consider them as a single composite text, we could treat each issue as a <div> within it. Or (even better) we could use the <group> element:
<text n="35" xml:id="v35" xml:lang="ar">
<front>
<!-- some introductory material for the current volume -->
</front>
<group>
<text n="1869" xml:id="v35-i1869" xml:lang="ar">
<front>
<!-- the masthead of issue 1869 -->
</front>
<body>
<!-- issue 1869 -->
</body>
</text>
<text n="1870" xml:id="v35-i1870" xml:lang="ar">
<front>
<!-- the masthead of issue 1870 -->
</front>
<body>
<!-- issue 1870 -->
</body>
</text>
</group>
<back>
<!-- volume index, appendices etc. -->
</back>
</text>
Each identifiable division within <text> is a <div> element. It can optionally be given a particular type (e.g. cartoon, verse, prose), using the free-text @type attribute.
For example, page 1 has two divisions:
<pb n="1"/>
<div type="article">
<p>....</p>
</div>
<div type="poem">
<head>Strange Meeting</head>
<lg>
<l>....</l>
</lg>
</div>
Because a division can start on one page and finish on another, or cross other physical boundaries, we use an empty element <pb/> to mark the boundary between pages, rather than enclosing each page in a <div type="page">.
<pb n="5"/>
<div type="article">
<p>...</p>
</div>
<div type="poem">
<head>Strange Meeting</head>
<lg> ...
<pb n="6"/>
...
</lg>
</div>
<div type="article">
<p>...</p>
</div>
<div type="postcard">
<div type="postmark">
<div type="advert">
<ab>BUY NATIONAL <lb/>WAR BONDS</ab>
</div>
<div type="dateStamp">
<dateline>
<placeName>SCARBOROUGH</placeName>
<lb/>
<time>6.30 PM</time>
<lb/>
</dateline>
</div>
<div type="advert">
<ab>BUY NATIONAL <lb/>WAR BONDS</ab>
</div>
</div>
<div type="address">
<!-- <address> here -->
</div>
<div type="prose">
<!-- text here -->
</div>
</div>
The @type attribute is used to label a particular level, e.g. as ‘part’ or ‘chapter’
The @xml:id attribute gives a particular division a unique identifier
<div>s must tessellate over the entire text
<div1>
<div2>
<!-- content -->
</div2>
<div2>
<!-- content -->
</div2>
</div1>
is valid, while
<div1>
<!-- content -->
<div2>
<!-- content -->
</div2>
<!-- content -->
</div1>
is not valid!
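One way to repair the invalid structure, sketched here on the assumption that the loose content can reasonably be treated as divisions in its own right, is to wrap each stretch of content in its own <div2> so that the subdivisions cover the whole of <div1>:

```xml
<div1>
  <div2>
    <!-- content that previously preceded the nested division -->
  </div2>
  <div2>
    <!-- content -->
  </div2>
  <div2>
    <!-- content that previously followed the nested division -->
  </div2>
</div1>
```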
<div>
<head>Preface</head>
<p>
<!-- content of the div -->
</p>
<trailer>...</trailer>
</div>
The level can be made explicit by using ‘numbered’ divs (div1, div2). Opinions vary: <div1> vs. <div n="1">
<front>, <body>, or <back> element.
The <group> element should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes.
<floatingText> contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes.
The <floatingText> element can appear within any division level element in the same way as a paragraph.
<p>She was thus ruminating, when a Gentleman enter'd the Room, the Door being a jar... calling for a Candle, she beg'd a thousand Pardons, engaged him to sit down, and let her know, what had so long conceal'd him from her Correspondence. </p>
<pb n="5"/>
<floatingText>
<body>
<head>The Story of <hi>Captain Manly</hi></head>
<p>
<!-- Captain Manly's story here -->
</p>
</body>
</floatingText>
<pb n="37"/>
<p>The Gentleman having finish'd his Story ...
<!-- more -->
</p>
The order of XML encoding does not necessarily reflect the order of the source document. Compare:
<div type="postcard">
<div type="address">
<!-- <address> here -->
</div>
<div type="prose">
<!-- text here -->
</div>
<div type="postmark">
<div type="advert">
<ab>BUY NATIONAL <lb/>WAR BONDS</ab>
</div>
<div type="dateStamp">
<dateline>
<placeName>SCARBOROUGH</placeName>
<lb/><time>6.30 PM</time>
<lb/>
</dateline>
</div>
<div type="advert">
<ab>BUY NATIONAL <lb/>WAR BONDS</ab>
</div>
</div>
</div>
The core module of the TEI groups together elements which may appear in any kind of text and the tags used to mark them in all TEI documents. This includes:
<p>
: paragraph; marks paragraphs in prose
<p> can contain all the phrase-level elements in the core
<p> can appear directly inside <body> or inside <div>
Example
<p>ترجمة التلغراف السامي الوارد من مقام الصدارة العظمى
<lb/><quote>صدرت ارادة حضرة صاحب الخلافة العظمى
<lb/>بان يدعى الى الاجتماع مجلس المبعوثان المبينة كيفية
<lb/>تشكيله في القانون الاساسي الذي هو من تأسيس
<lb/>حضرة الخليفة الاعظم وبما انه ابلغ حكم هذه الارادة
<lb/>السنية الجليل الى جميع الولايات الشاهانية
<lb/>المتصرفيات غير الملحقة فعليكم باجراء انتخاب اعضاء
<lb/>حائزين الضفات المندرجة في <rs>القانون المذكور</rs></quote>
في <date>١٠ تموز سنة ١٣٢٤</date></p>
By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings. For words and phrases which are:
<hi>
: general purpose highlighting;
<distinct>
: linguistically distinct
<emph>, <mentioned>, <soCalled>, <term> and <gloss>
Example
<calendar xml:id="cal_islamic">
<p>Islamic <hi rend="italics">hijrī</hi> calendar: lunar calendar beginning the Year with 1 <hi rend="italics">Muḥarram</hi>. Dates differ between locations as the beginning of the month is based on sightings of the new moon.</p>
<p>E.g.
<date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1841-05-23" when-custom="1257-04-01">1 Rab II 1257, Sunday</date>,
<date calendar="#cal_islamic" datingMethod="#cal_islamic" when="1908-03-05" when-custom="1326-02-01">1 Ṣaf 1326,
Thursday</date>.
</p>
</calendar>
Quotation marks can be used to set off text for many reasons, so the TEI has the following elements:
<q>
: indicated by quotation marks
<said>
: speech or thought
<quote>
: passage attributed to an external source
<cit>
: groups a quotation and citation
<bibl> is used to give the source of a quote
Example
<quote>
<bibl>ترجمة التلغراف السامي الوارد من مقام الصدارة العظمى</bibl>
<lb/>صدرت ارادة حضرة صاحب الخلافة العظمى
<lb/>بان يدعى الى الاجتماع مجلس المبعوثان المبينة كيفية
<lb/>تشكيله في القانون الاساسي الذي هو من تأسيس
<lb/>حضرة الخليفة الاعظم وبما انه ابلغ حكم هذه الارادة
<lb/>السنية الجليل الى جميع الولايات الشاهانية
<lb/>المتصرفيات غير الملحقة فعليكم باجراء انتخاب اعضاء
<lb/>حائزين الضفات المندرجة في <rs>القانون المذكور</rs>
<bibl>في <date>١٠ تموز سنة ١٣٢٤</date></bibl>
</quote>
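The example above places the two <bibl> elements inside the <quote>. As an alternative sketch, assuming one prefers to keep the quoted words and their source statements as siblings, the same material could be grouped with <cit>:

```xml
<cit>
  <bibl>ترجمة التلغراف السامي الوارد من مقام الصدارة العظمى</bibl>
  <quote>صدرت ارادة حضرة صاحب الخلافة العظمى
    <!-- remainder of the telegram as above -->
  </quote>
  <bibl>في <date>١٠ تموز سنة ١٣٢٤</date></bibl>
</cit>
```

Which of the two encodings is preferable is an editorial decision; both keep the quotation and its attribution together.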
<list>
: a sequence of items forming a list
<item>
: one component of a list
<label>
: label associated with an item
<p>
<hi>To which is added,</hi> A Collection of LETTERS of Friendship, and other Occasional LETTERS, written by
<list>
<item>
Mr.
<hi>Dryden,</hi></item>
<item>Mr.
<hi>Wycherly,</hi></item>
<item>Mr.—</item>
<item>Mr.
<hi>Congreve,</hi></item>
<item>Mr.
<hi>Dennis,</hi>
and other Hands.</item>
</list>
</p>
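The list above uses only <item>. A minimal sketch of a list that pairs each <item> with a <label> could gloss the two abbreviations found in newspaper datelines, "sharqī" (eastern) and "gharbī" (western); the choice of type="gloss" is an assumption about how one might classify such a list.

```xml
<list type="gloss">
  <label>ش</label>
  <item>شرقي</item>
  <label>غ</label>
  <item>غربي</item>
</list>
```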
<note>
: contains a note or annotation
Example:
<note place="foot">Painted by <persName>John Singer Sargent</persName>, 1.918</note>
<choice> and friends
<choice>
: groups alternative editorial encodings
<sic>
: apparent error
<corr>
: corrected error
<orig>
: original form
<reg>
: regularized form
<abbr>
: abbreviated form
<expan>
: expanded form
<dateline xml:lang="ar">
<date calendar="#cal_julian" datingMethod="#cal_julian" when="1908-07-27" when-custom="1908-07-14">١٤ تموز
<choice>
<abbr>ش</abbr>
<expan>شرقي</expan>
</choice>
</date>
<date>٢٧
<choice>
<abbr>غ</abbr>
<expan>غربي</expan>
</choice> سنة ١٩٠٨</date>
</dateline>
Consider: “Excuse me sir, but would you like to buy a nice little dawg?”
We can:
use <orig> to show that “dawg” is what it says, even though this is a nonstandard spelling
use <reg> to show that “dog” is an editorially-supplied regularisation of what it says
use the <choice> element to say either is a valid encoding
Example:
...a nice little <choice><orig>dawg</orig><reg>dog</reg></choice>?
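The same pattern works for the <sic> and <corr> pair. The sentence below is a hypothetical illustration with an invented misspelling, not taken from any source discussed here:

```xml
<!-- hypothetical sentence; the misspelling is invented for illustration -->
...please <choice><sic>recieve</sic><corr>receive</corr></choice> my thanks.
```

As with <orig> and <reg>, wrapping both readings in <choice> lets a stylesheet display either the source's spelling or the editor's correction.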
<add>
: addition to the text, e.g. marginal gloss
<del>
: phrase marked as deleted in the text
<gap>
: indicates point where material is omitted
<unclear>
: contains text unable to be transcribed clearly
<p><add place="left">My </add>
<del rend="stroked">It's </del>
<add place="above">
<del rend="stroked">The </del>
</add>subject <del rend="stroked">of</del> is War, and the
<unclear>pity </unclear>
of <del rend="stroked">it</del> War.
<lb/>The Poetry is in the pity.</p>
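The example above does not use <gap>. As a sketch only, and supposing (purely for illustration) that the last word of the final line were illegible in the source, the omission could be recorded like this; the attribute values are assumptions:

```xml
<!-- invented damage: the source line is in fact fully legible -->
<p>The Poetry is in the <gap reason="illegible" quantity="4" unit="char"/>.</p>
```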
<name>
: a name in the text, contains a proper noun or noun phrase
<rs>
: a general-purpose name or referencing string
The @type attribute is useful for categorizing these, and they both also have @key, @ref, and @nymRef attributes.
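A short sketch of how the two elements can work together, reusing phrases from the telegram transcribed earlier: the named law is tagged with <name>, and the later phrase "the aforementioned law" is tagged with <rs> and pointed back at it with @ref. The identifier qanun_asasi and the value type="law" are hypothetical choices made for this illustration.

```xml
<p>... المبينة كيفية تشكيله في
  <name type="law" xml:id="qanun_asasi">القانون الاساسي</name> ...
  حائزين الضفات المندرجة في
  <rs type="law" ref="#qanun_asasi">القانون المذكور</rs></p>
```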
<email>
: an electronic mail address
<address>
: a postal address
<addrLine>
: a non-specific address line
<street>
: a full street address
<postCode>
: a postal (or zip) code
<postBox>
: a postal box number
<name> can also be used
<num>
: marks a number of any sort
<measure>
: marks a quantity or commodity
<measureGrp>
: groups specifications relating to a single object
<num> has simple @type and @value attributes, <measure> has @type, @quantity, @unit and @commodity attributes
Example: numbers and measures
<l>With a <num value="1000">thousand</num> pains that vision's face was grained;</l>
... only <measure type="distance" unit="m" quantity="3218.69">two miles</measure> from the front....
<date>
: contains a date in any format and includes a @when attribute for a regularised form and a @calendar attribute to specify the calendar system used
<time>
: contains a time in any format and includes a @when attribute for a regularised form
Example
<date when="1917-07">July 1917.<lb/> Wednesday</date>
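The <time> element can be regularised in the same way. As a sketch, the postmark time from the postcard example could carry a @when value in 24-hour notation:

```xml
<time when="18:30:00">6.30 PM</time>
```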
<ptr>
: defines a pointer to another location
<ref>
: defines a reference to another location, with optional linking text
Example
See <ref target="#Section12">section 12 on page 34</ref>.
See <ptr target="#Section12"/>.
The <ref target="http://www.bbc.co.uk/">BBC web site</ref> has a good sports section
<index> (marks an index entry) with optional @indexName attribute
The <term> element is used to mark a term inside an <index> element
The <index> element can self-nest for hierarchical index entries
Example
<p>Last week I wrote (to order) a strong <lb/>bit of Blank<index>
<term>Verse</term>
<index>
<term>Blank Verse</term>
</index>
</index>:</p>
<graphic>
: indicates the location of an inline graphic, illustration, or figure
<binaryObject>
: encoded binary data embedding a graphic or other object
<figure> and <figDesc> for more complex graphics
Example
<div type="article" xml:lang="ar">
<head>تعريب الفرمان العالي السلطاني</head>
<figure>
<graphic url="#facs-2-1-z-1"/>
<head xml:lang="en">The Ottoman Tughra</head>
<figDesc>Reproduction of the Ottoman coat of arms / Sultanic seal</figDesc>
</figure>
<q>افتخار الاعلام والاعظام مختار الاكابر والافخم مستجمع جميع المعالي</q>
</div>
Ṭughrā at the head of the Qānūn al-Asāsī in Thamarāt al-Funūn, 27 July 1908
<lg type="stanza">
<l>It seemed that out of battle I escaped</l>
<l>Down some profound dull tunnel, long since scooped</l>
<l>Through granites which titanic wars had groined.</l>
</lg>
<lg type="stanza">
<l>Yet also there encumbered sleepers groaned, </l>
<l>Too fast in thought or death to be bestirred. </l>
<l>Then, as I probed them, one sprang up, and stared </l>
<l>With piteous recognition in fixed eyes, </l>
<l>Lifting distressful hands, as if to bless. </l>
<l>And by his smile, I knew that sullen hall,--- </l>
<l>By his dead smile I knew we stood in Hell.</l>
</lg>
And now we’re going to move on to another exercise where you get to apply some of the more structural elements you have learned about.