Computational Approaches to Mundane Texts in Under-Resourced Languages

The Case of Arabic Periodicals

Till Grallert

Scholarly Makerspace

Humboldt-Universität zu Berlin, Universitätsbibliothek, Grimm-Zentrum

Combining Social and Cultural with Digital History: Methodological Challenges and Practical Strategies

2023-03-07

https://tillgrallert.github.io/slides/dh/2023-bochum/

Background

Arabic periodicals

  • Periodical press as agent of change
    • first mass medium
    • central medium of the literary and cultural Arabic renaissance (nahḍa)
    • medium of linguistic change
    • central forum for negotiations over modernity, nationalism, Islamism etc.
  • Periodicals as source but not a subject
  • Research is dominated by
    • national(ist) narratives
    • bias towards two places and a small number of titles
    • implicit hypotheses
Figure 1: Distribution of new Arabic periodical titles, 1799–1929

Research interest: intellectual networks

Figure 2: Undirected network of authors in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas. Colour of nodes: betweenness centrality; size of nodes: number of periodicals; width of edges: number of articles.

Aims

  • empirical testing of hypotheses
  • evaluate existing literature

Observations

  • very limited overlap between periodicals from the same place
  • core network (14 of 319 nodes):
    • absent from the literature
    • surprising composition: many Iraqis (6), few Syrians (2), few Christians (2)

Research interest: intellectual networks

Figure 3: Directed network of periodical titles mentioned in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas; weighted by issues. Colour and size of nodes: in-degree.
Figure 4: Map of publication places for the periodical titles mentioned in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas

Data requirements

modelled text

  • e.g. “The newspaper al-ʿAṣr al-Jadīd from Damascus reported in its last issue that …”
  • (semi)automated extraction based on
    • named entity recognition (NER)
  • problems
    • state of OCR/HTR
    • state of Layout recognition
    • state of NER

structured bibliographic metadata

  • e.g. “Sātisnā dispatched this report from al-Shahbāʾ”
  • (semi)automated extraction based on
    • presence of information in the material artefact
    • a modelled digital surrogate
  • problems
    • absence of explicit information

authority files / norm data

  • Sātisnā
    • Pseudonym and anagram of Anastās al-Karmalī, editor of Lughat al-ʿArab in Baghdad
  • al-Shahbāʾ, “the grey”
    • one of the epithets of Aleppo
    • geo coordinates: 36.20124, 37.16117
  • problems
    • bias towards the Global North in form and content

Closing some <gap/>(s)

Project Jarāʾid (2012–)
Closing the knowledge <gap/>

  • Bibliographic record of all Arabic periodical titles published between 1798 and 1929
    • websites and open datasets (TEI XML) for more than 3500 periodicals
    • additional authority files for c.2700 persons, 220 places, 180 libraries
  • Unfunded collaboration with Adam Mestyan (Duke), “crowd”-sourcing
  • Ongoing since 2021/22: Integration of holding information from library catalogues such as ZDB, AUB, BnF, HathiTrust
Figure 5: Periodicals by places of publication. Size of circles corresponds to the number of periodicals. Colour indicates the collection status: known (green), digitised (blue), remainder (red).

Collection and digitisation biases

Figure 6: Periodicals and their holding institutions
Table 1: Periodical holdings and digitization
periodicals             | –1918 | –1929
published               | 2054  | 3550
known holdings          | 540   | 775
  % of total            | 26.29 | 21.83
digitized               | 156   | 233
  % of total            | 7.59  | 6.56
multiple digitizations  | 51    | 66
  % of total            | 2.48  | 1.86
  % of digitized        | 32.69 | 28.33
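
The percentages in Table 1 can be re-derived from the raw counts. The following is a minimal sketch: the counts are copied from the table, while the dictionary layout and the function name `share` are assumptions made for illustration only.

```python
# Counts copied from Table 1; the data layout is an illustrative assumption.
counts = {
    "-1918": {"published": 2054, "known": 540, "digitized": 156, "multiple": 51},
    "-1929": {"published": 3550, "known": 775, "digitized": 233, "multiple": 66},
}


def share(part, whole):
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


shares = {}
for period, c in counts.items():
    shares[period] = {
        "known % of total": share(c["known"], c["published"]),
        "digitized % of total": share(c["digitized"], c["published"]),
        "multiple % of total": share(c["multiple"], c["published"]),
        "multiple % of digitized": share(c["multiple"], c["digitized"]),
    }

for period, values in shares.items():
    print(period, values)
```

Both the holdings rate and the digitisation rate drop when the period is extended to 1929, because the number of published titles grows faster than the number of known or digitised ones.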

The digitisation bias compared

Table 2: Comparison of digitized periodicals between the Global South and the Global North
            | Arabic periodicals (1798–1918)  | WWI as mirrored by Hessian regional papers
community   | c. 420 million Arabic speakers  | c. 6.2 million inhabitants
periodicals | 2054 newspapers and journals    | 125 newspapers
digitized   | 156 periodicals                 | 125 newspapers with more than 1.5 million pages
type        | mostly facsimiles               | facsimiles and full text
access      | paywalls, geo-fencing           | open access
interface   | mostly foreign languages only   | local and foreign languages

Accessibility

Catalogue searches

No Arabic script

Figure 7: ZDB search for “الجنة”

Which Latinized transcription was used?

Figure 8: ZDB search for “al-Ǧanna”

What are the normalization rules for the search algorithm?

Figure 9: ZDB search for “Ganna”
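
The catalogue's actual normalisation rules are not documented here, so the following is only a sketch of what such rules might look like. It assumes that a search index folds Latin diacritics via Unicode decomposition and drops the Arabic article "al-". Both function names and the article rule are hypothetical and are not taken from ZDB.

```python
import unicodedata


def fold_diacritics(text):
    """Decompose characters (NFD), drop combining marks, and casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search_key(text):
    """Hypothetical search key: folded form without a leading article 'al-'."""
    key = fold_diacritics(text)
    return key[3:] if key.startswith("al-") else key


queries = ["al-Ǧanna", "al-Ganna", "Ganna"]
keys = {query: search_key(query) for query in queries}
print(keys)
```

Under these assumed rules all three Latin-script queries collapse to the same key, whereas a query in Arabic script would still require an explicit mapping between the script and its transcription.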

Accessibility

Interfaces

Figure 10: Interface of the Translatio project (Bonn). Facsimile of Arabic original on the left. Yellow = English UI; purple = Arabic metadata in DMG transcription; green = German metadata

Accessibility

cataloging rules and algorithmic copyright detection cause further inaccessibilities

Figure 11: al-Muqtabas 6 on HathiTrust (Original in Princeton) outside the USA
Figure 12: The page from fig. 11 with a US IP address

Accessibility

Text layers

For old prints, there’s […] kraken/calamari for coders, Transkribus if you’ve got money and just want to have the results[,] and OCR-D if you’ve got an IT department.

(Winkler Mastodon post 2023)

  • Unstructured text, no APIs, proprietary interfaces
  • Algorithms and evaluation are kept secret
    • unknown numbers of false positives and false negatives
al-Muqtabas 6 on HathiTrust, quality of the OCR layer (requires US IP)
al-Bashīr 9 Jan. 1880 (#487), p.1 on GPA, quality of the OCR layer

Closing the infrastructural <gap/>
Open Arabic Periodical Editions (OpenArabicPE, 2015–)

approach

  • combine available facsimiles and transcriptions into an open standard format
  • scrape, generate, and share open bibliographic metadata
  • with the affordances of the Global South

aims

  • validate available transcriptions for scholarly (re)use
  • develop an open infrastructure of models, workflows, authority files

principles

  • established tools and technologies
  • as few as possible, open, and simple tools and formats
  • free-to-use platforms without lock-in

OpenArabicPE

Infrastructure

  1. Digital scholarly editions, authority files: TEI/XML.
  2. Open licenses: CC BY-SA 4.0 (TEI, MODS, BibTeX), MIT license (XSLT, XQuery)
  3. Social digital editions, hosted on GitHub:
  4. Archived on Zenodo: DOI for stable citability
  5. Static webviews: parallel display of text and facsimile
  6. Bibliographic metadata, hosted through a Zotero group.

OpenArabicPE

Corpus

Title           | Place           | Proprietor                  | DOI                    | Volumes | Issues | Articles | Words
al-Ḥaqāʾiq      | Damascus        | Abd al-Qādir al-Iskandarānī | 10.5281/zenodo.1232016 | 3       | 35     | 389      | 298090
al-Ḥasnāʾ       | Beirut          | Niqūlā Bāz                  | 10.5281/zenodo.3556246 | 1       | 12     | 201      | NA
al-Manār        | Cairo           | Muḥammad Rashīd Riḍā        |                        | 35      | 537    | 4300     | 6144593
al-Muqtabas     | Cairo, Damascus | Muḥammad Kurd ʿAlī          | 10.5281/zenodo.597319  | 9       | 96     | 2964     | 1981081
al-Ustādh       | Cairo           | Abdallāh Nadīm al-Idrīsī    | 10.5281/zenodo.3581028 | 1       | 42     | 435      | 221447
al-Zuhūr        | Cairo           | Anṭūn al-Jumayyil           | 10.5281/zenodo.3580606 | 4       | 39     | 436      | 292333
Lughat al-ʿArab | Baghdad         | Anastās Mārī al-Karmalī     | 10.5281/zenodo.3514384 | 3       | 34     | 939      | 373832
total           |                 |                             |                        | 56      | 795    | 9664     | 9311376

Attempts at a digital/computational history

SIHAFA

aims

  • systematic analysis of the late Ottoman Arabic press at scale
  • development and evaluation of computational methods
  • challenging established narratives
  • establishing “Arab Periodical Studies”

questions

  • who are the central actors (people, periodicals) in this discursive field?
  • how were periodicals produced and how to conceptualise “authorship”?
  • what is the role of text reuse and how did texts, topics, and genres travel?
  • how did the language of modernity develop in this multilingual, imperial space?

methods

  • social network analysis (SNA)
  • stylometric authorship attribution
  • historical GIS
  • layout analysis
  • topic modelling

1. Historical GIS

1. Historical GIS: typology of periodicals

Hypothesis: distribution of geographic origin of contributions to a periodical is an indicator for its importance

trans-regional

Figure 14: Places in bylines from al-Muqtabas (Cairo and Damascus)

regional

Figure 15: Places in bylines from al-Ḥasnāʾ (Beirut)

local

Figure 16: Places in bylines from al-Ḥaqāʾiq (Damascus)
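
The slides do not state how the geographic spread of contributions is measured. One possible building block, offered here purely as an assumption, is the great-circle distance between two geo-referenced places. The sketch below uses the haversine formula with the coordinates recorded in these slides for Ṣaydā and Aleppo; the function name and the choice of formula are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Coordinates as given in the authority data shown in these slides.
sayda = (33.55751, 35.37148)
aleppo = (36.20124, 37.16117)

distance = haversine_km(sayda, aleppo)
print(f"{distance:.0f} km")
```

Distances from the place of publication to each byline location could then be summarised per periodical; any threshold separating "local" from "regional" would be a further assumption.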

Historical GIS

Required data

  • Initial sources: OpenArabicPE
  • Marked-up place names in the modelled text
    • bylines
    • reviews
    • problems: no functional NER for late Ottoman Arabic
  • authority files for disambiguation and enriching data
    • geo-referenced places
    • problems: lack of historical gazetteers
  • Byline: Maryam Zakā from Sayda
<byline>
    <placeName ref="oape:place:9 geon:268064">صيدا</placeName>
    <persName ref="oape:pers:2845">مريم زكا</persName>
</byline>
  • Gazetteer entry for Sayda
<place type="town" xml:id="place_9">
    <placeName type="simple">Saida</placeName>
    <placeName xml:lang="ar-Latn-x-ijmes">Ṣaydā</placeName>
    <placeName xml:lang="en">Sidon</placeName>
    <placeName xml:lang="ar">صيدا</placeName>
    <location>
        <geo>33.55751, 35.37148</geo>
    </location>
    <idno type="url">http://en.wikipedia.org/wiki/Sidon</idno>
    <idno type="geon">268064</idno>
    <idno type="oape">9</idno>
</place>
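
How the `ref` attribute in a byline links to the gazetteer can be sketched with the standard library. This is a minimal illustration, not the project's own tooling: the wrapping `<listPlace>` element, the shortened gazetteer entry, and all function names are assumptions, and the samples carry no TEI namespace, as in the slides.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

BYLINE = """<byline>
    <placeName ref="oape:place:9 geon:268064">صيدا</placeName>
    <persName ref="oape:pers:2845">مريم زكا</persName>
</byline>"""

# Shortened gazetteer entry; <listPlace> is an assumed wrapper.
GAZETTEER = """<listPlace>
<place type="town" xml:id="place_9">
    <placeName xml:lang="en">Sidon</placeName>
    <placeName xml:lang="ar">صيدا</placeName>
    <location><geo>33.55751, 35.37148</geo></location>
    <idno type="geon">268064</idno>
    <idno type="oape">9</idno>
</place>
</listPlace>"""


def oape_place_id(ref):
    """Extract the local id from a ref such as 'oape:place:9 geon:268064'."""
    for token in ref.split():
        if token.startswith("oape:place:"):
            return token.rsplit(":", 1)[1]
    return None


def build_index(gazetteer_xml):
    """Map each oape id to coordinates and language-tagged names."""
    index = {}
    for place in ET.fromstring(gazetteer_xml).iter("place"):
        oape = place.find("idno[@type='oape']")
        geo = place.find("location/geo")
        if oape is None or geo is None:
            continue
        lat, lon = (float(value) for value in geo.text.split(","))
        names = {n.get(XML_NS + "lang"): n.text for n in place.findall("placeName")}
        index[oape.text] = {"lat": lat, "lon": lon, "names": names}
    return index


def resolve_byline(byline_xml, index):
    """Return the author string and the gazetteer entry for a byline."""
    byline = ET.fromstring(byline_xml)
    place = byline.find("placeName")
    person = byline.find("persName")
    return {"person": person.text, "place": index.get(oape_place_id(place.get("ref")))}


resolved = resolve_byline(BYLINE, build_index(GAZETTEER))
print(resolved)
```

Keeping the identifier in the text and the coordinates in the authority file means that a correction to the gazetteer propagates to every byline that points to it.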

2. Social network analysis

2. Social network analysis: referenced periodicals

Figure 17: Directed network of periodical titles mentioned in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas; weighted by issues. Colour and size of nodes: in-degree.

aim

  • empirical testing of hypotheses
  • guidance for future digitisation efforts

results

  • mainly self-referential
  • typology: extent of outward looking
  • core network (44 of 465)
    • surprising members
    • highly concentrated on a few locations
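
The two measures named in the figure caption, edges weighted by issues and in-degree, can be sketched as follows. The mention records use placeholder titles "A", "B", and "C" and are not real data. The data layout and the function names are assumptions; the actual analysis may define the weights differently.

```python
from collections import defaultdict

# Placeholder records: (citing periodical, issue number, mentioned periodical).
mentions = [
    ("A", 1, "B"), ("A", 1, "B"), ("A", 2, "B"), ("A", 2, "C"),
    ("B", 1, "C"), ("C", 1, "C"),
]


def weighted_edges(records):
    """Weight each directed edge by the number of distinct citing issues."""
    issues = defaultdict(set)
    for source, issue, target in records:
        issues[(source, target)].add(issue)
    return {edge: len(seen) for edge, seen in issues.items()}


def in_degree(edges, include_self=False):
    """Count the distinct periodicals that mention each node."""
    sources = defaultdict(set)
    for source, target in edges:
        if include_self or source != target:
            sources[target].add(source)
    return {node: len(found) for node, found in sources.items()}


edges = weighted_edges(mentions)
degrees = in_degree(edges)
print(edges)
print(degrees)
```

Self-references are excluded from the in-degree by default, so that a mainly self-referential periodical does not inflate its own score.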

Social network analysis: referenced periodicals

Required data

  • Initial sources: OpenArabicPE, Project Jarāʾid, OCR
  • Mark-up of all references to periodicals in the modelled text:
    • semi-automatic (regex): tracks pattern of “مجلة …”, “جريدة …”
    • problems: lack of functional NER for Arabic
  • authority files for disambiguation and enriching data
    • bibliography
    • problems: absence from existing authority files
  • the journal al-Zuhūr from Cairo
والأصح الدرعية بلام التعريف (راجع <bibl subtype="journal" type="periodical">مجلة <title level="j" ref="oape:bibl:3 oclc:1034545644">الزهور</title> المصرية  <biblScope unit="volume" from="2" to="2">٢</biblScope> :  <biblScope unit="page" from="292">٢٩٢</biblScope></bibl>)
  • the newspaper al-Zuhūr from Baghdad
وانتخب <persName>فؤاد أفندي الدفتري البغدادي</persName> و<bibl><editor><persName>نوري أفندي</persName></editor> راس كتاب <textLang otherLangs="ota">القسم التركي</textLang> في <bibl type="periodical" subtype="newspaper">جريدة <title ref="oape:bibl:532">الزهور</title></bibl> البغدادية</bibl> نائبين عن <placeName ref="oape:place:372 geon:94824">كربلاء</placeName>.
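
The regex step can be sketched as below, using the two passages above with their markup removed. The pattern is an assumption for illustration and captures only the first word after the keyword, so multi-word titles would need further rules. The subtype labels follow the `subtype` attributes in the samples.

```python
import re

# Keyword (journal or newspaper), whitespace, then the next word as candidate title.
PERIODICAL = re.compile(r"(مجلة|جريدة)\s+(\S+)")
SUBTYPES = {"مجلة": "journal", "جريدة": "newspaper"}


def find_candidates(text):
    """Return (subtype, candidate title) pairs for each keyword match."""
    return [(SUBTYPES[m.group(1)], m.group(2)) for m in PERIODICAL.finditer(text)]


samples = [
    "راجع مجلة الزهور المصرية ٢ : ٢٩٢",
    "في جريدة الزهور البغدادية",
]

candidates = [find_candidates(sample) for sample in samples]
print(candidates)
```

Both passages yield the same title string. Only the subtype and the adjective of origin distinguish the Cairo journal from the Baghdad newspaper, which is why the matches have to be disambiguated against an authority file.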

2. Social network analysis: authors

Figure 18: Undirected network of authors in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas. Colour of nodes: betweenness centrality; size of nodes: number of periodicals; width of edges: number of articles.

aim

  • empirical testing of hypotheses
  • guidance for close reading

results

  • very limited overlap between periodicals from the same place
  • core network (14 of 319 nodes):
    • absent from the literature
    • surprising composition: many Iraqis (6), few Syrians (2), few Christians (2)

Social network analysis: authors

Required data

  • Initial data: OpenArabicPE, Project Jarāʾid
  • structured bibliographic data
    • semi-automatic on the basis of the editions
    • manual recording
    • problems: many acronyms, multiple name forms
  • authority files for disambiguation and enriching data
    • life dates
    • works in library catalogues
    • problems: absence from existing authority files
  • personography entry for Père Anastase-Marie de Saint-Elie, who was mostly referenced as Sātisnā in our sources
<person>
    <persName><roleName type="pseudonym">ساتسنا</roleName></persName>
    <persName><roleName type="pseudonym">أمكح</roleName></persName>
    <persName><roleName type="pseudonym">فهر الجابري</roleName></persName>
    <persName><roleName type="rank">الأب</roleName> <forename>أنستاس</forename> <forename>ماري</forename> <surname><addName type="nisbah">الكرملي</addName></surname></persName>
    <persName><forename>أنستاس</forename> <forename>ماري</forename> <addName type="nisbah">الألياوي</addName> <surname><addName type="nisbah">الكرملي</addName></surname></persName>
    <persName><forename>بطرس</forename> <addName type="nasab">بن <forename>جبرائيل</forename></addName> <forename>يوسف</forename> <surname>عواد</surname></persName>
    <idno type="VIAF">39370998</idno>
    <idno type="oape">227</idno>
    <idno type="wiki">Q4751824</idno>
    <birth><date source="viaf" when="1866-08-05">1866-08-05</date> in <placeName ref="oape:place:216 geon:98182">Baghdad</placeName></birth>
    <death><date source="viaf" when="1947-01-07">1947-01-07</date> in <placeName ref="oape:place:216 geon:98182">Baghdad</placeName></death>
</person>
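
A personography entry of this kind lets every recorded name form resolve to the same identifiers. The sketch below uses a shortened copy of the entry; the function names and the idea of a flat lookup table are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the personography entry above.
PERSON = """<person>
    <persName><roleName type="pseudonym">ساتسنا</roleName></persName>
    <persName><roleName type="pseudonym">أمكح</roleName></persName>
    <persName><roleName type="rank">الأب</roleName> <forename>أنستاس</forename> <forename>ماري</forename> <surname><addName type="nisbah">الكرملي</addName></surname></persName>
    <idno type="VIAF">39370998</idno>
    <idno type="oape">227</idno>
</person>"""


def name_forms(person):
    """Flatten each persName to a whitespace-normalised string."""
    forms = []
    for pers_name in person.findall("persName"):
        text = " ".join("".join(pers_name.itertext()).split())
        if text:
            forms.append(text)
    return forms


def build_lookup(person_xml):
    """Map every name form of one person to that person's identifiers."""
    person = ET.fromstring(person_xml)
    ids = {idno.get("type"): idno.text for idno in person.findall("idno")}
    return {form: ids for form in name_forms(person)}


lookup = build_lookup(PERSON)
match = lookup.get("ساتسنا")
print(len(lookup), match)
```

With such a lookup, a byline signed with a pseudonym and one signed with the full name count as the same node in the author network.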

Network of authors maps only the tip of the iceberg

state of research

  • Commonly ignored in scholarship
  • Common implicit hypothesis: the publisher-cum-editor wrote all articles in “his” periodical

About 4/5 of all articles or 2/3 of all words carry no byline

challenges

  • hypothesis is implausible and untested
  • we do not know the names of all potential authors
  • stylometric authorship attribution was untested for 19th century Arabic (Romanov and Grallert “Parameters for Stylometric Authorship Attribution” 2022)

3. Stylometric authorship attribution


Authorship signal is prevalent in most frequent words, i.e. function words

comparative method

  • steps:
    1. compute frequencies for every text
    2. compare every text with every text
    3. validate through voting (consensus) of multiple iterations
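
The three steps can be sketched on toy data. Note that this sketch uses Burrows's Classic Delta (mean absolute difference of z-scored word frequencies), not Eder's simple delta, and it does not reproduce stylo's implementation. The toy corpus, the text labels, and all function names are assumptions, and the texts are far too short for real attribution.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens):
    """Step 1: relative frequency of every word in one text."""
    return {word: n / len(tokens) for word, n in Counter(tokens).items()}


def most_frequent_words(corpus, n):
    """The n most frequent words across the corpus (ties broken alphabetically)."""
    totals = Counter()
    for tokens in corpus.values():
        totals.update(tokens)
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def z_scores(corpus, features):
    """Standardise each feature's relative frequency across all texts."""
    freqs = {name: relative_frequencies(tokens) for name, tokens in corpus.items()}
    table = {name: [] for name in corpus}
    for word in features:
        column = [freqs[name].get(word, 0.0) for name in corpus]
        mu, sigma = mean(column), pstdev(column)
        for name, value in zip(corpus, column):
            table[name].append((value - mu) / sigma if sigma else 0.0)
    return table


def delta(a, b):
    """Step 2: Burrows's Classic Delta between two z-score vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def consensus_edges(corpus, feature_counts):
    """Step 3: each text votes for its nearest neighbour in every iteration."""
    votes = Counter()
    for n in feature_counts:
        table = z_scores(corpus, most_frequent_words(corpus, n))
        for name in corpus:
            others = (other for other in corpus if other != name)
            nearest = min(others, key=lambda o: (delta(table[name], table[o]), o))
            votes[tuple(sorted((name, nearest)))] += 1
    return votes


# Toy corpus: two pairs of texts with similar function-word profiles.
corpus = {
    "a1": ["the"] * 5 + ["and"] * 3 + ["of"] * 1 + ["in"] * 1,
    "a2": ["the"] * 5 + ["and"] * 2 + ["of"] * 2 + ["in"] * 1,
    "b1": ["the"] * 1 + ["and"] * 1 + ["of"] * 4 + ["in"] * 4,
    "b2": ["the"] * 1 + ["and"] * 2 + ["of"] * 4 + ["in"] * 3,
}

votes = consensus_edges(corpus, feature_counts=[2, 3, 4])
print(votes)
```

Edges that collect votes in every iteration are robust to the choice of feature count; edges that appear only once or twice depend on the input parameters.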

challenges

  • novel application to Arabic and this genre
  • comparison depends on input
  • reliability depends on a minimal length of texts

Stylometry

  • In R with the stylo package (Eder, Rybicki, and Kestemont “Stylometry with R” 2016)
  • Based on parameter settings established in our tests (Romanov and Grallert “Parameters for Stylometric Authorship Attribution” 2022)

stylo() settings

  • Tokens: words
  • Sampling: 2500 tokens
  • Most Frequent Features: 200–500 tokens, incremented by 100
  • Culling: 0
  • distance measure: Eder’s simple delta
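
Two of these settings translate directly into simple operations: cutting each text into samples of 2500 tokens and stepping through the most-frequent-feature counts from 200 to 500. The sketch below is not stylo code. The function names are hypothetical, and dropping a trailing remainder shorter than the sample size is an assumption of this sketch.

```python
def sample_tokens(tokens, size=2500):
    """Split a token list into consecutive, non-overlapping samples of equal size.

    A trailing remainder shorter than `size` is dropped (an assumption here).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def mfw_steps(start=200, stop=500, step=100):
    """Feature counts used across iterations, inclusive of both bounds."""
    return list(range(start, stop + 1, step))


tokens = [f"w{i}" for i in range(6000)]  # placeholder tokens
samples = sample_tokens(tokens)
steps = mfw_steps()
print(len(samples), [len(s) for s in samples], steps)
```

Texts shorter than one sample yield no samples at all, which is one way in which reliability depends on a minimal text length.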

Analysis

  • edges (and nodes) tables from stylo()
  • computing network measures with the tidygraph and igraph packages
    • centrality
    • community detection
  • plotting results with the ggraph and ggplot2 packages

owners-cum-editors as authors?

al-Muqtabas

Figure 19: Anonymous sections in al-Muqtabas and articles by potential editors, coloured by author (blue = Muḥammad Kurd ʿAlī, red = Kāẓim al-Dujaylī, green = Anastās al-Karmalī)
  • Muḥammad Kurd ʿAlī most likely not the author
  • Neither Anastās al-Karmalī nor Kāẓim al-Dujaylī from Lughat al-ʿArab is a likely author

owners-cum-editors as authors?

al-Muqtabas

Figure 20: Anonymous sections in al-Muqtabas and articles by potential editors, coloured by community

Multiple anonymous candidates?

owners-cum-editors as authors?

Lughat al-ʿArab

Figure 21: Anonymous sections in Lughat al-ʿArab and articles by potential editors, coloured by author (blue = Muḥammad Kurd ʿAlī, red = Kāẓim al-Dujaylī, green = Anastās al-Karmalī)

Authorship of Anastās Mārī al-Karmalī and Kāẓim al-Dujaylī more likely

owners-cum-editors as authors?

Lughat al-ʿArab

Figure 22: Anonymous sections in Lughat al-ʿArab and articles by potential editors, coloured by community

Authorship of Anastās Mārī al-Karmalī and Kāẓim al-Dujaylī more likely

stylistic differences between journals

Importance of genre

5 periodicals from OpenArabicPE + 6 works by one of the editors
  • very limited similarity between al-Muqtabas and its editor Muḥammad Kurd ʿAlī

Thank you!


  • Contributors to Project Jarāʾid: Hala Auji, Philippe Chevrant, Marina Demetriadou, Lamia Eid, Stacy Fahrenthold, Ulrike Freitag, Rana Issa, Nicole Khayat, Peter Magierski, Leyla von Mende, Adam Mestyan, Christian Meier, Daniel Newman, Geoffrey Roper, Sinai Rusinek, Philip Sadgrove, Ola Seif, and Rogier Visser

  • Contributors to OpenArabicPE: Jasper Bernhofer, Dimitar Dragnev, Patrick Funk, Talha Güzel, Hans Magne Jaatun, Jakob Koppermann, Xaver Kretzschmar, Daniel Lloyd, Klara Mayer, Tobias Sick, Manzi Tanna-Händel, and Layla Youssef

  • Maxim Romanov for his work on parameter testing

  • Contributors to OCR: Adam Mestyan, Sinai Rusinek

  • Links:

  • Licence: slides and plots are licensed as CC BY-SA 4.0

References

Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. “Stylometry with R: A Package for Computational Text Analysis.” The R Journal 8 (1): 107–21. https://doi.org/10.32614/RJ-2016-007.
Grallert, Till. 2021. “Catch Me If You Can! Approaching the Arabic Press of the Late Ottoman Eastern Mediterranean Through Digital History.” Edited by Simone Lässig. Geschichte und Gesellschaft 47 (1, Digital History): 58–89. https://doi.org/gkhrjr.
———. 2022. “Open Arabic Periodical Editions: A Framework for Bootstrapped Scholarly Editions Outside the Global North.” Edited by Roopika Risam and Alex Gil. Digital Humanities Quarterly 16 (2, "Minimal Computing"). http://digitalhumanities.org/dhq/vol/16/2/000593/000593.html.
Romanov, Maxim, and Till Grallert. 2022. “Establishing Parameters for Stylometric Authorship Attribution of 19th-Century Arabic Books and Periodicals.” In Book of Abstracts, 346–48. Tokyo. https://dh-abstracts.library.virginia.edu/works/11858.
Winkler, Alexander (@awinkler@openbiblio.social). 2023. Mastodon post. Mastodon. https://openbiblio.social/@awinkler/109981107178749600.