A workflow to transform Arabic classical works in printed form to structured text.

Status: Ongoing
Duration: 2020 - 2021

The FUSUS project was initiated by Cornelis van Lit, Veni researcher at Utrecht University, founder of the website The Digital Orientalist, and author of the book Among Digitized Manuscripts. He describes the goal of this project as follows:

“Medieval Arabic texts, especially those from intellectual history (philosophy, natural theology, theoretical mysticism) are sorely underrepresented in current digital text databases. There are, however, many (critical or not so critical) editions available of these texts. We therefore wanted to advance the use of printed editions to automatically create digital texts. Commentary writing is the practice of taking an earlier text and interspersing it with additional text. A commentary tradition is the standardization of such a practice on a given earlier text. This phenomenon is widely perceived in late medieval and early modern Islamic intellectual history. Ibn Arabi’s Fusus al-hikam is an example of such a source text, on which dozens, perhaps hundreds, of commentaries were written through the centuries. We wish to revive our understanding of this corpus. The base text and fourteen commentaries have been edited and published. Their thousands of pages, millions of words, can quickly blur into each other one by one, as differences between commentaries can be narrowly small. We propose to turn this into a digital corpus, which, we believe, can be achieved within a small timespan and simply with technology currently available.”

On behalf of DANS, Dirk Roorda worked in close cooperation with Cornelis van Lit on an OCR pipeline that converts printed Arabic pages into structured text data. A further challenge was extracting orderly text from an idiosyncratic PDF. An overview of the problems encountered and their solutions can be found in a technical report in release 0.5 of the GitHub repository among/fusus, the interim result of this project. The repository holds not only the code but also the output data, extensive documentation, and a collection of Jupyter notebooks. The contribution of DANS was funded by the Innovation fund for IT in research projects of Utrecht University.

Cornelis continues the research, and the project continues to grow in the same repository, which is now also archived on Zenodo and in the Software Heritage Archive. The preliminary results have also been converted to Text-Fabric, and new and improved results will be added to the repository. This offers an interesting way to use Fusus in Digital Humanities research and education.

