CLARIAH Tech & Data Day: FAIR data in the Dutch humanities
On 28 September, DANS hosted the second CLARIAH Tech & Data Day, which focussed on FAIR data management in the humanities. Here is a recap of the day.
The CLARIAH community is engaged in building digital infrastructure for humanities research in the Netherlands. The aim of the Tech & Data Day is to bring together scholars, research support staff and software engineers, so that they can learn from each other and enhance digital resources in the humanities. This edition focused on data management in the humanities: how to find and reuse data, and what is still needed for this in terms of technology and policy.
Jetze Touber (DANS), Data Station Manager Humanities and moderator of the day, introduced the audience to the DANS Data Station Social Sciences and Humanities (Data Station SSH). Launched in June 2023, the Data Station is still being extended and improved in various respects. The Social Sciences and Humanities metadata, for example, are still tilted towards the Social Sciences, with an emphasis on survey research. Those present were asked to perform a number of tasks in the Data Station SSH, such as searching for data and depositing a sample dataset. This prompted discussion of the prioritization of search results, the exact significance of certain metadata elements, documentation of temporal and spatial coverage, and vocabularies. The discussion provided very useful input for further improvements to search functionalities and metadata elements in the Data Station, which DANS will work on in 2024.
Presentations of the Heritage Data research projects, funded by CLARIAH, explained how datasets originating in cultural heritage institutions are being curated for compatibility with the CLARIAH infrastructure. Leon van Wissen (UvA) talked about the FAIR Photos project, which deals with the collection of a photo press agency. The extensive existing descriptions of the thousands of photos in this collection are being structured and turned into linked data, making it possible to query the collection, link it with other resources, and perform computational research on it. Ruben Peeters (UAntwerpen) presented the Tracing Wealth project, which engages with the fiscal registers summarizing the death duties to be paid for Dutch citizens who died in 1921. The central task in this project is to link testators mentioned in the tax registers to individuals occurring in civic registries, increasing the dataset’s interoperability. This opens up possibilities for research into the geographical and intergenerational distribution of wealth.
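To give an impression of what linking testators to civic registry entries can involve, here is a minimal sketch of name-based record linkage. It is not the Tracing Wealth project's actual method: the `Person` fields, the `link_records` function and the similarity threshold of 0.85 are all hypothetical choices made for illustration, and the sketch assumes that both sources record a year of death and a place.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    """Hypothetical minimal record; real registers carry many more fields."""
    record_id: str
    name: str
    death_year: int
    place: str


def normalize(name: str) -> str:
    # Lowercase, drop full stops and collapse whitespace before comparing.
    return " ".join(name.lower().replace(".", " ").split())


def name_similarity(a: str, b: str) -> float:
    # Ratio between 0.0 and 1.0 from the standard library's SequenceMatcher.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(tax_records, civic_records, threshold=0.85):
    """Map each tax record id to the best-matching civic record id, if any.

    Candidates must share year of death and place (an assumed blocking rule);
    the best candidate is accepted only if its name similarity reaches the
    illustrative threshold.
    """
    links = {}
    for tax in tax_records:
        candidates = [
            civic for civic in civic_records
            if civic.death_year == tax.death_year
            and civic.place.lower() == tax.place.lower()
        ]
        best = max(
            candidates,
            key=lambda civic: name_similarity(tax.name, civic.name),
            default=None,
        )
        if best is not None and name_similarity(tax.name, best.name) >= threshold:
            links[tax.record_id] = best.record_id
    return links
```

Restricting candidates to the same year and place before comparing names keeps the number of comparisons small and reduces false matches between people who merely share a common name. In practice, spelling variants and incomplete records would call for more careful matching and manual checks.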
Data routes and nodes
Discussions subsequently shifted to a more abstract perspective on FAIRness. Menzo Windhouwer (KNAW-Humanities Cluster) informed the audience of the technical workflow which CLARIAH is developing to channel data to its infrastructure, making them findable and accessible for humanities researchers to work with. In particular, there are different sources from which relevant data can originate, which require different technical provisions for metadata to arrive in a central place, the CLARIAH Data Registry. The workflow thus distinguishes between various routes with different metadata standards and communication protocols. Angelica Maineri (ODISSEI / EUR) then took over, opening up a discussion of the organizational features of implementing FAIR data management in the humanities and the social sciences. She observed that there is a multitude of stakeholders involved in generating and publishing research data, which calls for a careful distinction of who is responsible for which aspect of FAIR data management. The ensuing debate centered on how a research community might be defined, and where the nodes can be located where data flows converge.
Chatting with data
Finally, Slava Tykhonov (DANS) gave the audience a sneak preview of future developments. He presented his experiments with connecting Large Language Models to the knowledge graph created on top of the metadata records of datasets deposited in Dataverse repositories. In this way, an end user would be able to ‘chat’ with a data repository, querying and potentially combining datasets using everyday language. At the same time, it gives Large Language Models a source of validated knowledge and provenance information, making it easier for end users to assess the value of the responses they get when interacting with chatbots built on those Large Language Models.
For more information, you can contact Jetze Touber.