Croissant helps standardise ML datasets

8 March 2024

On 6 March 2024, MLCommons (an Artificial Intelligence engineering consortium) announced the release of Croissant, a metadata format to help standardise machine learning (ML) datasets. The aim of Croissant is to make datasets easily discoverable and usable across tools and platforms. This is highly relevant to the European Open Science Cloud (EOSC) tasks on FAIR data sustainability and important for Linked Data in general.

Data is at the core of every Artificial Intelligence (AI) and ML model. However, there is currently no standardised method of organising and arranging the data and files that make up each dataset. As a result, finding, understanding, and using ML datasets can be tedious and time-consuming. One of the goals of Croissant is to make data more easily accessible and discoverable. 

The Croissant standard was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. We contributed to the development of the Croissant specification and provided input on FAIR-related issues, provenance information and Responsible AI. This Croissant release includes the format documentation, an open source library, and a visual editor, with industry support from Hugging Face, Google Dataset Search, Kaggle, and OpenML, amongst others.

The Croissant vocabulary is an extension to schema.org, a machine-readable standard for describing structured data that is used by more than 40 million datasets on the web and that makes those datasets discoverable through dataset search engines such as Google Dataset Search. Croissant is easy to adopt because it doesn’t require changing the data itself or how it is represented. Instead, Croissant adds a layer of metadata that represents the contents of the dataset in a standardised way, describing key attributes and properties.
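To give a rough idea of what such a metadata layer can look like, the sketch below builds a simplified, Croissant-style JSON-LD description of a single CSV file in Python. It is an illustration only, not a validated Croissant file: the dataset name, the URL, the `label` column and the helper function `build_croissant_sketch` are hypothetical, and the actual specification defines a fuller `@context` and further properties than are shown here.

```python
import json


def build_croissant_sketch(name, description, csv_url):
    """Return a simplified, Croissant-style JSON-LD description of one CSV file.

    Assumption: this is a reduced illustration of the format, not a complete
    or validated Croissant document.
    """
    return {
        # Namespaces: schema.org for general dataset terms, "cr" for Croissant terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        # Resource layer: where the files live. The data itself is not changed.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure layer: how records and fields map onto the files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "records",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "records/label",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": "label"},
                        },
                    }
                ],
            }
        ],
    }


# Hypothetical dataset used purely for illustration.
metadata = build_croissant_sketch(
    name="example-survey",
    description="Hypothetical example dataset.",
    csv_url="https://example.org/data.csv",
)
print(json.dumps(metadata, indent=2))
```

The point of the sketch is that the description sits alongside the data: the CSV file is only referenced by URL, while the metadata states which files exist and how fields map onto their columns.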

Benefits for Data Stations and DataverseNL

DANS, in collaboration with the Harvard Institute for Quantitative Social Science (IQSS), provides Croissant support in Dataverse. This enhancement is funded by the ODISSEI and SSHOC-NL projects and will be available in the next release for all partners within the Dataverse network, as well as for the DANS Data Stations and DataverseNL.

We believe that this functionality will enhance interoperability, allowing data produced by academic and industrial parties and stored in different places to work together seamlessly. The addition of a semantic layer, aligned with FAIR principles, will further help improve data quality in the long run. It will also provide academic researchers with access to industry data, enabling proper citations.

Croissant supports the DANS Data Stations by cataloguing, integrating, and enhancing machine learning, AI, and manual data enrichment tools. This significantly increases the opportunities for SSH researchers with varying levels of technical expertise, among others, to use automatic enrichment tools in a FAIR and methodologically sound manner.

Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organisations: Bayer, cTuning Foundation, DANS, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King’s College London – Open Data Institute, Meta, NASA, NASA IMPACT – UAH, North Carolina State University, Open University of Catalonia – Luxembourg Institute of Science and Technology, Sage Bionetworks, and TU Eindhoven.

If you want to know more about the tool, go to the website of MLCommons or Google Dataset Search. [1]

[1] Benjelloun, O.; Simperl, E.; Marcenac, P.; Ruyssen, P.; Conforti, C.; Kuchnik, M.; Van der Velde, J.; Oala, L.; Vogler, S.; Akhtar, M.; Jain, N.; Tykhonov, V. (2024) Croissant Format Specification. V1.0. Permalink

Do you have questions about this news item?

Vyacheslav Tykhonov M.Sc.

Research & Development Engineer