Developments in the Croissant Machine Learning Format
Today, on ‘International Croissant Day’, we look at recent developments in Croissant ML, a metadata format designed to standardise how machine learning (ML) datasets are described. What is new, and what does the near future hold?
What is Croissant?
On 6 March 2024, MLCommons introduced Croissant. This initiative, a collaboration between various organisations and academic institutions, including DANS, aims to make datasets easier to find and use, an essential step for projects like the European Open Science Cloud (EOSC) and Linked Data. Several major updates have followed since.
Croissant builds on schema.org, a standard already used by millions of datasets. The format adds an extra layer of metadata that describes datasets in a standardised way without altering the underlying data itself. This enables datasets to be easily discovered through platforms like Google Dataset Search.
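To make the idea of an "extra layer of metadata" concrete, here is a minimal sketch of what a Croissant-style description of a single CSV file could look like when built in Python. The dataset name, URL, and column names are made up for illustration, the `describe_dataset` helper is hypothetical, and the JSON-LD context is a simplified subset of the official one, so treat this as an approximation of the format rather than a complete or validated Croissant file.

```python
import json

# Simplified JSON-LD context (assumption: the official Croissant context
# defines many more terms than the handful mapped here).
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
    "conformsTo": "dct:conformsTo",
    "recordSet": "cr:recordSet",
    "field": "cr:field",
    "dataType": "cr:dataType",
    "source": "cr:source",
    "fileObject": "cr:fileObject",
    "extract": "cr:extract",
    "column": "cr:column",
}


def describe_dataset(name, description, file_url, columns):
    """Build a Croissant-style description of one CSV file (hypothetical helper).

    `columns` maps column names to data types such as "sc:Text".
    The data file itself is only referenced, never modified.
    """
    return {
        "@context": CONTEXT,
        # Plain schema.org layer: what dataset search engines already read.
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        # Croissant layer: where the files are and how records are laid out.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "rows",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"rows/{column}",
                        "name": column,
                        "dataType": data_type,
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": column},
                        },
                    }
                    for column, data_type in columns.items()
                ],
            }
        ],
    }


metadata = describe_dataset(
    name="example-survey",
    description="Illustrative survey responses (made-up dataset).",
    file_url="https://example.org/data.csv",
    columns={"respondent_id": "sc:Integer", "answer": "sc:Text"},
)
print(json.dumps(metadata, indent=2))
```

The top-level `name` and `description` are ordinary schema.org properties, while `distribution` and `recordSet` carry the additional Croissant information about files and fields.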
DANS, in collaboration with the Harvard Institute for Quantitative Social Science, has integrated Croissant support into Dataverse. This enhancement, funded by the ODISSEI and SSHOC-NL projects, is now available to all Dataverse partners, including the DANS Data Stations and DataverseNL.
Recent developments
Since the launch of Croissant, several new developments have emerged:
- Publication of the Croissant Specification
The specification includes a comprehensive vocabulary and an open-source Python library to validate, generate, and consume Croissant metadata.
- Adoption by Dataset Repositories
Platforms such as Kaggle, Hugging Face, and OpenML now support the Croissant format, increasing interoperability and adoption within the ML community.
- Integration with ML Frameworks
TensorFlow, PyTorch, and other frameworks now support Croissant through packages like TensorFlow Datasets (TFDS).
- Controlled vocabulary support
DANS is collaborating with other partners to enhance external controlled vocabulary support in Croissant, enabling the semi-automated linkage of concepts from existing vocabularies available in the European Open Science Cloud (EOSC) and published on platforms such as Skosmos, OntoPortal, Getty, and others. This contribution will increase FAIR interoperability for all Croissant datasets and improve the data landscape in both industry and academia. It will also give machine learning frameworks more context about the data, which should improve the quality and effectiveness of their outputs.
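As a rough illustration of what "semi-automated linkage" of concepts could mean in practice, the sketch below matches free-text dataset keywords against the labels of a small in-memory vocabulary and turns the matches into schema.org `DefinedTerm` entries. Everything here is an assumption made for the example: the vocabulary, its `example.org` URIs, and the `link_keywords` helper are hypothetical. This is not the approach used by DANS or by Croissant itself, and a real setup would query a vocabulary service rather than a hard-coded dictionary.

```python
# Toy stand-in for a SKOS-style vocabulary (hypothetical concepts and URIs).
VOCABULARY = {
    "https://example.org/vocab/survey": {
        "prefLabel": "Survey",
        "altLabels": ["questionnaire"],
    },
    "https://example.org/vocab/machine-learning": {
        "prefLabel": "Machine learning",
        "altLabels": ["ML"],
    },
}


def normalise(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def build_index(vocabulary):
    """Map every normalised preferred or alternative label to its concept URI."""
    index = {}
    for uri, concept in vocabulary.items():
        for label in [concept["prefLabel"], *concept["altLabels"]]:
            index[normalise(label)] = uri
    return index


def link_keywords(keywords, vocabulary):
    """Link keywords to concepts by exact label match.

    Returns (linked, unmatched): matches become schema.org DefinedTerm
    dictionaries, and everything else is left for a human curator, which
    is what makes the process semi-automated rather than automatic.
    """
    index = build_index(vocabulary)
    linked, unmatched = [], []
    for keyword in keywords:
        uri = index.get(normalise(keyword))
        if uri is None:
            unmatched.append(keyword)
        else:
            linked.append(
                {
                    "@type": "DefinedTerm",
                    "name": vocabulary[uri]["prefLabel"],
                    "url": uri,
                }
            )
    return linked, unmatched


linked, unmatched = link_keywords(
    ["Questionnaire", "  machine   learning ", "croissants"], VOCABULARY
)
print(linked)
print(unmatched)
```

Exact label matching is deliberately conservative: it avoids wrong links at the cost of leaving more keywords for manual review.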
Innovation in MuseIT: Croissant helps interpret transcripts
One of the most exciting developments is within the MuseIT project, where Croissant is being used to correct messy video transcripts and place them in the correct context. This application demonstrates how Croissant enables AI to achieve some level of “understanding”, particularly regarding people, organisations, and locations mentioned in videos.
During a presentation at the consortium meeting of the Horizon 2020-funded MuseIT project in London, examples were shown of how transcripts were automatically corrected and stored in Dataverse.
The Future: voice-driven search
As the next step, a voice-driven search feature is being introduced. This will allow users to ask questions directly, with AI selecting and playing relevant video clips in response. This marks a new dimension in data interaction, making video content more accessible than ever.
The ongoing development of Croissant highlights how standardisation plays a crucial role in advancing AI innovation. By making datasets more accessible, understandable, and usable, Croissant opens the door to new opportunities in research, education, and industry.